symbolicai 1.3.0__tar.gz → 1.5.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {symbolicai-1.3.0 → symbolicai-1.5.0}/AGENTS.md +1 -1
- {symbolicai-1.3.0 → symbolicai-1.5.0}/PKG-INFO +4 -1
- {symbolicai-1.3.0 → symbolicai-1.5.0}/docs/source/ENGINES/indexing_engine.md +50 -8
- symbolicai-1.5.0/docs/source/ENGINES/scrape_engine.md +143 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/docs/source/ENGINES/search_engine.md +72 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/pyproject.toml +4 -1
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/__init__.py +1 -1
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/index/engine_qdrant.py +222 -10
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/scrape/engine_requests.py +39 -10
- symbolicai-1.5.0/symai/backend/engines/search/__init__.py +13 -0
- symbolicai-1.5.0/symai/backend/engines/search/engine_firecrawl.py +333 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/search/engine_parallel.py +5 -5
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/components.py +9 -3
- symbolicai-1.5.0/symai/extended/interfaces/firecrawl.py +30 -0
- symbolicai-1.5.0/symai/extended/interfaces/local_search.py +57 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/interfaces/parallel.py +5 -5
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/functional.py +3 -4
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symbolicai.egg-info/PKG-INFO +4 -1
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symbolicai.egg-info/SOURCES.txt +6 -1
- symbolicai-1.5.0/symbolicai.egg-info/dependency_links.txt +1 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symbolicai.egg-info/requires.txt +4 -0
- symbolicai-1.5.0/tests/data/symmetry_breaking.pdf +0 -0
- symbolicai-1.5.0/uv.lock +9197 -0
- symbolicai-1.3.0/docs/source/ENGINES/scrape_engine.md +0 -43
- symbolicai-1.3.0/symai/misc/__init__.py +0 -0
- symbolicai-1.3.0/uv.lock +0 -7673
- {symbolicai-1.3.0 → symbolicai-1.5.0}/.gitbook.yaml +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/.github/FUNDING.yml +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/.gitignore +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/.symai/symsh.config.json +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/CITATION.cff +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/Dockerfile +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/LICENSE +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/MANIFEST.in +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/README.md +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/app.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/assets/images/banner.png +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/assets/images/cat.jpg +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/assets/images/cat.png +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/assets/images/contract_flow.png +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/assets/images/img1.png +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/assets/images/img10.png +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/assets/images/img2.png +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/assets/images/img3.png +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/assets/images/img4.png +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/assets/images/img5.png +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/assets/images/img6.png +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/assets/images/img7.png +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/assets/images/img8.png +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/assets/images/img9.png +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/assets/images/preview.gif +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/assets/images/screen1.jpeg +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/assets/images/symai_logo.png +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/assets/images/symsh.png +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/assets/images/vid1.png +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/assets/images/vid2.png +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/assets/images/vid3.png +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/assets/images/vid4.png +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/assets/images/vid5.png +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/assets/images/vid6.png +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/assets/results/news.html +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/assets/results/news.png +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/assets/results/news_prev.png +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/bin/install.ps1 +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/bin/install.sh +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/build.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/docker-compose.yml +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/docs/source/ENGINES/clip_engine.md +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/docs/source/ENGINES/custom_engine.md +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/docs/source/ENGINES/drawing_engine.md +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/docs/source/ENGINES/file_engine.md +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/docs/source/ENGINES/local_engine.md +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/docs/source/ENGINES/neurosymbolic_engine.md +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/docs/source/ENGINES/ocr_engine.md +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/docs/source/ENGINES/speech_to_text_engine.md +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/docs/source/ENGINES/symbolic_engine.md +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/docs/source/FEATURES/contracts.md +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/docs/source/FEATURES/error_handling.md +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/docs/source/FEATURES/expressions.md +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/docs/source/FEATURES/import.md +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/docs/source/FEATURES/operations.md +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/docs/source/FEATURES/primitives.md +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/docs/source/INSTALLATION.md +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/docs/source/INTRODUCTION.md +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/docs/source/QUICKSTART.md +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/docs/source/SUMMARY.md +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/docs/source/TOOLS/chatbot.md +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/docs/source/TOOLS/packages.md +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/docs/source/TOOLS/shell.md +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/docs/source/TUTORIALS/chatbot.md +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/docs/source/TUTORIALS/context.md +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/docs/source/TUTORIALS/data_query.md +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/docs/source/TUTORIALS/video_tutorials.md +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/environment.yml +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/examples/contracts.ipynb +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/examples/primitives.ipynb +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/icon_converter.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/installer.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/legacy/notebooks/Basics.ipynb +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/legacy/notebooks/ChatBot.ipynb +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/legacy/notebooks/Conversation.ipynb +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/legacy/notebooks/Indexer.ipynb +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/legacy/notebooks/News.ipynb +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/legacy/notebooks/Queries.ipynb +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/legacy/notebooks/TTS_Persona.ipynb +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/legacy/notebooks/examples/Lean engine.png +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/legacy/notebooks/examples/a_star.txt +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/legacy/notebooks/examples/abstract.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/legacy/notebooks/examples/audio.mp3 +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/legacy/notebooks/examples/dbpedia_samples.jsonl +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/legacy/notebooks/examples/dbpedia_samples_prepared_train.jsonl +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/legacy/notebooks/examples/dbpedia_samples_prepared_valid.jsonl +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/legacy/notebooks/examples/demo.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/legacy/notebooks/examples/demo_strategy.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/legacy/notebooks/examples/docs.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/legacy/notebooks/examples/einsteins_puzzle.txt +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/legacy/notebooks/examples/file.json +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/legacy/notebooks/examples/lean.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/legacy/notebooks/examples/news.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/legacy/notebooks/examples/paper.pdf +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/legacy/notebooks/examples/paper.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/legacy/notebooks/examples/sql.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/public/eai.svg +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/pytest.ini +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/ruff.toml +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/setup.cfg +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/setup.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/TERMS_OF_SERVICE.md +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/__init__.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/base.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/__init__.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/drawing/engine_bfl.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/drawing/engine_gpt_image.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/embedding/engine_llama_cpp.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/embedding/engine_openai.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/execute/engine_python.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/files/engine_io.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/imagecaptioning/engine_blip2.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/imagecaptioning/engine_llavacpp_client.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/index/engine_pinecone.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/index/engine_vectordb.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/lean/engine_lean4.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/neurosymbolic/__init__.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/neurosymbolic/engine_anthropic_claudeX_chat.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/neurosymbolic/engine_anthropic_claudeX_reasoning.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/neurosymbolic/engine_cerebras.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/neurosymbolic/engine_deepseekX_reasoning.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/neurosymbolic/engine_google_geminiX_reasoning.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/neurosymbolic/engine_groq.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/neurosymbolic/engine_huggingface.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/neurosymbolic/engine_llama_cpp.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/neurosymbolic/engine_openai_gptX_chat.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/neurosymbolic/engine_openai_gptX_reasoning.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/neurosymbolic/engine_openai_responses.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/ocr/engine_apilayer.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/output/engine_stdout.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/search/engine_openai.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/search/engine_perplexity.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/search/engine_serpapi.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/speech_to_text/engine_local_whisper.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/symbolic/engine_wolframalpha.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/text_to_speech/engine_openai.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/text_vision/engine_clip.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/engines/userinput/engine_console.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/mixin/__init__.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/mixin/anthropic.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/mixin/cerebras.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/mixin/deepseek.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/mixin/google.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/mixin/groq.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/mixin/openai.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/backend/settings.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/chat.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/collect/__init__.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/collect/dynamic.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/collect/pipeline.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/collect/stats.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/constraints.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/context.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/core.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/core_ext.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/endpoints/__init__py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/endpoints/api.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/exceptions.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/.DS_Store +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/__init__.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/api_builder.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/arxiv_pdf_parser.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/bibtex_parser.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/conversation.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/document.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/file_merger.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/graph.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/html_style_template.py +0 -0
- {symbolicai-1.3.0/symai/server → symbolicai-1.5.0/symai/extended/interfaces}/__init__.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/interfaces/blip_2.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/interfaces/clip.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/interfaces/console.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/interfaces/dall_e.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/interfaces/file.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/interfaces/flux.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/interfaces/gpt_image.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/interfaces/input.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/interfaces/llava.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/interfaces/naive_scrape.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/interfaces/naive_vectordb.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/interfaces/ocr.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/interfaces/openai_search.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/interfaces/perplexity.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/interfaces/pinecone.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/interfaces/python.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/interfaces/serpapi.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/interfaces/terminal.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/interfaces/tts.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/interfaces/whisper.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/interfaces/wolframalpha.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/metrics/__init__.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/metrics/similarity.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/os_command.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/packages/__init__.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/packages/symdev.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/packages/sympkg.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/packages/symrun.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/repo_cloner.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/seo_query_optimizer.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/solver.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/summarizer.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/taypan_interpreter.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/extended/vectordb.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/formatter/__init__.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/formatter/emoji.pytxt +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/formatter/formatter.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/formatter/regex.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/imports.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/interfaces.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/memory.py +0 -0
- {symbolicai-1.3.0/symai/extended/interfaces → symbolicai-1.5.0/symai/menu}/__init__.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/menu/screen.py +0 -0
- {symbolicai-1.3.0/symai/menu → symbolicai-1.5.0/symai/misc}/__init__.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/misc/console.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/misc/loader.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/models/__init__.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/models/base.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/models/errors.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/ops/__init__.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/ops/measures.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/ops/primitives.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/post_processors.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/pre_processors.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/processor.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/prompts.py +0 -0
- /symbolicai-1.3.0/symbolicai.egg-info/dependency_links.txt → /symbolicai-1.5.0/symai/server/__init__.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/server/huggingface_server.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/server/llama_cpp_server.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/server/qdrant_server.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/shell.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/shellsv.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/strategy.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/symbol.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/symsh.md +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symai/utils.py +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symbolicai.egg-info/entry_points.txt +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/symbolicai.egg-info/top_level.txt +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/tests/README.md +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/tests/data/audio.mp3 +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/tests/data/pg1727.txt +0 -0
- {symbolicai-1.3.0 → symbolicai-1.5.0}/trusted_repos.yml +0 -0
```diff
--- symbolicai-1.3.0/AGENTS.md
+++ symbolicai-1.5.0/AGENTS.md
@@ -66,7 +66,7 @@ CLI entrypoints (after install): `symchat`, `symsh`, `symconfig`, `symserver`.
 - Treat type hints as contracts; do not add runtime type checks except at trust boundaries (CLI/env, JSON/network, disk).
 - Prefer minimal diffs; edit existing code over adding new files unless necessary.
 - Do not add/modify `tests/` or run tests unless explicitly requested; if requested, run the narrowest relevant `pytest` command.
-- When you change Python files
+- When you change Python files outside `tests/`: run `ruff check <changed_files> --output-format concise --config ruff.toml` and fix issues.
 - Keep search local-first (`rg`); follow imports instead of repo-wide “random scanning”.
 - If adding a regex, include a short comment explaining what it matches.
 - Update `TODO.md` when tasks are completed, added, or re-scoped.
```
```diff
--- symbolicai-1.3.0/PKG-INFO
+++ symbolicai-1.5.0/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: symbolicai
-Version: 1.3.0
+Version: 1.5.0
 Summary: A Neurosymbolic Perspective on Large Language Models
 Author-email: Marius-Constantin Dinu <marius@extensity.ai>, Leoveanu-Condrei Claudiu <leo@extensity.ai>
 License: BSD 3-Clause License
@@ -113,6 +113,7 @@ Requires-Dist: openai-whisper>=20240930; extra == "whisper"
 Requires-Dist: numba>=0.62.1; extra == "whisper"
 Requires-Dist: llvmlite>=0.45.1; extra == "whisper"
 Provides-Extra: search
+Requires-Dist: firecrawl-py>=4.12.0; extra == "search"
 Requires-Dist: parallel-web>=0.3.3; extra == "search"
 Provides-Extra: serpapi
 Requires-Dist: google_search_results>=2.4.2; extra == "serpapi"
@@ -136,6 +137,8 @@ Requires-Dist: symbolicai[serpapi]; extra == "all"
 Requires-Dist: symbolicai[services]; extra == "all"
 Requires-Dist: symbolicai[solver]; extra == "all"
 Requires-Dist: symbolicai[qdrant]; extra == "all"
+Provides-Extra: dev
+Requires-Dist: pytest-asyncio>=1.3.0; extra == "dev"
 Dynamic: license-file
 
 # **SymbolicAI: A neuro-symbolic perspective on LLMs**
```
````diff
--- symbolicai-1.3.0/docs/source/ENGINES/indexing_engine.md
+++ symbolicai-1.5.0/docs/source/ENGINES/indexing_engine.md
@@ -31,19 +31,22 @@ The Qdrant engine provides a production-ready vector database for scalable RAG a
 
 ### Setup
 
-#### Option 1: Local Qdrant Server
+#### Option 1: Local Qdrant Server (via symserver)
 
-Start
+Start Qdrant using the `symserver` CLI (Docker by default).
 
 ```bash
-#
-
+# Pull the image once (recommended)
+docker pull qdrant/qdrant:latest
 
-#
-
+# Docker (default): set INDEXING_ENGINE so symserver selects Qdrant
+INDEXING_ENGINE=qdrant symserver --host 0.0.0.0 --port 6333 --storage-path ./qdrant_storage
 
-#
-
+# Use native binary
+INDEXING_ENGINE=qdrant symserver --env binary --binary-path /path/to/qdrant --port 6333 --storage-path ./qdrant_storage
+
+# Detach Docker if desired
+INDEXING_ENGINE=qdrant symserver --docker-detach
 ```
 
 #### Option 2: Cloud Qdrant
@@ -103,6 +106,43 @@ async def basic_usage():
 asyncio.run(basic_usage())
 ```
 
+### Local Search with citations
+
+If you need citation-formatted results compatible with `parallel.search`, use the `local_search` interface. It embeds the query locally, queries Qdrant, and returns a `SearchResult` (with `value` and `citations`) instead of raw `ScoredPoint` objects:
+
+Local search accepts the same args as passed to Qdrant directly: `collection_name`/`index_name`, `limit`/`top_k`/`index_top_k`, `score_threshold`, `query_filter` (dict or Qdrant `Filter`), and any extra Qdrant search kwargs. Citation fields are derived from Qdrant payloads: the excerpt uses `payload["text"]` (or `content`), the URL is resolved from `payload["source"]`/`url`/`file_path`/`path` and is always returned as an absolute `file://` URI (relative inputs resolve against the current working directory), and the title is the stem of that path (PDF pages append `#p{page}` when provided). Each matching chunk yields its own citation; multiple citations can point to the same file.
+
+If you want a stable source header for each chunk, store a `source_id` or `chunk_id` in the payload (otherwise the Qdrant point id is used).
+
+Example:
+
+```python
+from symai.interfaces import Interface
+from qdrant_client.http import models
+
+search = Interface("local_search", index_name="my_collection")
+
+qdrant_filter = models.Filter(
+    must=[
+        models.FieldCondition(key="category", match=models.MatchValue(value="AI"))
+    ]
+)
+
+result = search.search(
+    "neural networks and transformers",
+    collection_name="my_collection",  # alias: index_name
+    limit=5,  # aliases: top_k, index_top_k
+    score_threshold=0.35,
+    query_filter=qdrant_filter,  # or a simple dict like {"category": "AI"}
+    with_payload=True,  # passed through to Qdrant query_points
+    with_vectors=False,  # optional; defaults follow engine config
+    # any other Qdrant query_points kwargs can be added here
+)
+
+print(result.value)  # formatted text with [1], [2] markers
+print(result.get_citations())  # list of Citation objects
+```
+
 ### Collection Management
 
 Create and manage collections programmatically:
@@ -156,6 +196,8 @@ async def add_documents():
         document_path="/path/to/document.pdf",
         metadata={"source": "document.pdf"}
     )
+    # Note: document_path indexing stores the absolute file path in payload["source"]
+    # so local_search citations resolve to file:// URIs.
 
     # Chunk and index from a URL
     num_chunks = await engine.chunk_and_upsert(
````
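The citation fields described in the new `local_search` docs are derived purely from each chunk's Qdrant payload. A minimal sketch of how a payload might be shaped so those documented rules apply; the collection, path, and values here are hypothetical, and the URL/title lines simply mirror the stated resolution rules with plain `pathlib`:

```python
from pathlib import Path

# Hypothetical chunk payload: per the docs above, "text" becomes the excerpt,
# "source" resolves to an absolute file:// URI, "page" appends #p{page} for PDFs,
# and "chunk_id" provides a stable source header.
payload = {
    "text": "Transformers rely on self-attention to mix token information...",
    "source": "data/symmetry_breaking.pdf",  # relative; resolves against the CWD
    "page": 3,
    "chunk_id": "symmetry_breaking-p3-c1",
}

url = Path(payload["source"]).resolve().as_uri()              # file:///.../symmetry_breaking.pdf
title = f"{Path(payload['source']).stem}#p{payload['page']}"  # symmetry_breaking#p3
print(url, title)
```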
````diff
--- /dev/null
+++ symbolicai-1.5.0/docs/source/ENGINES/scrape_engine.md
@@ -0,0 +1,143 @@
+# Scrape Engine
+
+## Naive Scrape
+
+To access data from the web, we can use the `naive_scrape` interface. The engine underneath is very lightweight and can be used to scrape data from websites. It is based on the `requests` library, as well as `trafilatura` for output formatting, and `bs4` for HTML parsing. `trafilatura` currently supports the following output formats: `json`, `csv`, `html`, `markdown`, `text`, `xml`
+
+```python
+from symai.interfaces import Interface
+
+scraper = Interface("naive_scrape")
+url = "https://docs.astral.sh/uv/guides/scripts/#next-steps"
+res = scraper(url)
+```
+
+## Parallel (Parallel.ai)
+
+The Parallel.ai integration routes scrape calls through the official `parallel-web` SDK and can handle PDFs, JavaScript-heavy feeds, and standard HTML pages in the same workflow. Instantiate the Parallel interface and call `.scrape(...)` with the target URL. The engine detects scrape requests automatically whenever a URL is supplied.
+
+```python
+from symai.extended import Interface
+
+scraper = Interface("parallel")
+article = scraper.scrape(
+    "https://trafilatura.readthedocs.io/en/latest/crawls.html",
+    full_content=True,  # optional: request full document text
+    excerpts=True,  # optional: default True, retain excerpt snippets
+    objective="Summarize crawl guidance for internal notes."
+)
+print(str(article))
+```
+
+Configuration requires a Parallel API key and the Parallel model token. Add the following to your settings:
+
+```bash
+{
+    …
+    "SEARCH_ENGINE_API_KEY": "…",
+    "SEARCH_ENGINE_MODEL": "parallel"
+    …
+}
+```
+
+When invoked with a URL, the engine hits Parallel's Extract API and returns an `ExtractResult`. The result string joins excerpts or the full content if requested. Because processing is offloaded to Parallel's hosted infrastructure, the engine remains reliable on dynamic pages that the naive scraper cannot render. Install the dependency with `pip install parallel-web` before enabling this engine.
+
+## Firecrawl
+
+Firecrawl.dev specializes in reliable web scraping with automatic handling of JavaScript-rendered content, proxies, and anti-bot mechanisms. It converts web pages into clean formats suitable for LLM consumption and supports advanced features like actions, caching, and location-based scraping.
+
+### Examples
+
+```python
+from symai.extended import Interface
+
+scraper = Interface("firecrawl")
+
+# Example 1: Basic webpage scraping
+content = scraper.scrape(
+    "https://docs.firecrawl.dev/introduction",
+    formats=["markdown"]
+)
+print(content)
+
+# Example 2: PDF scraping with content extraction and trimming
+pdf_full = scraper.scrape(
+    "https://pmc.ncbi.nlm.nih.gov/articles/PMC7231600",
+    only_main_content=True,
+    formats=["markdown"],
+    proxy="auto"
+)
+# Trim locally if needed
+pdf_trimmed = str(pdf_full)[:100]
+
+# Note: JS-heavy sites like Twitter/LinkedIn are currently not fully supported
+# They typically return 403 Forbidden errors (may vary by subscription tier)
+```
+
+### Configuration
+
+Enable the engine by configuring Firecrawl credentials:
+
+```bash
+{
+    "SEARCH_ENGINE_API_KEY": "fc-your-api-key",
+    "SEARCH_ENGINE_MODEL": "firecrawl"
+}
+```
+
+### JSON Schema Extraction
+
+Firecrawl supports structured data extraction using JSON schemas. This is useful for extracting specific fields from web pages using LLM-powered extraction:
+
+```python
+from pydantic import Field
+from symai.extended import Interface
+from symai.models import LLMDataModel
+
+class MetadataModel(LLMDataModel):
+    """Bibliographic metadata extracted from a source document."""
+    title: str = Field(description="Title of the source.")
+    year: str = Field(description="Publication year (4 digits) or Unknown.")
+    authors: list[str] = Field(default_factory=list, description="List of authors.")
+    doi: str | None = Field(default=None, description="DOI if available.")
+
+# Build JSON format config from Pydantic schema
+schema = MetadataModel.model_json_schema()
+json_format = {
+    "type": "json",
+    "prompt": "Extract bibliographic metadata from this academic paper.",
+    "schema": schema,
+}
+
+scraper = Interface("firecrawl")
+result = scraper.scrape(
+    "https://journals.physiology.org/doi/full/10.1152/ajpregu.00051.2002",
+    formats=[json_format],
+    proxy="auto"
+)
+
+# Access extracted data as dict
+extracted = result.raw["json"]
+metadata = MetadataModel(**extracted)
+print(metadata.model_dump())
+
+# Or as JSON string
+print(str(result))
+```
+
+### Supported Parameters
+
+The engine supports many parameters (passed as kwargs). Common ones include:
+- **formats**: Output formats (["markdown"], ["html"], ["rawHtml"])
+- **only_main_content**: Extract main content only (boolean)
+- **proxy**: Proxy mode ("basic", "stealth", "auto")
+- **location**: Geographic location object with optional country and languages
+  - Example: `{"country": "US"}` or `{"country": "RO", "languages": ["ro"]}`
+- **maxAge**: Cache duration in seconds (integer)
+- **storeInCache**: Enable caching (boolean)
+- **actions**: Page interactions before scraping (list of action objects)
+  - Example: `[{"type": "wait", "milliseconds": 2000}]`
+  - Example: `[{"type": "click", "selector": ".button"}]`
+  - Example: `[{"type": "scroll", "direction": "down", "amount": 500}]`
+
+Check the Firecrawl v2 API documentation for the complete list of available parameters and action types.
````
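The new `scrape_engine.md` lists `trafilatura`'s output formats without showing them in use. A direct `trafilatura` sketch, not the symai interface (note the library spells the plain-text format `txt`; the URL is taken from the doc above):

```python
import trafilatura

# Fetch once, then extract into each of the formats the doc enumerates.
html = trafilatura.fetch_url("https://docs.astral.sh/uv/guides/scripts/#next-steps")
for fmt in ("json", "csv", "html", "markdown", "txt", "xml"):
    extracted = trafilatura.extract(html, output_format=fmt)
    print(fmt, len(extracted or ""))
```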
````diff
--- symbolicai-1.3.0/docs/source/ENGINES/search_engine.md
+++ symbolicai-1.5.0/docs/source/ENGINES/search_engine.md
@@ -152,3 +152,75 @@ Here's how to configure the OpenAI search engine:
 ```
 
 This engine calls the OpenAI Responses API under the hood. When you target a reasoning-capable model, pass a `reasoning` dictionary matching the Responses payload schema (for example `{"effort": "low", "summary": "auto"}`). If omitted, the engine falls back to the default effort/summary settings shown above.
+
+## Firecrawl
+Firecrawl.dev provides web scraping and search capabilities with built-in handling of dynamic JavaScript content and anti-bot mechanisms. The engine converts web pages into clean markdown and can perform web searches across multiple sources with advanced filtering and content extraction.
+
+### Comprehensive Search Example
+
+```python
+from symai.extended import Interface
+
+engine = Interface("firecrawl")
+
+# Example 1: Location-aware search with language, scraping, and citations
+result = engine.search(
+    "who is nicusor dan",
+    limit=5,
+    location="Romania",
+    lang="ro",
+    sources=["web"],
+    formats=["markdown"],
+    only_main_content=True,
+    proxy="stealth"
+)
+
+# Access structured citations (similar to parallel.ai)
+citations = result.get_citations()
+for citation in citations:
+    print(f"[{citation.id}] {citation.title}: {citation.url}")
+
+# Example 2: Domain-filtered search with character limits
+domains = ["arxiv.org", "nature.com"]
+filters = " OR ".join(f"site:{domain}" for domain in domains)
+query = f"({filters}) what is thermodynamic computing"
+
+result = engine.search(
+    query,
+    limit=10,
+    max_chars_per_result=500,
+    categories=["research"],
+    formats=["markdown"],
+    proxy="basic"
+)
+print(result)
+```
+
+### Configuration
+
+Enable the engine by configuring Firecrawl credentials:
+
+```bash
+{
+    "SEARCH_ENGINE_API_KEY": "fc-your-api-key",
+    "SEARCH_ENGINE_MODEL": "firecrawl"
+}
+```
+
+### Supported Parameters
+
+The engine supports many parameters (passed as kwargs). Common ones include:
+- **limit**: Max number of results
+- **location**: Country code string for search (e.g., "Romania", "Germany")
+- **lang**: Language code string for search (e.g., "ro", "es") - hint, not enforcement
+- **sources**: List of sources (["web"], ["news"], ["images"])
+- **categories**: Content types (["research"], ["github"], ["pdf"])
+- **tbs**: Time-based filter (e.g., "qdr:d" for past day)
+- **formats**: Output formats for scraped content (["markdown"], ["html"])
+- **only_main_content**: Extract main content only when scraping (boolean)
+- **max_chars_per_result**: Truncate results locally (integer)
+- **proxy**: Proxy mode for scraping ("basic", "stealth", "auto")
+- **scrape_location**: Location object for scraping with optional country and languages
+  - Example: `{"country": "US"}` or `{"country": "RO", "languages": ["ro"]}`
+
+Check the Firecrawl v2 API documentation for the complete list of available parameters.
````
```diff
--- symbolicai-1.3.0/pyproject.toml
+++ symbolicai-1.5.0/pyproject.toml
@@ -78,7 +78,7 @@ scrape = ["beautifulsoup4>=4.12.3", "trafilatura>=2.0.0", "pdfminer.six",
 llama_cpp = ["llama-cpp-python[server]>=0.3.7"] # handle separately since this dependency may not compile and require special maintenance
 wolframalpha = ["wolframalpha>=5.0.0"]
 whisper = ["openai-whisper>=20240930", "numba>=0.62.1", "llvmlite>=0.45.1"]
-search = ["parallel-web>=0.3.3"]
+search = ["firecrawl-py>=4.12.0", "parallel-web>=0.3.3"]
 serpapi = ["google_search_results>=2.4.2"]
 services = ["fastapi>=0.110.0", "redis>=5.0.2", "uvicorn>=0.27.1"]
 solver = ["z3-solver>=4.12.6.0"]
@@ -94,6 +94,9 @@ all = [
     "symbolicai[solver]",
     "symbolicai[qdrant]"
 ]
+dev = [
+    "pytest-asyncio>=1.3.0",
+]
 
 [tool.setuptools.dynamic]
 version = {attr = "symai.SYMAI_VERSION"}
```
```diff
--- symbolicai-1.3.0/symai/backend/engines/index/engine_qdrant.py
+++ symbolicai-1.5.0/symai/backend/engines/index/engine_qdrant.py
@@ -4,8 +4,10 @@ import tempfile
 import urllib.request
 import uuid
 import warnings
+from dataclasses import dataclass
 from pathlib import Path
 from typing import Any
+from urllib.parse import urlparse
 
 import numpy as np
 
@@ -148,6 +150,108 @@ Matches:
         return f"<ul>{doc_str}</ul>"
 
 
+@dataclass
+class Citation:
+    id: int
+    title: str
+    url: str
+    start: int
+    end: int
+
+    def __hash__(self):
+        return hash((self.url,))
+
+
+class SearchResult(Result):
+    def __init__(self, value: dict[str, Any] | Any, **kwargs) -> None:
+        super().__init__(value, **kwargs)
+        if isinstance(value, dict) and value.get("error"):
+            UserMessage(value["error"], raise_with=ValueError)
+        results = self._coerce_results(value)
+        text, citations = self._build_text_and_citations(results)
+        self._value = text
+        self._citations = citations
+
+    def _coerce_results(self, raw: Any) -> list[dict[str, Any]]:
+        if raw is None:
+            return []
+        results = raw.get("results", []) if isinstance(raw, dict) else getattr(raw, "results", None)
+        if not results:
+            return []
+        return [item for item in results if isinstance(item, dict)]
+
+    def _source_identifier(self, item: dict[str, Any], url: str) -> str:
+        for key in ("source_id", "sourceId", "sourceID", "id"):
+            raw = item.get(key)
+            if raw is None:
+                continue
+            text = str(raw).strip()
+            if text:
+                return text
+        path = Path(urlparse(url).path)
+        return path.name or path.as_posix() or url
+
+    def _build_text_and_citations(self, results: list[dict[str, Any]]):
+        pieces = []
+        citations = []
+        cursor = 0
+        cid = 1
+        separator = "\n\n---\n\n"
+
+        for item in results:
+            url = str(item.get("url") or "")
+            if not url:
+                continue
+
+            title = str(item.get("title") or "")
+            if not title:
+                path = Path(urlparse(url).path)
+                title = path.name or url
+
+            excerpts = item.get("excerpts") or []
+            excerpt_parts = [ex.strip() for ex in excerpts if isinstance(ex, str) and ex.strip()]
+            if not excerpt_parts:
+                continue
+
+            combined_excerpt = "\n\n".join(excerpt_parts)
+            source_id = self._source_identifier(item, url)
+            block_body = combined_excerpt if not source_id else f"{source_id}\n\n{combined_excerpt}"
+
+            if pieces:
+                pieces.append(separator)
+                cursor += len(separator)
+
+            opening_tag = "<source>\n"
+            pieces.append(opening_tag)
+            cursor += len(opening_tag)
+
+            pieces.append(block_body)
+            cursor += len(block_body)
+
+            closing_tag = "\n</source>"
+            pieces.append(closing_tag)
+            cursor += len(closing_tag)
+
+            marker = f"[{cid}]"
+            start = cursor
+            pieces.append(marker)
+            cursor += len(marker)
+
+            citations.append(Citation(id=cid, title=title or url, url=url, start=start, end=cursor))
+            cid += 1
+
+        return "".join(pieces), citations
+
+    def __str__(self) -> str:
+        return str(self._value or "")
+
+    def _repr_html_(self) -> str:
+        return f"<pre>{self._value or ''}</pre>"
+
+    def get_citations(self) -> list[Citation]:
+        return self._citations
+
+
 class QdrantIndexEngine(Engine):
     _default_url = "http://localhost:6333"
     _default_api_key = SYMAI_CONFIG.get("INDEXING_ENGINE_API_KEY", None)
```
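To see concretely what the new `SearchResult` builds, here is a small sketch that feeds it hand-made result dicts shaped like the ones `_format_search_results` emits (assuming the package and its index dependencies are importable; the URLs and excerpts are made up). The citation `start`/`end` offsets index the `[n]` markers inside the formatted text:

```python
from symai.backend.engines.index.engine_qdrant import SearchResult

raw = {
    "results": [
        {"url": "file:///tmp/a.pdf", "title": "a#p1",
         "excerpts": ["First chunk."], "source_id": "a-1"},
        {"url": "file:///tmp/b.txt", "title": "b",
         "excerpts": ["Second chunk."], "source_id": "b-1"},
    ]
}

res = SearchResult(raw)
print(res.value)  # <source> blocks joined by "---", each followed by its [n] marker

for c in res.get_citations():
    # Offsets were accumulated while building the text, so they recover the marker.
    assert res.value[c.start:c.end] == f"[{c.id}]"
    print(c.id, c.title, c.url)
```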
```diff
--- symbolicai-1.3.0/symai/backend/engines/index/engine_qdrant.py
+++ symbolicai-1.5.0/symai/backend/engines/index/engine_qdrant.py
@@ -421,15 +525,18 @@ class QdrantIndexEngine(Engine):
             kwargs["index_get"] = True
             self._configure_collection(**kwargs)
 
+        treat_as_search_engine = False
         if operation == "search":
             # Ensure collection exists - fail fast if it doesn't
             self._ensure_collection_exists(collection_name)
-
+            search_kwargs = dict(kwargs)
+            index_top_k = search_kwargs.pop("index_top_k", self.index_top_k)
             # Optional search parameters
-            score_threshold =
+            score_threshold = search_kwargs.pop("score_threshold", None)
             # Accept both `query_filter` and `filter` for convenience
-            raw_filter =
+            raw_filter = search_kwargs.pop("query_filter", search_kwargs.pop("filter", None))
             query_filter = self._build_query_filter(raw_filter)
+            treat_as_search_engine = bool(search_kwargs.pop("treat_as_search_engine", False))
 
             # Use shared search helper that already handles retries and normalization
             rsp = self._search_sync(
@@ -438,6 +545,7 @@ class QdrantIndexEngine(Engine):
                 limit=index_top_k,
                 score_threshold=score_threshold,
                 query_filter=query_filter,
+                **search_kwargs,
             )
         elif operation == "add":
             # Create collection if it doesn't exist (only for write operations)
@@ -462,7 +570,10 @@ class QdrantIndexEngine(Engine):
 
         metadata = {}
 
-
+        if operation == "search" and treat_as_search_engine:
+            rsp = self._format_search_results(rsp, collection_name)
+        else:
+            rsp = QdrantResult(rsp, query, embedding)
         return [rsp], metadata
 
     def prepare(self, argument):
```
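The hunks above route everything through one `treat_as_search_engine` flag: the search kwargs are copied and popped one by one, and whatever survives flows to `_search_sync` untouched. A standalone trace of that pop sequence (engine internals elided; the key names are the real ones from the diff, the dict literal is hypothetical, and presumably the new `local_search` interface is what sets the flag):

```python
kwargs = {"index_top_k": 5, "score_threshold": 0.35,
          "treat_as_search_engine": True, "with_payload": True}

search_kwargs = dict(kwargs)  # copy so the original kwargs stay intact
index_top_k = search_kwargs.pop("index_top_k", 8)  # 8 stands in for self.index_top_k
score_threshold = search_kwargs.pop("score_threshold", None)
raw_filter = search_kwargs.pop("query_filter", search_kwargs.pop("filter", None))
treat_as_search_engine = bool(search_kwargs.pop("treat_as_search_engine", False))

print(index_top_k, score_threshold, raw_filter, treat_as_search_engine)
# -> 5 0.35 None True
print(search_kwargs)  # {'with_payload': True}: forwarded verbatim to the search helper
```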
```diff
--- symbolicai-1.3.0/symai/backend/engines/index/engine_qdrant.py
+++ symbolicai-1.5.0/symai/backend/engines/index/engine_qdrant.py
@@ -513,7 +624,33 @@ class QdrantIndexEngine(Engine):
             jitter=self.jitter,
         )
         def _func():
+            qdrant_kwargs = dict(kwargs)
             query_vector_normalized = self._normalize_vector(query_vector)
+            with_payload = qdrant_kwargs.pop("with_payload", True)
+            with_vectors = qdrant_kwargs.pop("with_vectors", self.index_values)
+            # qdrant-client `query_points` is strict about extra kwargs and will assert if any
+            # unknown arguments are provided. Because our engine `forward()` passes decorator
+            # kwargs through the stack, we must drop engine-internal fields here.
+            #
+            # Keep only kwargs that `qdrant_client.QdrantClient.query_points` accepts (besides
+            # those we pass explicitly).
+            if "filter" in qdrant_kwargs and "query_filter" not in qdrant_kwargs:
+                # Convenience alias supported by our public API
+                qdrant_kwargs["query_filter"] = qdrant_kwargs.pop("filter")
+
+            allowed_qdrant_kwargs = {
+                "using",
+                "prefetch",
+                "query_filter",
+                "search_params",
+                "offset",
+                "score_threshold",
+                "lookup_from",
+                "consistency",
+                "shard_key_selector",
+                "timeout",
+            }
+            qdrant_kwargs = {k: v for k, v in qdrant_kwargs.items() if k in allowed_qdrant_kwargs}
             # For single vector collections, pass vector directly to query parameter
             # For named vector collections, use Query(near_vector=NamedVector(name="vector_name", vector=...))
             # query_points API uses query_filter (not filter) for filtering
@@ -521,9 +658,9 @@ class QdrantIndexEngine(Engine):
                 collection_name=collection_name,
                 query=query_vector_normalized,
                 limit=top_k,
-                with_payload=
-                with_vectors=
-                **
+                with_payload=with_payload,
+                with_vectors=with_vectors,
+                **qdrant_kwargs,
             )
             # query_points returns QueryResponse with .points attribute, extract it
             return response.points
```
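The allow-list in this hunk mirrors the keyword parameters of `qdrant_client.QdrantClient.query_points`. A quick sketch to cross-check the set against whatever client version is installed (assuming `qdrant-client` is available; an empty set means every allow-listed name is accepted by this client):

```python
import inspect

from qdrant_client import QdrantClient

allowed = {
    "using", "prefetch", "query_filter", "search_params", "offset",
    "score_threshold", "lookup_from", "consistency", "shard_key_selector", "timeout",
}

# Parameters the installed client actually declares on query_points.
params = set(inspect.signature(QdrantClient.query_points).parameters)

# Any name in the engine's allow-list that this client version does not accept.
print(allowed - params)
```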
```diff
--- symbolicai-1.3.0/symai/backend/engines/index/engine_qdrant.py
+++ symbolicai-1.5.0/symai/backend/engines/index/engine_qdrant.py
@@ -860,6 +997,82 @@ class QdrantIndexEngine(Engine):
         # Use _query which handles retry logic and vector normalization
         return self._query(collection_name, query_vector, limit, **search_kwargs)
 
+    def _resolve_payload_url(
+        self, payload: dict[str, Any], collection_name: str, point_id: Any
+    ) -> str:
+        source = (
+            payload.get("source")
+            or payload.get("url")
+            or payload.get("file_path")
+            or payload.get("path")
+        )
+        if isinstance(source, str) and source:
+            if source.startswith(("http://", "https://", "file://")):
+                return source
+
+            source_path = Path(source).expanduser()
+            try:
+                resolved = source_path.resolve()
+                if resolved.exists() or source_path.is_absolute():
+                    return resolved.as_uri()
+            except Exception:
+                return str(source_path)
+            return str(source_path)
+
+        return f"qdrant://{collection_name}/{point_id}"
+
+    def _resolve_payload_title(self, payload: dict[str, Any], url: str, page: Any) -> str:
+        raw_title = payload.get("title")
+        if isinstance(raw_title, str) and raw_title.strip():
+            base = raw_title.strip()
+        else:
+            parsed = urlparse(url)
+            path_part = parsed.path or url
+            base = Path(path_part).stem or url
+
+        try:
+            page_int = int(page) if page is not None else None
+        except (TypeError, ValueError):
+            page_int = None
+
+        if Path(urlparse(url).path).suffix.lower() == ".pdf" and page_int is not None:
+            base = f"{base}#p{page_int}"
+
+        return base
+
+    def _format_search_results(self, points: list[ScoredPoint] | None, collection_name: str):
+        results: list[dict[str, Any]] = []
+
+        for point in points or []:
+            payload = getattr(point, "payload", {}) or {}
+            text = payload.get("text") or payload.get("content")
+            if isinstance(text, list):
+                text = " ".join([t for t in text if isinstance(t, str)])
+            if not isinstance(text, str):
+                continue
+            excerpt = text.strip()
+            if not excerpt:
+                continue
+
+            page = payload.get("page") or payload.get("page_number") or payload.get("pageIndex")
+            url = self._resolve_payload_url(payload, collection_name, getattr(point, "id", ""))
+            title = self._resolve_payload_title(payload, url, page)
+
+            results.append(
+                {
+                    "url": url,
+                    "title": title,
+                    "excerpts": [excerpt],
+                    "source_id": payload.get("source_id")
+                    or payload.get("sourceId")
+                    or payload.get("chunk_id")
+                    or payload.get("chunkId")
+                    or getattr(point, "id", None),
+                }
+            )
+
+        return SearchResult({"results": results})
+
     async def search(
         self,
         collection_name: str,
```
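`_resolve_payload_url` is a three-step fallback ladder; traced by hand with plain `pathlib` (the paths and collection are hypothetical):

```python
from pathlib import Path

# Step 1: values that already look like URLs/URIs pass through unchanged.
print("https://example.com/doc".startswith(("http://", "https://", "file://")))  # True

# Step 2: filesystem paths are expanded, resolved, and converted to file:// URIs.
print(Path("~/notes/paper.pdf").expanduser().resolve().as_uri())

# Step 3: with no usable source in the payload, a synthetic Qdrant URI is built.
collection_name, point_id = "my_collection", 42
print(f"qdrant://{collection_name}/{point_id}")
```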
```diff
--- symbolicai-1.3.0/symai/backend/engines/index/engine_qdrant.py
+++ symbolicai-1.5.0/symai/backend/engines/index/engine_qdrant.py
@@ -923,7 +1136,7 @@ class QdrantIndexEngine(Engine):
             if tmp_path.exists():
                 tmp_path.unlink()
 
-    async def chunk_and_upsert(
+    async def chunk_and_upsert(
         self,
         collection_name: str,
         text: str | Symbol | None = None,
@@ -1001,8 +1214,7 @@ class QdrantIndexEngine(Engine):
             # Add source to metadata if not already present
             if metadata is None:
                 metadata = {}
-
-            metadata["source"] = doc_path.name
+            metadata["source"] = str(doc_path.resolve())
 
         # Handle document_url: download and read file using FileReader
         elif document_url is not None:
```