PyPI - patchvec - Versions diffs - 0.5.8__tar.gz → 0.5.8.1__tar.gz - Mend

patchvec-0.5.8.tar.gz → patchvec-0.5.8.1.tar.gz


Files changed (76)

{patchvec-0.5.8/patchvec.egg-info → patchvec-0.5.8.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: patchvec
-Version: 0.5.8
+Version: 0.5.8.1
 Summary: Patchvec — A lightweight, pluggable vector search microservice.
 Author: Rodrigo Rodrigues da Silva
 Author-email: rodrigo@flowlexi.com

{patchvec-0.5.8 → patchvec-0.5.8.1}/README.md RENAMED

@@ -1,31 +1,39 @@
 <!-- (C) 2025, 2026 Rodrigo Rodrigues da Silva <rodrigo@flowlexi.com> -->
 <!-- SPDX-License-Identifier: AGPL-3.0-or-later -->
-# 🍰 PatchVec — Lightweight, Pluggable Vector Search Microservice
-Patchvec is a compact vector store built for people who want provenance and fast
-iteration on RAG plumbing. No black boxes, no hidden pipelines: every chunk records
-document id, page, and byte offsets, and you can swap embeddings or storage backends per
-collection.
-## ⚙️ Core capabilities
-- **Docker images** — prebuilt CPU/GPU images published to the GitLab Container
-  Registry.
-- **Tenants and collections** — isolation by tenant with per-collection configuration.
-- **Pluggable embeddings** — choose the embedding adapter per collection; wire in local
-  or hosted models.
-- **REST and CLI** — production use over HTTP, quick experiments with the bundled CLI.
-- **Deterministic provenance** — every hit returns doc id, page, offset, and snippet for
-  traceability.
+# 🍰 PatchVec — Vector Search You Can Understand
+PatchVec is a single-process vector search engine that ingests your
+documents, chunks and embeds them, and gives you semantic search with
+full provenance — document id, page, character offset, and the exact
+snippet that matched. No cluster, no managed service, no
+opaque pipelines.
+Drop a file in, search it, see exactly what came back and why.
+## ⚙️ Why PatchVec
+- **Ingest files, not embeddings** — hand it a PDF, CSV, or TXT and
+  PatchVec chunks, embeds, and indexes it. No preprocessing pipeline
+  to build.
+- **Full provenance on every hit** — every search result traces back
+  to a document, page, and character offset. Latency and request
+  traceability are built into every response.
+- **Multi-tenant by default** — tenant/collection namespacing is
+  built in, not bolted on.
+- **REST, CLI, or embed it** — run as an HTTP service, script via
+  the CLI, or import the library directly in your Python app.
+- **Pluggable embeddings** — swap models per collection; wire in
+  local or hosted embedding backends.
 ## 🧭 Workflows
 ### 🐳 Docker workflow (prebuilt images)
-Pull the image that fits your hardware from the [https://gitlab.com/flowlexi](Flowlexi)
-Container Registry on Gitlab (CUDA builds publish as `latest-gpu`, CPU-only as `latest-
-cpu`).
+Pull the image that fits your hardware from the
+[Flowlexi Container Registry](https://gitlab.com/flowlexi/patchvec/container_registry)
+on GitLab (CUDA builds publish as `latest-gpu`, CPU-only as
+`latest-cpu`).
 ```bash
 docker pull registry.gitlab.com/flowlexi/patchvec/patchvec:latest-gpu
@@ -66,20 +74,20 @@ local configuration directory.
 **Requires Python 3.10–3.14.**
 ```bash
-mkdir -p ~/pv && cd ~/pv #or wherever
+mkdir -p ~/pv && cd ~/pv  # or wherever
 python -m venv .venv-pv
 source .venv-pv/bin/activate
 python -m pip install --upgrade pip
 pip install "patchvec[cpu]"
 # grab the default configs
-curl -LO https://raw.githubusercontent.com/patchvec/patchvec/main/config.yml.example
-curl -LO https://raw.githubusercontent.com/patchvec/patchvec/main/tenants.yml.example
+curl -LO https://raw.githubusercontent.com/rodrigopitanga/patchvec/main/config.yml.example
+curl -LO https://raw.githubusercontent.com/rodrigopitanga/patchvec/main/tenants.yml.example
 cp config.yml.example config.yml
 cp tenants.yml.example tenants.yml
 # sample demo corpus
-curl -LO https://raw.githubusercontent.com/patchvec/patchvec/main/demo/20k_leagues.txt
+curl -LO https://raw.githubusercontent.com/rodrigopitanga/patchvec/main/demo/20k_leagues.txt
 # point Patchvec at the config directory and set a local admin key
 export PATCHVEC_CONFIG="$HOME/pv/config.yml"
@@ -129,8 +137,34 @@ curl -H "Authorization: Bearer $PATCHVEC_GLOBAL_KEY" \
   "http://localhost:8086/collections/demo/books/search?q=captain+nemo&k=3"
 ```
-There is a simple Swagger UI available at the root of the server. Just point your
-browser to `http://localhost:8086/`
+Every hit comes back with provenance you can trace, plus latency
+and request id for observability:
+```json
+{
+  "matches": [
+    {
+      "id": "verne-20k::chunk_42",
+      "score": 0.82,
+      "text": "Captain Nemo conducted me to the central staircase ...",
+      "tenant": "demo",
+      "collection": "books",
+      "match_reason": "semantic",
+      "meta": {
+        "docid": "verne-20k",
+        "filename": "20k_leagues.txt",
+        "offset": 21000,
+        "lang": "en",
+        "ingested_at": "2026-03-07T12:00:00Z"
+      }
+    }
+  ],
+  "latency_ms": 12.4,
+  "request_id": "req-5f3a-b812"
+}
+```
+The Swagger UI is available at `http://localhost:8086/`.
 Health and metrics endpoints are available at `/health` and `/metrics`.
@@ -147,10 +181,15 @@ though), or explicitly delete the document and then ingest it again.
 CLI (re-ingest to replace):
 ```bash
-pavecli ingest demo books demo/20k_leagues.txt --docid=verne-20k
-cp demo/20k_leagues.txt demo/20k_leagues_mod.txt
-echo "THE END" >> demo/20k_leagues_mod.txt
+# initial ingest
 pavecli ingest demo books 20k_leagues.txt --docid=verne-20k
+# modify the content (filename can change — docid is what matters)
+cp 20k_leagues.txt 20k_leagues_v2.txt
+echo "THE END" >> 20k_leagues_v2.txt
+# re-ingest with the same docid to replace the indexed content
+pavecli ingest demo books 20k_leagues_v2.txt --docid=verne-20k
 ```
 REST (delete then ingest):
@@ -169,37 +208,24 @@ curl -H "Authorization: Bearer $PATCHVEC_GLOBAL_KEY" \
 ### 🛠️ Developer workflow
-Building from source relies on the `Makefile` shortcuts (`make install-dev`, `USE_CPU=1
-make serve`, `make test`, etc.). The full contributor workflow, target reference, and
-task claiming rules live in [CONTRIBUTING.md](CONTRIBUTING.md). Performance benchmarks
-are documented in [README-benchmarks.md](README-benchmarks.md).
+Building from source relies on `Makefile` shortcuts (`make install-dev`,
+`USE_CPU=1 make serve`, `make test`, `make check`, etc.).
+The full contributor workflow, target reference, and task claiming rules live in
+[CONTRIBUTING.md](CONTRIBUTING.md). Performance benchmarks are documented in
+[README-benchmarks.md](README-benchmarks.md).
 ## Logging
-PatchVec emits two independent log streams.
-**Dev stream** (stderr, always on): human-readable text, colored in TTY.
-Controlled by `log.level` in `config.yml` (`DEBUG`, `INFO`, `WARNING`; default
-`INFO`). Namespace-level overrides:
+PatchVec writes human-readable logs to stderr and optionally emits
+structured JSON lines (one per search/ingest/delete) for production
+observability. Enable the ops stream in `config.yml`:
 ```yaml
 log:
-  level: INFO
-  debug: [pave.stores]   # force DEBUG for specific namespaces
-  watch: [txtai]         # one level more verbose than base
-  quiet: [uvicorn]       # one level quieter (uvicorn is quieted by default)
+  ops_log: stdout   # null (off) | stdout | /path/to/ops.jsonl
 ```
-**Ops stream**: one JSON line per operation —
-search, ingest, delete, rename — written to a configurable destination. Off by
-default; `stdout` is recommended for Docker/12-factor deployments (PatchVec
-already uses `stderr` for the dev stream):
-```yaml
-log:
-  ops_log: null          # null (off) | stdout | /path/to/ops.jsonl
-  access_log: null       # uvicorn access log: null (off) | stdout | /path
-```
+See `config.yml.example` for the full logging configuration.
 ## 🗺️ Roadmap
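The sample search response added to the README can be consumed with a few lines of Python. This is a minimal sketch that assumes the response has exactly the shape shown in the hunk above (`matches`, `meta.docid`, `meta.offset`, `latency_ms`, `request_id`); the helper name `summarize_hits` is hypothetical and not part of PatchVec.

```python
import json

# Sample payload copied from the README hunk above.
SAMPLE = """
{
  "matches": [
    {
      "id": "verne-20k::chunk_42",
      "score": 0.82,
      "text": "Captain Nemo conducted me to the central staircase ...",
      "tenant": "demo",
      "collection": "books",
      "match_reason": "semantic",
      "meta": {
        "docid": "verne-20k",
        "filename": "20k_leagues.txt",
        "offset": 21000,
        "lang": "en",
        "ingested_at": "2026-03-07T12:00:00Z"
      }
    }
  ],
  "latency_ms": 12.4,
  "request_id": "req-5f3a-b812"
}
"""


def summarize_hits(payload: dict) -> list[str]:
    """Hypothetical helper: build one provenance line per match."""
    lines = []
    for hit in payload.get("matches", []):
        meta = hit.get("meta", {})
        lines.append(
            f'{meta.get("docid")} @ offset {meta.get("offset")} '
            f'(score {hit.get("score")})'
        )
    return lines


payload = json.loads(SAMPLE)
summary = summarize_hits(payload)
print(summary)
```

Using `.get()` throughout keeps the sketch tolerant of fields that a real response might omit.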

{patchvec-0.5.8 → patchvec-0.5.8.1/patchvec.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: patchvec
-Version: 0.5.8
+Version: 0.5.8.1
 Summary: Patchvec — A lightweight, pluggable vector search microservice.
 Author: Rodrigo Rodrigues da Silva
 Author-email: rodrigo@flowlexi.com

{patchvec-0.5.8 → patchvec-0.5.8.1}/pave/main.py RENAMED

@@ -40,7 +40,7 @@ from pave.ui import attach_ui
 import pave.log as ops_log
 from pave.log import ops_event
-VERSION = "0.5.8"
+VERSION = "0.5.8.1"
 def _hw_info() -> dict:

{patchvec-0.5.8 → patchvec-0.5.8.1}/pave/preprocess.py RENAMED

@@ -143,8 +143,9 @@ def preprocess(filename: str, content: bytes, csv_options: dict[str, Any] \
             yield f"page_{i}", text, {"page": i}
     elif ext == "txt":
         text = content.decode("utf-8", errors="ignore")
+        step = max(TXT_CHUNK_SIZE - TXT_CHUNK_OVERLAP, 1)
         for i, chunk in enumerate(_chunks(text)):
-            yield f"chunk_{i}", chunk, {"chunk": i}
+            yield f"chunk_{i}", chunk, {"offset": i * step}
     elif ext == "csv" or mt == "text/csv":
         yield from _preprocess_csv(filename, content, csv_options or {})
         return
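The new `offset` value is the chunk index multiplied by the stride between chunk starts. Because the text is decoded before chunking, this is a character offset into the decoded string, not a byte offset. The sketch below illustrates the arithmetic under two assumptions: `_chunks` (whose body is not shown in this diff) is a fixed-size sliding window that advances by `TXT_CHUNK_SIZE - TXT_CHUNK_OVERLAP` characters, and the sizes used here are illustrative rather than PatchVec's real defaults. `sliding_chunks` is a hypothetical stand-in.

```python
# Illustrative values only; PatchVec's actual constants are not shown in this diff.
TXT_CHUNK_SIZE = 10
TXT_CHUNK_OVERLAP = 4


def sliding_chunks(text: str, size: int, overlap: int):
    """Hypothetical stand-in for _chunks: fixed-size windows with overlap."""
    step = max(size - overlap, 1)  # same guard as the patched code
    for start in range(0, len(text), step):
        yield text[start:start + size]


text = "abcdefghijklmnopqrstuvwxyz"
step = max(TXT_CHUNK_SIZE - TXT_CHUNK_OVERLAP, 1)

# Same tuple shape as the patched generator: (local_id, chunk_text, extra_meta).
records = [
    (f"chunk_{i}", chunk, {"offset": i * step})
    for i, chunk in enumerate(
        sliding_chunks(text, TXT_CHUNK_SIZE, TXT_CHUNK_OVERLAP)
    )
]

# Under the sliding-window assumption, each offset is where its chunk starts.
for _, chunk, meta in records:
    start = meta["offset"]
    assert text[start:start + len(chunk)] == chunk
```

If `_chunks` splits on something other than a fixed stride (for example sentence boundaries), `i * step` would only approximate the true start position.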

{patchvec-0.5.8 → patchvec-0.5.8.1}/pave/service.py RENAMED

@@ -160,17 +160,19 @@ def ingest_document(store, tenant: str, collection: str, filename: str, content:
             if baseid and store.has_doc(tenant, collection, baseid):
                 purged = store.purge_doc(tenant, collection, baseid)
                 m_inc("purge_total", purged)
-            meta_doc = metadata or {}
+            meta_from_call = metadata or {}
+            now = datetime.now(tz.utc).isoformat(timespec="seconds")
+            now = now.replace("+00:00", "Z")
+            doc_meta = {
+                "docid": baseid, "filename": filename,
+                "ingested_at": now, **meta_from_call,
+            }
             records = []
             for local_id, text, extra in preprocess(
                 filename, content, csv_options=csv_options
             ):
                 rid = f"{baseid}::{local_id}"
-                now = datetime.now(tz.utc).isoformat(timespec="seconds")
-                now = now.replace("+00:00", "Z")
-                meta = {"docid": baseid, "filename": filename, "ingested_at": now}
-                meta.update(meta_doc)
-                meta.update(extra)
+                meta = {**doc_meta, **extra}
                 records.append((rid, text, meta))
             if not records:
                 return {
@@ -178,7 +180,7 @@ def ingest_document(store, tenant: str, collection: str, filename: str, content:
                     "code": "no_text_extracted",
                     "error": "no text extracted",
                 }
-            count = store.index_records(tenant, collection, baseid, records)
+            count = store.index_records(tenant, collection, baseid, records, doc_meta)
             m_inc("documents_indexed_total", 1.0)
             m_inc("chunks_indexed_total", float(count or 0))
             latency_ms = round((_time.perf_counter() - _t0) * 1000, 2)
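The refactor builds the document-level metadata once and then overlays per-chunk fields using dict unpacking, where later keys win. A small sketch of the resulting precedence follows, with made-up values. `utc_stamp` is a hypothetical helper name, and `timezone` is assumed to be what the `tz` alias in the diff refers to.

```python
from datetime import datetime, timezone


def utc_stamp() -> str:
    """Hypothetical helper mirroring the timestamp format in the diff."""
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return now.replace("+00:00", "Z")


baseid, filename = "verne-20k", "20k_leagues.txt"
meta_from_call = {"lang": "en"}   # caller-supplied metadata
extra = {"offset": 21000}         # per-chunk fields from preprocess()

# Built once per document; caller metadata overrides the defaults on key clashes.
doc_meta = {
    "docid": baseid, "filename": filename,
    "ingested_at": utc_stamp(), **meta_from_call,
}

# Built per chunk; chunk-level fields override document-level ones.
meta = {**doc_meta, **extra}
```

One visible consequence of the change is that every chunk of a document now carries the same `ingested_at` value, whereas the previous code computed a fresh timestamp inside the loop for each chunk.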

{patchvec-0.5.8 → patchvec-0.5.8.1}/pave/stores/base.py RENAMED

@@ -58,7 +58,9 @@ class BaseStore(ABC):
     @abstractmethod
     def index_records(self, tenant: str, collection: str, docid: str,
-                      records: Iterable[Record]) -> int: ...
+                      records: Iterable[Record],
+                      doc_meta: dict[str, Any] | None = None
+                      ) -> int: ...
     @abstractmethod
     def search(self, tenant: str, collection: str, query: str, k: int = 5,

{patchvec-0.5.8 → patchvec-0.5.8.1}/pave/stores/qdrant_store.py RENAMED

@@ -29,7 +29,8 @@ class QdrantStore(BaseStore):
     def purge_doc(self, tenant: str, collection: str, docid: str) -> int:
         raise NotImplementedError("to be implemented")
-    def index_records(self, tenant: str, collection: str, docid: str, records: Iterable[Record]) -> int:
+    def index_records(self, tenant: str, collection: str, docid: str,
+                      records: Iterable[Record], doc_meta: dict | None = None) -> int:
         raise NotImplementedError("to be implemented")
     def search(self, tenant: str, collection: str, text: str, k: int = 5,

{patchvec-0.5.8 → patchvec-0.5.8.1}/pave/stores/txtai_store.py RENAMED

@@ -525,7 +525,9 @@ class TxtaiStore(BaseStore):
         return None
     def index_records(self, tenant: str, collection: str, docid: str,
-                      records: Iterable[Record]) -> int:
+                      records: Iterable[Record],
+                      doc_meta: dict[str, Any] | None = None
+                      ) -> int:
         """
         Ingests records as (rid, text, meta). Guarantees non-null text, coerces
         dict-records, updates SQLite metadata, saves index. Thread critical.
@@ -537,7 +539,6 @@ class TxtaiStore(BaseStore):
             em = self._emb[key]
             prepared: list[tuple[str, Any, str]] = []
             chunk_rows: list[tuple[str, str | None, dict[str, Any]]] = []
-            doc_meta: dict[str, Any] = {}
             for r in records:
                 if isinstance(r, dict):
@@ -569,12 +570,6 @@ class TxtaiStore(BaseStore):
                             md = {}
                 md["docid"] = docid
-                # Capture first occurrence of doc-level meta fields
-                if not doc_meta:
-                    doc_meta = {
-                        k: v for k, v in md.items()
-                        if k not in ("chunk", "page", "position", "section")
-                    }
                 try:
                     safe_meta = self._sanit_meta_dict(md)
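The removed block inferred document-level metadata from the first chunk by filtering out a fixed list of chunk-level keys. The sketch below re-creates that heuristic in isolation to show one plausible reason for passing `doc_meta` explicitly: the new `offset` key emitted by `preprocess.py` is not in the exclusion list, so it would have ended up in the inferred document metadata. `derive_doc_meta_old` is a hypothetical name and the sample values are made up; the key tuple is taken from the removed lines.

```python
from typing import Any

# Exclusion list copied from the removed code.
CHUNK_LEVEL_KEYS = ("chunk", "page", "position", "section")


def derive_doc_meta_old(chunk_metas: list[dict[str, Any]]) -> dict[str, Any]:
    """Re-creation of the removed heuristic: first chunk's non-chunk-level keys."""
    doc_meta: dict[str, Any] = {}
    for md in chunk_metas:
        if not doc_meta:
            doc_meta = {
                k: v for k, v in md.items() if k not in CHUNK_LEVEL_KEYS
            }
    return doc_meta


# Chunk metadata as it would look after the preprocess.py change (made-up values).
chunk_metas = [
    {"docid": "DOCMETA", "lang": "pt", "offset": 0},
    {"docid": "DOCMETA", "lang": "pt", "offset": 500},
]

derived = derive_doc_meta_old(chunk_metas)

# What the caller now passes explicitly (values borrowed from the new test).
explicit = {
    "docid": "DOCMETA", "filename": "meta.txt", "lang": "pt", "source": "api",
}

print(derived)   # includes the first chunk's offset
print(explicit)  # contains only document-level fields
```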

{patchvec-0.5.8 → patchvec-0.5.8.1}/setup.py RENAMED

@@ -17,7 +17,7 @@ long_description, long_type = read_long_description()
 setup(
     name="patchvec",                       # external name
-    version="0.5.8",
+    version="0.5.8.1",
     description="Patchvec — A lightweight, pluggable vector search microservice.",
     long_description=long_description,
     long_description_content_type="text/markdown",

{patchvec-0.5.8 → patchvec-0.5.8.1}/tests/test_cli.py RENAMED

@@ -32,6 +32,37 @@ def test_cli_ingest_on_fresh_collection_with_empty_index_dir(cli_env, tmp_path):
                and c[3] == "DOC1" for c in store.calls)
     assert ("save", tenant, coll) in store.calls
+def test_cli_ingest_passes_doc_meta_through_wrapper(cli_env, tmp_path):
+    pvcli, store, _ = cli_env
+    tenant, coll = "acme", "metawrap"
+    sample = tmp_path / "meta.txt"
+    sample.write_text("conteúdo de teste", encoding="utf-8")
+    pvcli.main_cli(["create-collection", tenant, coll])
+    pvcli.main_cli(
+        [
+            "ingest", tenant, coll, str(sample),
+            "--docid", "DOCMETA",
+            "--metadata", '{"lang":"pt","source":"cli"}',
+        ]
+    )
+    calls = [
+        c for c in store.calls
+        if c[0] == "index_records" and c[1] == tenant
+        and c[2] == coll and c[3] == "DOCMETA"
+    ]
+    assert calls
+    doc_meta = calls[-1][5]
+    assert isinstance(doc_meta, dict)
+    assert doc_meta["docid"] == "DOCMETA"
+    assert doc_meta["lang"] == "pt"
+    assert doc_meta["source"] == "cli"
+    assert doc_meta["filename"].endswith("meta.txt")
+    assert doc_meta["ingested_at"].endswith("Z")
 def test_cli_reingest_same_docid_triggers_purge(cli_env, tmp_path):
     pvcli, store, _ = cli_env
     tenant, coll = "acme", "reupcli"

{patchvec-0.5.8 → patchvec-0.5.8.1}/tests/test_txtai_store.py RENAMED

@@ -99,6 +99,36 @@ def test_meta_json_and_filters(store):
     # Ensure meta was JSON-encoded internally (FakeEmbeddings asserts this)
+def test_doc_level_meta_persists_in_documents_table(store):
+    tenant, coll, docid = "acme", "docmeta", "DOCMETA"
+    recs = [
+        {
+            "id": "0",
+            "content": "Documento com metadados.",
+            "metadata": {"lang": "pt", "chunk": 0},
+        },
+    ]
+    doc_meta = {
+        "docid": docid,
+        "filename": "meta.txt",
+        "lang": "pt",
+        "source": "api",
+    }
+    n = store.index_records(tenant, coll, docid, recs, doc_meta=doc_meta)
+    assert n == 1
+    col_db = store.impl._dbs[(tenant, coll)]
+    conn = col_db._conn
+    assert conn is not None
+    row = conn.execute(
+        "SELECT meta_json FROM documents WHERE docid=?",
+        (docid,),
+    ).fetchone()
+    assert row is not None and row[0]
+    assert json.loads(row[0]) == doc_meta
 def test_purge_doc_removes_ids(store):
     recs = [
         {"id": "y::0", "content": "primeiro", "metadata": {}},