RubyGems - pikuri-vectordb - Versions diffs - 0.0.4 → 0.0.5 - Mend

pikuri-vectordb 0.0.4 → 0.0.5


Files changed (31)

checksums.yaml +4 -4
data/README.md +144 -72
data/lib/pikuri/vector_db/backend/chroma.rb +138 -1
data/lib/pikuri/vector_db/backend/in_memory.rb +123 -24
data/lib/pikuri/vector_db/backend/qdrant.rb +446 -0
data/lib/pikuri/vector_db/backend/result.rb +3 -3
data/lib/pikuri/vector_db/backend.rb +37 -7
data/lib/pikuri/vector_db/chunk.rb +15 -10
data/lib/pikuri/vector_db/extension.rb +59 -57
data/lib/pikuri/vector_db/indexer.rb +202 -14
data/lib/pikuri/vector_db/librarian.rb +27 -19
data/lib/pikuri/vector_db/reranker/llama_server.rb +1 -1
data/lib/pikuri/vector_db/reranker.rb +3 -4
data/lib/pikuri/vector_db/server/chroma.rb +184 -0
data/lib/pikuri/vector_db/server/docker_container.rb +266 -0
data/lib/pikuri/vector_db/server/in_memory.rb +98 -0
data/lib/pikuri/vector_db/server/qdrant.rb +177 -0
data/lib/pikuri/vector_db/server.rb +35 -0
data/lib/pikuri/vector_db/tools/read.rb +206 -0
data/lib/pikuri/vector_db/tools/reindex.rb +94 -0
data/lib/pikuri/vector_db/tools/search.rb +202 -0
data/lib/pikuri/vector_db/tools.rb +19 -0
data/lib/pikuri/vector_db/watcher.rb +353 -0
data/lib/pikuri-vectordb.rb +13 -8
data/prompts/persona-librarian.txt +5 -4
data/prompts/pikuri-corpus.txt +21 -0
metadata +35 -12
data/lib/pikuri/vector_db/chroma_server.rb +0 -309
data/lib/pikuri/vector_db/reindex.rb +0 -86
data/lib/pikuri/vector_db/search.rb +0 -201
data/prompts/pikuri-librarian.txt +0 -22

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 3a64900380a7c48712b1ea58053f141d1c6f5da57e723925ca24319899e91607
-  data.tar.gz: 5ed708f037e0cd979e1a2018b9fabf0d996679befbcca69def2d58ce8aac8fd4
+  metadata.gz: 2560026784f6f00e160c090718c8896835b129fb6d81e4efa1570917aa2414d5
+  data.tar.gz: 22e7054b58ae018b1b4afcde583c70ad979aa4e859348892d3a0640e0541c288
 SHA512:
-  metadata.gz: ada1509afe42590bae569b032cc88fe4ae2ed724a152388fc710b3fe5275f256fee2af6e2db29bb82469d11d651698a653addf10a7679df4b57acadaaab76720
-  data.tar.gz: d25268e97f43a4c73fb4eea1e957a3c9129b4d543e3c78e767318cc784eb0898ec3ba647210b6d969521a2214c6934c5f2aadb387e32f98ca655ff5eee0bb92c
+  metadata.gz: 9c957518955fa72da1d998f041422a32b78301e488c3a5e93eb13cda4a779a81ae83b3e9ddd3041c5972973b9684c039760f77fd6c7f82f2e94a177e361d66b8
+  data.tar.gz: 420446ae9178260b0cc9e0f4fa2950c9330a74c298e83aad9b1d76e73809f67a99b9cd460a177353f63560871b2bf412fd5d3d3e7f9d22360a458946e538aabb
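
The entries above are hex digests of the two archive members a `.gem` file carries (`metadata.gz` and `data.tar.gz`). A minimal sketch of checking them locally, assuming the `.gem` has already been unpacked into a directory and `checksums.yaml` decompressed next to the two files; the helper name `verify_checksums` is hypothetical and not part of RubyGems or pikuri:

```ruby
require 'digest'
require 'yaml'

# Hypothetical helper: compare the digests recorded in checksums.yaml
# against the files sitting next to it in +dir+. Returns the list of
# "ALGORITHM filename" pairs that do not match (empty when all is well).
def verify_checksums(dir)
  recorded = YAML.safe_load(File.read(File.join(dir, 'checksums.yaml')))
  algorithms = { 'SHA256' => Digest::SHA256, 'SHA512' => Digest::SHA512 }
  mismatches = []
  algorithms.each do |name, digest_class|
    (recorded[name] || {}).each do |file, expected|
      actual = digest_class.file(File.join(dir, file)).hexdigest
      mismatches << "#{name} #{file}" unless actual == expected.to_s
    end
  end
  mismatches
end
```

An empty return value means every recorded digest matched the file on disk.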

data/README.md CHANGED

@@ -1,80 +1,138 @@
 # pikuri-vectordb
 Local-corpus vector search + agentic RAG for the
-[pikuri](https://codeberg.org/mvysny/pikuri) AI-assistant toolkit.
-> **Status:** skeleton — gem scaffolding only. The
-> `Pikuri::VectorDb::Extension` and `vectordb_search` tool are
-> being built in subsequent commits. See `IDEAS.md` §"Vector DB /
-> RAG" for the design.
-Will provide:
-- `Pikuri::VectorDb::Extension` — wires a `vectordb_search` tool +
-  a `vectordb_reindex` tool onto a `Pikuri::Agent` via
-  `c.add_extension(...)` inside the `Agent.new` block.
-- `Pikuri::VectorDb::Backend::InMemory` — pure-Ruby cosine over
-  `Array<Float>`. The educational default; reads in ~40 lines.
-  RAM-only; everything reloads from sources on every boot.
-- `Pikuri::VectorDb::Backend::Chroma` — thin Faraday HTTP client
-  against a self-hosted ChromaDB. The persistent option.
-- `Pikuri::VectorDb::Embedder` — thin wrapper over `RubyLLM.embed`
-  so tests can inject a fake without monkey-patching ruby_llm.
-- `Pikuri::VectorDb::Reranker::LlamaServer` — optional quality
-  knob. Speaks `/v1/rerank` against a cross-encoder model on a
-  llama.cpp server. Passing `reranker: nil` to the extension
-  skips reranking; retrieval falls back to vector-only top-k.
-- `Pikuri::VectorDb::Chunker::FixedWindow` + `Tokenizer::*` —
-  the chunking pipeline. Tokenizer is a duck-typed protocol
-  (`count(text) -> Integer`) with two impls in v1:
-  `Tokenizer::CharHeuristic` (default, ~4 chars/token rule) and
-  `Tokenizer::LlamaServer` (POST `/tokenize` against the
-  embedder's endpoint).
-- Text extraction reuses `Pikuri::FileType.read_as_text` from
-  pikuri-core — plain text / Markdown / PDF. HTML extraction
-  is a deferred follow-up; v1 corpora skew toward Markdown
-  notes and PDF docs in practice.
-- `Pikuri::VectorDb::LIBRARIAN` — bundled
-  `Pikuri::SubAgent::Persona` constant. Hosts wire it via
-  `SubAgent::Extension.new(personas: [..., LIBRARIAN])` — same
-  shape `pikuri-code` uses for `GIT_REPO_RESEARCHER`.
+[pikuri](https://codeberg.org/mvysny/pikuri) AI-assistant toolkit:
+semantic recall over a pile of files you point it at — your notes,
+your docs, your contracts — where the *agent* decides when to
+retrieve, same Thought → Tool-call → Observation loop as every other
+tool.
-## Install
+Wire it onto a `pikuri-core` agent the same way as `pikuri-tasks` /
+`pikuri-memory` — `c.add_extension` inside the `Agent.new` block:
 ```ruby
-# Gemfile
-gem 'pikuri-vectordb'
+require 'pikuri-vectordb'
+Pikuri::Agent.new(transport: ..., system_prompt: ...) do |c|
+  c.add_extension Pikuri::VectorDb::Extension.new(
+    backend: Pikuri::VectorDb::Backend::InMemory.new,
+    source:  '~/notes'
+  )
+end
 ```
-## Usage (preview — not yet wired)
+## What you get
+Three tools, registered by the extension:
+1. **`vectordb_search`** — embeds the query, pulls the top-k nearest
+   chunks from the backend, optionally reranks them with a
+   cross-encoder, and hands the agent a numbered list of
+   `source (score=…)` snippets as its next observation.
+2. **`vectordb_read`** — parent-document retrieval: when a search
+   surfaces a clean hit, the agent reads that whole document by its
+   `source` path instead of re-querying for more fragments of it.
+3. **`vectordb_reindex`** — rebuilds the index from the source, on
+   request.
+The extension registers the tools and nothing else — *populating*
+the index is the host's call, never something done behind your
+back. Three equally valid shapes:
+- Index at boot: `extension.indexer.index_if_empty!`.
+- Keep it live: run a `Pikuri::VectorDb::Watcher` around
+  `extension.indexer` — a filesystem-event daemon (the `listen`
+  gem) that sweeps once on boot and reindexes files as they change.
+- Leave it empty and let the user drive: the agent calls
+  `vectordb_reindex` when asked.
+## Backends
+Three implementations of one duck-typed interface
+(`#upsert` / `#query` / `#delete_all` / `#count`) — swapping is a
+one-line change:
+- `Backend::InMemory` — the educational default. Pure-Ruby cosine
+  over `Array<Float>`, ~40 lines, reads in one sitting. RAM-only:
+  everything reloads from sources on every boot.
+- `Backend::Qdrant` — thin Faraday HTTP client against a
+  self-hosted [Qdrant](https://qdrant.tech). **The recommended
+  persistent backend** — [`DESIGN.md`](DESIGN.md) has the engine
+  survey behind the pick.
+- `Backend::Chroma` — the supported
+  [ChromaDB](https://www.trychroma.com) alternative, identical
+  wiring.
+Each persistent engine pairs with a `Server::*` supervisor that
+runs it as a self-managed docker container: pinned image, a
+container name pikuri owns (`pikuri-internal-qdrant` /
+`pikuri-internal-chroma`), data bind-mounted under
+`~/.cache/pikuri/` so the corpus survives container recreation.
 ```ruby
-require 'pikuri-core'
-require 'pikuri-vectordb'
+# Supervised container (needs docker on PATH):
+backend = Pikuri::VectorDb::Server::Qdrant.ensure_running.client(
+  collection: 'my-docs'
+)
-backend = Pikuri::VectorDb::Backend::InMemory.new
-# Or for persistent storage:
-#   backend = Pikuri::VectorDb::Backend::Chroma.new(
-#     host: 'localhost', port: 8000, collection: 'my-docs',
-#   )
-agent = Pikuri::Agent.new(transport: ..., system_prompt: ...) do |c|
-  c.add_extension(
-    Pikuri::VectorDb::Extension.new(
-      backend: backend,
-      source: '~/notes',
-    )
-  )
-end
+# Or point at a Qdrant you already run:
+backend = Pikuri::VectorDb::Backend::Qdrant.new(
+  host: 'localhost', port: 6333, collection: 'my-docs'
+)
+```
+Collection naming is engine-specific so it lives on the backend
+constructor, not on the Extension — `Backend::InMemory` has no
+collection concept.
+## The indexing pipeline
+What `vectordb_reindex` (and the `Watcher`) actually runs, piece by
+piece — each swappable via the Extension's keyword arguments:
+- **Chunker** (`Chunker::FixedWindow`) — overlapping windows,
+  default 512 tokens with 50 of overlap, so an answer straddling a
+  boundary survives in at least one chunk.
+- **Tokenizer** (`Tokenizer::CharHeuristic` default /
+  `Tokenizer::LlamaServer`) — counts tokens for the chunker; the
+  heuristic is the offline ~4-chars-per-token rule, the
+  `LlamaServer` variant asks the embedder's `/tokenize` endpoint
+  for an exact count.
+- **Embedder** — thin wrapper over `RubyLLM.embed`; tests inject a
+  fake `#embed` without monkey-patching ruby_llm.
+- **Reranker** (`Reranker::LlamaServer`, optional) — cross-encoder
+  over `POST /v1/rerank`. Pass `reranker: nil` to skip it;
+  retrieval falls back to vector-only top-k — less precision, same
+  correctness.
+Text extraction reuses `Pikuri::FileType.read_as_text` from
+pikuri-core — plain text / Markdown / PDF. HTML extraction is a
+deferred follow-up.
+## Demo: `pikuri-corpus`
+From a source checkout (not installed by `gem install`):
+```sh
+./pikuri-vectordb/bin/pikuri-corpus --qdrant --watch
 ```
-Collection naming is Chroma-specific so it lives on
-`Backend::Chroma.new(collection:)`, not on the Extension —
-`Backend::InMemory` has no collection concept.
+A single recall agent over `docs/guide/` (the pikuri guide itself)
+with **no egress** — its tools are the three above plus
+`calculator`; no web search, no fetch, no bash. The corpus stands
+in for private data, and an agent that can read it must not also be
+able to send it out. `--qdrant` / `--chroma` persist the index
+across runs, `--watch` keeps it live, `--no-reranker` drops the
+reranker requirement. The guide's
+[chapter 3](../docs/guide/03-vectordb.md) is the full walkthrough.
-For hosts that want recall behind a privilege-separated sub-agent
-(the trifecta-defense pattern — see `SECURITY.md` and `IDEAS.md`
-§"Vector DB / RAG"), additionally wire the `LIBRARIAN` persona
-via `pikuri-subagents`:
+## The LIBRARIAN persona
+For hosts that want recall behind a privilege-separated sub-agent —
+the right shape once the *parent* agent has egress (see
+`SECURITY.md` at the repo root) — the bundled
+`Pikuri::VectorDb::LIBRARIAN` persona is opt-in via
+`pikuri-subagents`:
 ```ruby
 require 'pikuri-subagents'
@@ -88,13 +146,13 @@ c.add_extension(
 ## Three model endpoints
-A full assistant setup wants three LLM endpoints: chat (via
-`ruby_llm`), an embedder (via `RubyLLM.embed`), and an optional
-reranker (HTTP `/v1/rerank`). Recommended setup: **one
-`llama-server` running in router mode** — started with no
-`--model` flag, it serves every GGUF in `~/.cache/llama.cpp/`
-from a single port and loads whichever model each request asks
-for. Requires a recent enough `llama.cpp` build to include the
+A full setup wants three LLM endpoints: chat (via `ruby_llm`), an
+embedder (via `RubyLLM.embed`), and an optional reranker (HTTP
+`/v1/rerank`). Recommended setup: **one `llama-server` running in
+router mode** — started with no `--model` flag, it serves every
+GGUF in `~/.cache/llama.cpp/` from a single port and loads
+whichever model each request asks for. Requires a recent enough
+`llama.cpp` build to include the
 [model-management feature](https://huggingface.co/blog/ggml-org/model-management-in-llamacpp);
 Ubuntu 26.04+ packages one. The guide's
 [chapter 1](../docs/guide/01-chat.md) walks through the setup;
@@ -112,10 +170,24 @@ OpenAI-compatible endpoints and would also work, but pikuri's
 "small enough to audit" ethos keeps the recommended path on
 `llama.cpp` alone.
+## Install
+```ruby
+# Gemfile
+gem 'pikuri-vectordb'
+```
+Depends on `pikuri-core`, `pikuri-subagents` (the `Persona` value
+type `LIBRARIAN` is an instance of), and `listen` (filesystem
+events for the `Watcher`; loaded only when a `Watcher` starts).
 ## Further reading
-- **Design notes:** `IDEAS.md` §"Vector DB / RAG" at the repo
-  root.
+- **Guide chapter:** [Agentic search and the vector
+  DB](../docs/guide/03-vectordb.md) — concepts, model setup, the
+  no-egress argument, `--qdrant --watch` day-to-day shape.
+- **Design notes:** [`DESIGN.md`](DESIGN.md) — the Chroma-vs-Qdrant
+  engine survey.
 - **API reference:** browse the YARD docs at
   <https://rubydoc.info/gems/pikuri-vectordb> (once published),
   or run `bundle exec yard` in this directory for a local copy.
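
The new README describes the chunker as overlapping fixed windows (512 tokens with 50 of overlap by default) and the default tokenizer as a roughly-4-characters-per-token heuristic. A rough sketch of that idea follows. It is not the gem's `Chunker::FixedWindow`; the method names `fixed_windows` and `estimate_tokens` are hypothetical, and the input is any pre-split token array.

```ruby
# Hypothetical stand-in for a fixed-window chunker: slides a window of
# +size+ tokens forward by (size - overlap) tokens per step, so each
# window shares its first +overlap+ tokens with the previous one.
def fixed_windows(tokens, size: 512, overlap: 50)
  raise ArgumentError, 'size must be positive' unless size.positive?
  raise ArgumentError, 'overlap must be in 0...size' unless (0...size).cover?(overlap)

  step = size - overlap
  windows = []
  start = 0
  while start < tokens.size
    windows << tokens[start, size]
    # Stop once this window already reaches the end of the input.
    break if start + size >= tokens.size

    start += step
  end
  windows
end

# The ~4-chars-per-token rule of thumb as an offline token estimate.
def estimate_tokens(text)
  (text.length / 4.0).ceil
end

p fixed_windows((1..10).to_a, size: 4, overlap: 1)
# => [[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10]]
```

The shared token at each boundary (4 and 7 in the small example) is what lets a passage that straddles a boundary survive intact in at least one window.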

data/lib/pikuri/vector_db/backend/chroma.rb CHANGED

@@ -13,6 +13,14 @@ module Pikuri
       # on empty input + non-positive +top_k+. Where the two
       # diverge is the vector-dim contract — see below.
       #
+      # The client is hand-rolled rather than a dependency on a
+      # +chroma-db+ gem: only a handful of v2 endpoints are needed
+      # (listed below), Faraday is already in the dependency closure,
+      # and a thin first-party client keeps the wire protocol
+      # auditable in one readable file — consistent with the
+      # read-it-in-an-evening ceiling. The cost is tracking Chroma's
+      # v2 API by hand if it changes.
+      #
       # == Two ways to get one
       #
       # * **Bring your own.** +Backend::Chroma.new(host:, port:,
@@ -21,7 +29,7 @@ module Pikuri
       #   already running on the host for an unrelated project).
       #   The host owns the process; this class is purely the
       #   HTTP client.
-      # * **Let pikuri manage it.** {ChromaServer.ensure_running}
+      # * **Let pikuri manage it.** {Server::Chroma.ensure_running}
       #   spawns and supervises a chroma container under the
       #   +pikuri-internal-chroma+ name, against a pinned image,
       #   with a bind-mounted volume in the user's cache dir.
@@ -46,6 +54,11 @@ module Pikuri
       #   count.
       # * +DELETE /api/v2/.../collections/{id}+ — drop the
       #   collection (used by +#delete_all+).
+      # * +POST /api/v2/.../collections/{id}/delete+ — metadata-
+      #   filtered delete (+{where:}+); used by +#delete_by_source+.
+      # * +POST /api/v2/.../collections/{id}/get+ — fetch rows by
+      #   +{where:}+ filter with an +include:+ projection; used by
+      #   +#sources_with_hashes+.
       #
       # == BYO embeddings (not Chroma's embedder)
       #
@@ -110,6 +123,15 @@ module Pikuri
       # drift. Real-Chroma smoke testing is wired into the demo
       # binary in a later phase. Targets Chroma 0.5.x+ (v2 API).
       class Chroma
+        # Rows per +/get+ page in {#sources_with_hashes}. Caps the
+        # JSON burst + parse working set of the boot manifest read on
+        # a large corpus; small corpora finish in one page. Chunky but
+        # not arbitrary — one round trip per this-many *files*, and the
+        # manifest is one row per file (the +offset 0+ chunk), so a
+        # 50k-file corpus is ~50 localhost round trips instead of one
+        # multi-MB response.
+        MANIFEST_PAGE_SIZE = 1_000
         # @param host [String]
         # @param port [Integer]
         # @param collection [String] collection name in Chroma.
@@ -271,6 +293,121 @@ module Pikuri
           raise "Backend::Chroma: count response was not an Integer (got #{body.inspect})"
         end
+        # Remove every chunk whose +source+ matches, via a
+        # metadata-filtered +POST .../delete+ (+source+ is the
+        # reserved metadata key {#upsert} writes). The scoped
+        # counterpart to {#delete_all}. No-op when the collection
+        # doesn't exist yet.
+        #
+        # @param source [String] the {Chunk#source} to purge.
+        # @return [void]
+        # @raise [RuntimeError] on HTTP failure.
+        def delete_by_source(source)
+          return nil if @collection_id.nil? && !collection_exists?
+          post_json("#{collection_path}/delete", { where: { 'source' => source } })
+          nil
+        end
+        # Replace all chunks for one +source+: delete the old set,
+        # then upsert the new one. The incremental-reindex unit
+        # (see {Indexer#reindex_file!}).
+        #
+        # == Not transactional (the InMemory divergence)
+        #
+        # These are two HTTP calls, so a +#query+ landing between
+        # them can see the source with zero chunks — a window
+        # {InMemory#replace_source} closes with its monitor but
+        # Chroma cannot, short of server-side transactions it
+        # doesn't expose. The window is small and the
+        # {Indexer} mitigates the *common* failure: it embeds
+        # before calling here, so an embedder outage never reaches
+        # this method and the old chunks stay put. Delete-then-upsert
+        # (not the reverse): upserting first then deleting by source
+        # would delete the just-written chunks.
+        #
+        # @param source [String] the {Chunk#source} being replaced.
+        # @param chunks [Array<Chunk>] the new chunk set.
+        # @param vectors [Array<Array<Float>>] parallel to +chunks+.
+        # @return [void]
+        # @raise [ArgumentError] on empty input or length mismatch.
+        # @raise [RuntimeError] on HTTP failure.
+        def replace_source(source:, chunks:, vectors:)
+          delete_by_source(source)
+          upsert(chunks: chunks, vectors: vectors)
+          nil
+        end
+        # The boot-sweep reference: +source+ → stored content hash
+        # for every indexed document. Reads one metadata row per
+        # *file*, not per chunk, via three Chroma +/get+ knobs:
+        #
+        # * +where: { offset: 0 }+ — every file has exactly one
+        #   chunk at offset 0, so this returns one row per source.
+        # * +include: ['metadatas']+ — drops the heavy +embeddings+
+        #   and +documents+ from the response; we pull only the
+        #   metadata projection, never the vectors.
+        # * +limit+ / +offset+ — page the read in
+        #   {MANIFEST_PAGE_SIZE} chunks so a large corpus never
+        #   materializes one multi-MB response. (Two unrelated
+        #   +offset+s collide in the wording: the +where+ +offset+ is
+        #   a *chunk* metadata field; the top-level +offset+ is the
+        #   *pagination* cursor — different namespaces in the API.)
+        #
+        # Pagination assumes the manifest isn't mutating mid-read; the
+        # {Watcher} drives this from its single worker thread, so no
+        # reindex runs concurrently with the boot sweep that calls it.
+        #
+        # @return [Hash{String => String, nil}] +source+ → content
+        #   hash. Empty when the collection doesn't exist yet.
+        # @raise [RuntimeError] on HTTP failure.
+        def sources_with_hashes
+          return {} if @collection_id.nil? && !collection_exists?
+          result = {}
+          cursor = 0
+          loop do
+            body = post_json("#{collection_path}/get", {
+                               where: { 'offset' => 0 },
+                               include: ['metadatas'],
+                               limit: MANIFEST_PAGE_SIZE,
+                               offset: cursor
+                             })
+            metas = body.is_a?(Hash) ? (body['metadatas'] || []) : []
+            metas.each do |meta|
+              next unless meta.is_a?(Hash) && meta['source']
+              result[meta['source']] = meta['hash']
+            end
+            break if metas.size < MANIFEST_PAGE_SIZE
+            cursor += metas.size
+          end
+          result
+        end
+        # Is +source+ in the corpus? Scoped existence check for
+        # {VectorDb::Tools::Read}'s membership gate: a +where+-filtered
+        # +/get+ capped at one row, +include: []+ so the response
+        # carries only ids — O(1) transport regardless of corpus
+        # size, never the full {#sources_with_hashes} manifest. See
+        # the Backend protocol yardoc.
+        #
+        # @param source [String] the {Chunk#source} to test.
+        # @return [Boolean] true if at least one chunk has this source.
+        # @raise [RuntimeError] on HTTP failure.
+        def source_indexed?(source)
+          return false if @collection_id.nil? && !collection_exists?
+          body = post_json("#{collection_path}/get", {
+                             where: { 'source' => source },
+                             include: [],
+                             limit: 1
+                           })
+          ids = body.is_a?(Hash) ? (body['ids'] || []) : []
+          !ids.empty?
+        end
         private
         def collections_path
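
The paging loop in `#sources_with_hashes` can be exercised without a server. The sketch below keeps the same stop condition (a page shorter than the page size ends the read) but swaps the HTTP call for a block; `collect_manifest` and the block's `limit:` / `offset:` keywords are hypothetical names used only for illustration.

```ruby
# Hypothetical: the block stands in for the HTTP /get call and returns
# an Array of metadata Hashes for the requested limit/offset.
def collect_manifest(page_size:, &fetch_page)
  result = {}
  cursor = 0
  loop do
    metas = fetch_page.call(limit: page_size, offset: cursor)
    metas.each do |meta|
      next unless meta.is_a?(Hash) && meta['source']

      result[meta['source']] = meta['hash']
    end
    # A short page means there is nothing further to read.
    break if metas.size < page_size

    cursor += metas.size
  end
  result
end

rows = [
  { 'source' => 'a.md', 'hash' => 'h1' },
  { 'source' => 'b.md', 'hash' => 'h2' },
  { 'source' => 'c.md', 'hash' => nil }
]
manifest = collect_manifest(page_size: 2) do |limit:, offset:|
  rows[offset, limit] || []
end
p manifest
```

One property of this stop condition: when the row count is an exact multiple of the page size, the loop makes one extra request that returns an empty page before it stops.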

data/lib/pikuri/vector_db/backend/in_memory.rb CHANGED

@@ -1,12 +1,13 @@
 # frozen_string_literal: true
+require 'monitor'
 module Pikuri
   module VectorDb
     module Backend
-      # Pure-Ruby vector store. The educational default backend;
-      # IDEAS.md §"Vector DB / RAG" frames it as the "small enough
-      # to audit" first stop the demo + guide walk through before
-      # promoting users to +Chroma+ for persistence.
+      # Pure-Ruby vector store. The educational default backend —
+      # the "small enough to audit" first stop the demo + guide walk
+      # through before promoting users to +Chroma+ for persistence.
       #
       # == What it does
       #
@@ -30,9 +31,23 @@ module Pikuri
       # * **No approximate search.** Exhaustive scan. Approximate
       #   nearest neighbor (HNSW, IVF) adds complexity that doesn't
       #   teach anything additional once the cosine math is clear.
-      # * **No thread safety.** {Indexer} runs single-threaded
-      #   during a boot or reindex; {Search} calls +#query+ from
-      #   the agent's main thread. No concurrent access today.
+      # * **No approximate-search index.** Exhaustive scan only.
+      #
+      # == Thread safety
+      #
+      # Every public method runs under a single reentrant
+      # +Monitor+. The agent's main thread calls +#query+ while a
+      # background {Watcher} thread calls +#replace_source+ /
+      # +#delete_by_source+, so concurrent access is real once
+      # auto-watch is wired. The lock's load-bearing job is
+      # {#replace_source}: it holds the monitor across the
+      # delete-then-upsert so a concurrent +#query+ never observes
+      # the gap where a source has zero chunks. +Monitor+ (not a
+      # bare +Mutex+) because +#replace_source+ re-enters the lock
+      # via +#delete_by_source+ + +#upsert+, which a non-reentrant
+      # +Mutex+ would deadlock on. +Chroma+ needs no client-side
+      # lock — the server serializes — so this is the one backend
+      # that locks.
       #
       # == Cosine, not dot product
       #
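
The "Thread safety" comment above turns on `Monitor` being reentrant while `Mutex` is not. A small standalone sketch of that difference, using a hypothetical `TinyStore` class rather than the gem's `Backend::InMemory`:

```ruby
require 'monitor'

# Hypothetical miniature store: every public method takes the same
# Monitor, and replace_prefix re-enters it through the other methods.
class TinyStore
  def initialize
    @entries = {}
    @lock = Monitor.new
  end

  def put(key, value)
    @lock.synchronize { @entries[key] = value }
    nil
  end

  def delete_prefix(prefix)
    @lock.synchronize { @entries.reject! { |k, _v| k.start_with?(prefix) } }
    nil
  end

  # Holds the lock across delete + insert, so a reader on another
  # thread never sees the prefix with zero entries in between.
  def replace_prefix(prefix, pairs)
    @lock.synchronize do
      delete_prefix(prefix)
      pairs.each { |k, v| put(k, v) }
    end
    nil
  end

  def keys
    @lock.synchronize { @entries.keys.sort }
  end
end

store = TinyStore.new
store.put('a/1', :old)
store.replace_prefix('a/', { 'a/2' => :new })
p store.keys

# The same nesting with a plain Mutex does not re-enter: Ruby detects
# the recursive lock and raises ThreadError.
mutex = Mutex.new
begin
  mutex.synchronize { mutex.synchronize { :unreachable } }
rescue ThreadError => e
  puts "Mutex: #{e.class}"
end
```

In Ruby the recursive `Mutex` lock surfaces as a `ThreadError` rather than a silent hang, which is the failure the comment's "would deadlock on" refers to.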
@@ -53,6 +68,10 @@ module Pikuri
           # enforced for every subsequent +#upsert+ + +#query+ — see
           # the Backend protocol's "Vector-dim contract" yardoc.
           @dim = nil
+          # Reentrant so +#replace_source+ can call +#delete_by_source+
+          # + +#upsert+ while holding the lock — see the class yardoc's
+          # "Thread safety" section.
+          @lock = Monitor.new
         end
         # Insert-or-replace by +chunk.id+. Parallel arrays of
@@ -71,15 +90,17 @@ module Pikuri
             raise ArgumentError, "size mismatch: #{chunks.size} chunks vs #{vectors.size} vectors"
           end
-          expected = @dim || vectors.first.size
-          vectors.each_with_index do |v, i|
-            next if v.size == expected
+          @lock.synchronize do
+            expected = @dim || vectors.first.size
+            vectors.each_with_index do |v, i|
+              next if v.size == expected
-            raise ArgumentError, "vector #{i} has dim #{v.size}, expected #{expected}"
-          end
-          @dim ||= expected
+              raise ArgumentError, "vector #{i} has dim #{v.size}, expected #{expected}"
+            end
+            @dim ||= expected
-          chunks.zip(vectors).each { |chunk, vector| @entries[chunk.id] = [chunk, vector] }
+            chunks.zip(vectors).each { |chunk, vector| @entries[chunk.id] = [chunk, vector] }
+          end
           nil
         end
@@ -96,16 +117,19 @@ module Pikuri
         #   dim mismatch.
         def query(vector:, top_k:)
           raise ArgumentError, "top_k must be positive (got #{top_k})" if top_k <= 0
-          return [] if @entries.empty?
-          if vector.size != @dim
-            raise ArgumentError, "query vector dim #{vector.size}, stored dim #{@dim}"
-          end
+          @lock.synchronize do
+            return [] if @entries.empty?
-          scored = @entries.values.map do |chunk, stored|
-            Result.new(chunk: chunk, score: cosine(vector, stored))
+            if vector.size != @dim
+              raise ArgumentError, "query vector dim #{vector.size}, stored dim #{@dim}"
+            end
+            scored = @entries.values.map do |chunk, stored|
+              Result.new(chunk: chunk, score: cosine(vector, stored))
+            end
+            scored.sort_by { |r| -r.score }.first(top_k)
           end
-          scored.sort_by { |r| -r.score }.first(top_k)
         end
         # Drop every stored chunk. Used by the v1 nuke-and-reload
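
The `#query` body above is an exhaustive scan: score every stored vector against the query, sort descending, keep the first `top_k`. A self-contained sketch of that shape follows. The `cosine` here is a plain textbook version and may differ from the gem's private helper (the zero-vector handling in particular is an assumption), and `top_k` is a hypothetical name.

```ruby
# Cosine similarity: dot product divided by the product of the norms.
# Returning 0.0 for a zero-length vector is an assumption of this sketch.
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norm.zero? ? 0.0 : dot / norm
end

# Score every entry, sort by descending score, keep the best +k+.
def top_k(query, entries, k)
  entries
    .map { |id, vec| [id, cosine(query, vec)] }
    .sort_by { |_id, score| -score }
    .first(k)
end

entries = { 'x' => [1.0, 0.0], 'y' => [0.0, 1.0], 'z' => [2.0, 2.0] }
p top_k([1.0, 0.0], entries, 2).map(&:first)
# => ["x", "z"]
```

Because cosine normalizes by vector length, `z` ranks on its direction alone; its larger magnitude does not lift it above `x`.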
@@ -114,14 +138,89 @@ module Pikuri
         #
         # @return [void]
         def delete_all
-          @entries.clear
-          @dim = nil
+          @lock.synchronize do
+            @entries.clear
+            @dim = nil
+          end
           nil
         end
         # @return [Integer] current chunk count.
         def count
-          @entries.size
+          @lock.synchronize { @entries.size }
+        end
+        # Remove every chunk whose +source+ matches. The scoped
+        # counterpart to {#delete_all} — drops one document's chunks
+        # without touching the rest. No-op (and no error) when the
+        # source isn't present. The dim lock is left intact: unlike
+        # {#delete_all}, a per-source delete doesn't imply an
+        # embedder change.
+        #
+        # @param source [String] the {Chunk#source} to purge, e.g.
+        #   +"notes/cooking.md"+.
+        # @return [void]
+        def delete_by_source(source)
+          @lock.synchronize do
+            @entries.reject! { |_id, (chunk, _vector)| chunk.source == source }
+          end
+          nil
+        end
+        # Atomically replace all chunks for one +source+: delete the
+        # old set, then upsert the new one, under a single hold of the
+        # monitor. The incremental-reindex unit (see {Indexer#reindex_file!}).
+        # Holding the lock across both halves is the point — a
+        # concurrent {#query} sees either the old chunks or the new
+        # ones, never the empty gap between.
+        #
+        # @param source [String] the {Chunk#source} being replaced.
+        # @param chunks [Array<Chunk>] the new chunk set; every
+        #   +chunk.source+ should equal +source+.
+        # @param vectors [Array<Array<Float>>] parallel to +chunks+.
+        # @return [void]
+        # @raise [ArgumentError] on empty input, length mismatch, or
+        #   vector-dim mismatch (from the inner {#upsert}).
+        def replace_source(source:, chunks:, vectors:)
+          @lock.synchronize do
+            delete_by_source(source)
+            upsert(chunks: chunks, vectors: vectors)
+          end
+          nil
+        end
+        # The boot-sweep reference: a map from each indexed +source+
+        # to the content hash stored on its chunks. {Watcher} (via
+        # {Indexer#reconcile_plan}) diffs this against the hashes of
+        # the files currently on disk to decide what to reindex.
+        # Built from chunk metadata; a chunk indexed before the
+        # +hash+ metadata existed maps its source to +nil+, which the
+        # diff treats as "changed" and reindexes — self-healing.
+        #
+        # @return [Hash{String => String, nil}] +source+ → content
+        #   hash. Empty when nothing is indexed (the InMemory case at
+        #   every boot, since RAM resets).
+        def sources_with_hashes
+          @lock.synchronize do
+            result = {}
+            @entries.each_value do |chunk, _vector|
+              result[chunk.source] ||= chunk.metadata[:hash]
+            end
+            result
+          end
+        end
+        # Is +source+ in the corpus? The scoped membership test
+        # behind {VectorDb::Tools::Read}'s gate — a short-circuiting scan
+        # rather than building the whole {#sources_with_hashes} map
+        # just to read one key. See the Backend protocol yardoc.
+        #
+        # @param source [String] the {Chunk#source} to test.
+        # @return [Boolean] true if at least one chunk has this source.
+        def source_indexed?(source)
+          @lock.synchronize do
+            @entries.each_value.any? { |chunk, _vector| chunk.source == source }
+          end
         end
         private
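
The `#sources_with_hashes` comment says the boot sweep diffs the stored `source` → hash map against the files currently on disk, and that a `nil` stored hash counts as "changed". A rough sketch of such a diff follows; the `reconcile` method and its return shape are hypothetical and are not taken from `Indexer#reconcile_plan`, whose body is not part of this diff. Hashes are treated as opaque strings.

```ruby
# Hypothetical diff of two maps:
#   indexed: source => stored content hash (may be nil)
#   on_disk: source => current content hash
# Returns which sources to (re)index and which to purge.
def reconcile(indexed, on_disk)
  # New files and changed files both fail the equality check; a nil
  # stored hash never equals a real hash, so it is reindexed too.
  reindex = on_disk.reject { |source, hash| indexed[source] == hash }.keys
  # Anything indexed but no longer on disk should be deleted.
  delete = indexed.keys - on_disk.keys
  { reindex: reindex.sort, delete: delete.sort }
end

indexed = { 'a.md' => 'h1', 'b.md' => nil, 'c.md' => 'h3' }
on_disk = { 'a.md' => 'h1', 'b.md' => 'h2', 'd.md' => 'h4' }
p reconcile(indexed, on_disk)
# => {:reindex=>["b.md", "d.md"], :delete=>["c.md"]}
```

With an in-memory backend the `indexed` map is empty at every boot, so this diff degenerates to "reindex everything on disk".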