pikuri-vectordb 0.0.4 → 0.0.5

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 3a64900380a7c48712b1ea58053f141d1c6f5da57e723925ca24319899e91607
4
- data.tar.gz: 5ed708f037e0cd979e1a2018b9fabf0d996679befbcca69def2d58ce8aac8fd4
3
+ metadata.gz: 2560026784f6f00e160c090718c8896835b129fb6d81e4efa1570917aa2414d5
4
+ data.tar.gz: 22e7054b58ae018b1b4afcde583c70ad979aa4e859348892d3a0640e0541c288
5
5
  SHA512:
6
- metadata.gz: ada1509afe42590bae569b032cc88fe4ae2ed724a152388fc710b3fe5275f256fee2af6e2db29bb82469d11d651698a653addf10a7679df4b57acadaaab76720
7
- data.tar.gz: d25268e97f43a4c73fb4eea1e957a3c9129b4d543e3c78e767318cc784eb0898ec3ba647210b6d969521a2214c6934c5f2aadb387e32f98ca655ff5eee0bb92c
6
+ metadata.gz: 9c957518955fa72da1d998f041422a32b78301e488c3a5e93eb13cda4a779a81ae83b3e9ddd3041c5972973b9684c039760f77fd6c7f82f2e94a177e361d66b8
7
+ data.tar.gz: 420446ae9178260b0cc9e0f4fa2950c9330a74c298e83aad9b1d76e73809f67a99b9cd460a177353f63560871b2bf412fd5d3d3e7f9d22360a458946e538aabb
data/README.md CHANGED
@@ -1,80 +1,138 @@
1
1
  # pikuri-vectordb
2
2
 
3
3
  Local-corpus vector search + agentic RAG for the
4
- [pikuri](https://codeberg.org/mvysny/pikuri) AI-assistant toolkit.
5
-
6
- > **Status:** skeleton gem scaffolding only. The
7
- > `Pikuri::VectorDb::Extension` and `vectordb_search` tool are
8
- > being built in subsequent commits. See `IDEAS.md` §"Vector DB /
9
- > RAG" for the design.
10
-
11
- Will provide:
12
-
13
- - `Pikuri::VectorDb::Extension` — wires a `vectordb_search` tool +
14
- a `vectordb_reindex` tool onto a `Pikuri::Agent` via
15
- `c.add_extension(...)` inside the `Agent.new` block.
16
- - `Pikuri::VectorDb::Backend::InMemory` — pure-Ruby cosine over
17
- `Array<Float>`. The educational default; reads in ~40 lines.
18
- RAM-only; everything reloads from sources on every boot.
19
- - `Pikuri::VectorDb::Backend::Chroma` — thin Faraday HTTP client
20
- against a self-hosted ChromaDB. The persistent option.
21
- - `Pikuri::VectorDb::Embedder` — thin wrapper over `RubyLLM.embed`
22
- so tests can inject a fake without monkey-patching ruby_llm.
23
- - `Pikuri::VectorDb::Reranker::LlamaServer` — optional quality
24
- knob. Speaks `/v1/rerank` against a cross-encoder model on a
25
- llama.cpp server. Passing `reranker: nil` to the extension
26
- skips reranking; retrieval falls back to vector-only top-k.
27
- - `Pikuri::VectorDb::Chunker::FixedWindow` + `Tokenizer::*` —
28
- the chunking pipeline. Tokenizer is a duck-typed protocol
29
- (`count(text) -> Integer`) with two impls in v1:
30
- `Tokenizer::CharHeuristic` (default, ~4 chars/token rule) and
31
- `Tokenizer::LlamaServer` (POST `/tokenize` against the
32
- embedder's endpoint).
33
- - Text extraction reuses `Pikuri::FileType.read_as_text` from
34
- pikuri-core — plain text / Markdown / PDF. HTML extraction
35
- is a deferred follow-up; v1 corpora skew toward Markdown
36
- notes and PDF docs in practice.
37
- - `Pikuri::VectorDb::LIBRARIAN` — bundled
38
- `Pikuri::SubAgent::Persona` constant. Hosts wire it via
39
- `SubAgent::Extension.new(personas: [..., LIBRARIAN])` — same
40
- shape `pikuri-code` uses for `GIT_REPO_RESEARCHER`.
4
+ [pikuri](https://codeberg.org/mvysny/pikuri) AI-assistant toolkit:
5
+ semantic recall over a pile of files you point it at — your notes,
6
+ your docs, your contracts, where the *agent* decides when to
7
+ retrieve, same Thought → Tool-call → Observation loop as every other
8
+ tool.
41
9
 
42
- ## Install
10
+ Wire it onto a `pikuri-core` agent the same way as `pikuri-tasks` /
11
+ `pikuri-memory` — `c.add_extension` inside the `Agent.new` block:
43
12
 
44
13
  ```ruby
45
- # Gemfile
46
- gem 'pikuri-vectordb'
14
+ require 'pikuri-vectordb'
15
+
16
+ Pikuri::Agent.new(transport: ..., system_prompt: ...) do |c|
17
+ c.add_extension Pikuri::VectorDb::Extension.new(
18
+ backend: Pikuri::VectorDb::Backend::InMemory.new,
19
+ source: '~/notes'
20
+ )
21
+ end
47
22
  ```
48
23
 
49
- ## Usage (preview — not yet wired)
24
+ ## What you get
25
+
26
+ Three tools, registered by the extension:
27
+
28
+ 1. **`vectordb_search`** — embeds the query, pulls the top-k nearest
29
+ chunks from the backend, optionally reranks them with a
30
+ cross-encoder, and hands the agent a numbered list of
31
+ `source (score=…)` snippets as its next observation.
32
+ 2. **`vectordb_read`** — parent-document retrieval: when a search
33
+ surfaces a clean hit, the agent reads that whole document by its
34
+ `source` path instead of re-querying for more fragments of it.
35
+ 3. **`vectordb_reindex`** — rebuilds the index from the source, on
36
+ request.
37
+
38
+ The extension registers the tools and nothing else — *populating*
39
+ the index is the host's call, never something done behind your
40
+ back. Three equally valid shapes:
41
+
42
+ - Index at boot: `extension.indexer.index_if_empty!`.
43
+ - Keep it live: run a `Pikuri::VectorDb::Watcher` around
44
+ `extension.indexer` — a filesystem-event daemon (the `listen`
45
+ gem) that sweeps once on boot and reindexes files as they change.
46
+ - Leave it empty and let the user drive: the agent calls
47
+ `vectordb_reindex` when asked.
48
+
49
+ ## Backends
50
+
51
+ Three implementations of one duck-typed interface
52
+ (`#upsert` / `#query` / `#delete_all` / `#count`) — swapping is a
53
+ one-line change:
54
+
55
+ - `Backend::InMemory` — the educational default. Pure-Ruby cosine
56
+ over `Array<Float>`, ~40 lines, reads in one sitting. RAM-only:
57
+ everything reloads from sources on every boot.
58
+ - `Backend::Qdrant` — thin Faraday HTTP client against a
59
+ self-hosted [Qdrant](https://qdrant.tech). **The recommended
60
+ persistent backend** — [`DESIGN.md`](DESIGN.md) has the engine
61
+ survey behind the pick.
62
+ - `Backend::Chroma` — the supported
63
+ [ChromaDB](https://www.trychroma.com) alternative, identical
64
+ wiring.
65
+
66
+ Each persistent engine pairs with a `Server::*` supervisor that
67
+ runs it as a self-managed docker container: pinned image, a
68
+ container name pikuri owns (`pikuri-internal-qdrant` /
69
+ `pikuri-internal-chroma`), data bind-mounted under
70
+ `~/.cache/pikuri/` so the corpus survives container recreation.
50
71
 
51
72
  ```ruby
52
- require 'pikuri-core'
53
- require 'pikuri-vectordb'
73
+ # Supervised container (needs docker on PATH):
74
+ backend = Pikuri::VectorDb::Server::Qdrant.ensure_running.client(
75
+ collection: 'my-docs'
76
+ )
54
77
 
55
- backend = Pikuri::VectorDb::Backend::InMemory.new
56
- # Or for persistent storage:
57
- # backend = Pikuri::VectorDb::Backend::Chroma.new(
58
- # host: 'localhost', port: 8000, collection: 'my-docs',
59
- # )
60
- agent = Pikuri::Agent.new(transport: ..., system_prompt: ...) do |c|
61
- c.add_extension(
62
- Pikuri::VectorDb::Extension.new(
63
- backend: backend,
64
- source: '~/notes',
65
- )
66
- )
67
- end
78
+ # Or point at a Qdrant you already run:
79
+ backend = Pikuri::VectorDb::Backend::Qdrant.new(
80
+ host: 'localhost', port: 6333, collection: 'my-docs'
81
+ )
82
+ ```
83
+
84
+ Collection naming is engine-specific so it lives on the backend
85
+ constructor, not on the Extension — `Backend::InMemory` has no
86
+ collection concept.
87
+
88
+ ## The indexing pipeline
89
+
90
+ What `vectordb_reindex` (and the `Watcher`) actually runs, piece by
91
+ piece — each swappable via the Extension's keyword arguments:
92
+
93
+ - **Chunker** (`Chunker::FixedWindow`) — overlapping windows,
94
+ default 512 tokens with 50 of overlap, so an answer straddling a
95
+ boundary survives in at least one chunk.
96
+ - **Tokenizer** (`Tokenizer::CharHeuristic` default /
97
+ `Tokenizer::LlamaServer`) — counts tokens for the chunker; the
98
+ heuristic is the offline ~4-chars-per-token rule, the
99
+ `LlamaServer` variant asks the embedder's `/tokenize` endpoint
100
+ for an exact count.
101
+ - **Embedder** — thin wrapper over `RubyLLM.embed`; tests inject a
102
+ fake `#embed` without monkey-patching ruby_llm.
103
+ - **Reranker** (`Reranker::LlamaServer`, optional) — cross-encoder
104
+ over `POST /v1/rerank`. Pass `reranker: nil` to skip it;
105
+ retrieval falls back to vector-only top-k — less precision, same
106
+ correctness.
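
The chunker and tokenizer bullets above can be sketched in a few lines. This is an illustration only, not the gem's `Chunker::FixedWindow`: the names `estimate_tokens` and `fixed_windows` are hypothetical, and the sketch works on a pre-split token array rather than raw text. The 512/50 defaults and the ~4-chars-per-token rule come from the README.

```ruby
# Hypothetical stand-in for Tokenizer::CharHeuristic: ~4 characters per token.
def estimate_tokens(text)
  (text.length / 4.0).ceil
end

# Hypothetical sketch of overlapping fixed windows over an array of tokens.
# Each window starts (size - overlap) tokens after the previous one, so the
# last `overlap` tokens of one window reappear at the start of the next.
def fixed_windows(tokens, size: 512, overlap: 50)
  raise ArgumentError, 'size must be positive' unless size.positive?
  raise ArgumentError, 'overlap must be in 0...size' unless (0...size).cover?(overlap)

  step = size - overlap
  windows = []
  start = 0
  while start < tokens.length
    windows << tokens[start, size]
    break if start + size >= tokens.length # this window reached the end

    start += step
  end
  windows
end

words = %w[a b c d e f g h i j]
windows = fixed_windows(words, size: 4, overlap: 1)
# windows: [a b c d], [d e f g], [g h i j]
```

With `size: 4, overlap: 1` the boundary tokens `d` and `g` each appear in two windows, which is the property that lets an answer straddling a boundary survive in at least one chunk.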
107
+
108
+ Text extraction reuses `Pikuri::FileType.read_as_text` from
109
+ pikuri-core — plain text / Markdown / PDF. HTML extraction is a
110
+ deferred follow-up.
111
+
112
+ ## Demo: `pikuri-corpus`
113
+
114
+ From a source checkout (not installed by `gem install`):
115
+
116
+ ```sh
117
+ ./pikuri-vectordb/bin/pikuri-corpus --qdrant --watch
68
118
  ```
69
119
 
70
- Collection naming is Chroma-specific so it lives on
71
- `Backend::Chroma.new(collection:)`, not on the Extension;
72
- `Backend::InMemory` has no collection concept.
120
+ A single recall agent over `docs/guide/` (the pikuri guide itself)
121
+ with **no egress** — its tools are the three above plus
122
+ `calculator`; no web search, no fetch, no bash. The corpus stands
123
+ in for private data, and an agent that can read it must not also be
124
+ able to send it out. `--qdrant` / `--chroma` persist the index
125
+ across runs, `--watch` keeps it live, `--no-reranker` drops the
126
+ reranker requirement. The guide's
127
+ [chapter 3](../docs/guide/03-vectordb.md) is the full walkthrough.
73
128
 
74
- For hosts that want recall behind a privilege-separated sub-agent
75
- (the trifecta-defense pattern — see `SECURITY.md` and `IDEAS.md`
76
- §"Vector DB / RAG"), additionally wire the `LIBRARIAN` persona
77
- via `pikuri-subagents`:
129
+ ## The LIBRARIAN persona
130
+
131
+ For hosts that want recall behind a privilege-separated sub-agent —
132
+ the right shape once the *parent* agent has egress (see
133
+ `SECURITY.md` at the repo root) — the bundled
134
+ `Pikuri::VectorDb::LIBRARIAN` persona is opt-in via
135
+ `pikuri-subagents`:
78
136
 
79
137
  ```ruby
80
138
  require 'pikuri-subagents'
@@ -88,13 +146,13 @@ c.add_extension(
88
146
 
89
147
  ## Three model endpoints
90
148
 
91
- A full assistant setup wants three LLM endpoints: chat (via
92
- `ruby_llm`), an embedder (via `RubyLLM.embed`), and an optional
93
- reranker (HTTP `/v1/rerank`). Recommended setup: **one
94
- `llama-server` running in router mode** — started with no
95
- `--model` flag, it serves every GGUF in `~/.cache/llama.cpp/`
96
- from a single port and loads whichever model each request asks
97
- for. Requires a recent enough `llama.cpp` build to include the
149
+ A full setup wants three LLM endpoints: chat (via `ruby_llm`), an
150
+ embedder (via `RubyLLM.embed`), and an optional reranker (HTTP
151
+ `/v1/rerank`). Recommended setup: **one `llama-server` running in
152
+ router mode** — started with no `--model` flag, it serves every
153
+ GGUF in `~/.cache/llama.cpp/` from a single port and loads
154
+ whichever model each request asks for. Requires a recent enough
155
+ `llama.cpp` build to include the
98
156
  [model-management feature](https://huggingface.co/blog/ggml-org/model-management-in-llamacpp);
99
157
  Ubuntu 26.04+ packages one. The guide's
100
158
  [chapter 1](../docs/guide/01-chat.md) walks through the setup;
@@ -112,10 +170,24 @@ OpenAI-compatible endpoints and would also work, but pikuri's
112
170
  "small enough to audit" ethos keeps the recommended path on
113
171
  `llama.cpp` alone.
114
172
 
173
+ ## Install
174
+
175
+ ```ruby
176
+ # Gemfile
177
+ gem 'pikuri-vectordb'
178
+ ```
179
+
180
+ Depends on `pikuri-core`, `pikuri-subagents` (the `Persona` value
181
+ type `LIBRARIAN` is an instance of), and `listen` (filesystem
182
+ events for the `Watcher`; loaded only when a `Watcher` starts).
183
+
115
184
  ## Further reading
116
185
 
117
- - **Design notes:** `IDEAS.md` §"Vector DB / RAG" at the repo
118
- root.
186
+ - **Guide chapter:** [Agentic search and the vector
187
+ DB](../docs/guide/03-vectordb.md) — concepts, model setup, the
188
+ no-egress argument, `--qdrant --watch` day-to-day shape.
189
+ - **Design notes:** [`DESIGN.md`](DESIGN.md) — the Chroma-vs-Qdrant
190
+ engine survey.
119
191
  - **API reference:** browse the YARD docs at
120
192
  <https://rubydoc.info/gems/pikuri-vectordb> (once published),
121
193
  or run `bundle exec yard` in this directory for a local copy.
@@ -13,6 +13,14 @@ module Pikuri
13
13
  # on empty input + non-positive +top_k+. Where the two
14
14
  # diverge is the vector-dim contract — see below.
15
15
  #
16
+ # The client is hand-rolled rather than a dependency on a
17
+ # +chroma-db+ gem: only a handful of v2 endpoints are needed
18
+ # (listed below), Faraday is already in the dependency closure,
19
+ # and a thin first-party client keeps the wire protocol
20
+ # auditable in one readable file — consistent with the
21
+ # read-it-in-an-evening ceiling. The cost is tracking Chroma's
22
+ # v2 API by hand if it changes.
23
+ #
16
24
  # == Two ways to get one
17
25
  #
18
26
  # * **Bring your own.** +Backend::Chroma.new(host:, port:,
@@ -21,7 +29,7 @@ module Pikuri
21
29
  # already running on the host for an unrelated project).
22
30
  # The host owns the process; this class is purely the
23
31
  # HTTP client.
24
- # * **Let pikuri manage it.** {ChromaServer.ensure_running}
32
+ # * **Let pikuri manage it.** {Server::Chroma.ensure_running}
25
33
  # spawns and supervises a chroma container under the
26
34
  # +pikuri-internal-chroma+ name, against a pinned image,
27
35
  # with a bind-mounted volume in the user's cache dir.
@@ -46,6 +54,11 @@ module Pikuri
46
54
  # count.
47
55
  # * +DELETE /api/v2/.../collections/{id}+ — drop the
48
56
  # collection (used by +#delete_all+).
57
+ # * +POST /api/v2/.../collections/{id}/delete+ — metadata-
58
+ # filtered delete (+{where:}+); used by +#delete_by_source+.
59
+ # * +POST /api/v2/.../collections/{id}/get+ — fetch rows by
60
+ # +{where:}+ filter with an +include:+ projection; used by
61
+ # +#sources_with_hashes+.
49
62
  #
50
63
  # == BYO embeddings (not Chroma's embedder)
51
64
  #
@@ -110,6 +123,15 @@ module Pikuri
110
123
  # drift. Real-Chroma smoke testing is wired into the demo
111
124
  # binary in a later phase. Targets Chroma 0.5.x+ (v2 API).
112
125
  class Chroma
126
+ # Rows per +/get+ page in {#sources_with_hashes}. Caps the
127
+ # JSON burst + parse working set of the boot manifest read on
128
+ # a large corpus; small corpora finish in one page. Chunky but
129
+ # not arbitrary — one round trip per this-many *files*, and the
130
+ # manifest is one row per file (the +offset 0+ chunk), so a
131
+ # 50k-file corpus is ~50 localhost round trips instead of one
132
+ # multi-MB response.
133
+ MANIFEST_PAGE_SIZE = 1_000
134
+
113
135
  # @param host [String]
114
136
  # @param port [Integer]
115
137
  # @param collection [String] collection name in Chroma.
@@ -271,6 +293,121 @@ module Pikuri
271
293
  raise "Backend::Chroma: count response was not an Integer (got #{body.inspect})"
272
294
  end
273
295
 
296
+ # Remove every chunk whose +source+ matches, via a
297
+ # metadata-filtered +POST .../delete+ (+source+ is the
298
+ # reserved metadata key {#upsert} writes). The scoped
299
+ # counterpart to {#delete_all}. No-op when the collection
300
+ # doesn't exist yet.
301
+ #
302
+ # @param source [String] the {Chunk#source} to purge.
303
+ # @return [void]
304
+ # @raise [RuntimeError] on HTTP failure.
305
+ def delete_by_source(source)
306
+ return nil if @collection_id.nil? && !collection_exists?
307
+
308
+ post_json("#{collection_path}/delete", { where: { 'source' => source } })
309
+ nil
310
+ end
311
+
312
+ # Replace all chunks for one +source+: delete the old set,
313
+ # then upsert the new one. The incremental-reindex unit
314
+ # (see {Indexer#reindex_file!}).
315
+ #
316
+ # == Not transactional (the InMemory divergence)
317
+ #
318
+ # These are two HTTP calls, so a +#query+ landing between
319
+ # them can see the source with zero chunks — a window
320
+ # {InMemory#replace_source} closes with its monitor but
321
+ # Chroma cannot, short of server-side transactions it
322
+ # doesn't expose. The window is small and the
323
+ # {Indexer} mitigates the *common* failure: it embeds
324
+ # before calling here, so an embedder outage never reaches
325
+ # this method and the old chunks stay put. Delete-then-upsert
326
+ # (not the reverse): upserting first then deleting by source
327
+ # would delete the just-written chunks.
328
+ #
329
+ # @param source [String] the {Chunk#source} being replaced.
330
+ # @param chunks [Array<Chunk>] the new chunk set.
331
+ # @param vectors [Array<Array<Float>>] parallel to +chunks+.
332
+ # @return [void]
333
+ # @raise [ArgumentError] on empty input or length mismatch.
334
+ # @raise [RuntimeError] on HTTP failure.
335
+ def replace_source(source:, chunks:, vectors:)
336
+ delete_by_source(source)
337
+ upsert(chunks: chunks, vectors: vectors)
338
+ nil
339
+ end
340
+
341
+ # The boot-sweep reference: +source+ → stored content hash
342
+ # for every indexed document. Reads one metadata row per
343
+ # *file*, not per chunk, via three Chroma +/get+ knobs:
344
+ #
345
+ # * +where: { offset: 0 }+ — every file has exactly one
346
+ # chunk at offset 0, so this returns one row per source.
347
+ # * +include: ['metadatas']+ — drops the heavy +embeddings+
348
+ # and +documents+ from the response; we pull only the
349
+ # metadata projection, never the vectors.
350
+ # * +limit+ / +offset+ — page the read in
351
+ # {MANIFEST_PAGE_SIZE} chunks so a large corpus never
352
+ # materializes one multi-MB response. (Two unrelated
353
+ # +offset+s collide in the wording: the +where+ +offset+ is
354
+ # a *chunk* metadata field; the top-level +offset+ is the
355
+ # *pagination* cursor — different namespaces in the API.)
356
+ #
357
+ # Pagination assumes the manifest isn't mutating mid-read; the
358
+ # {Watcher} drives this from its single worker thread, so no
359
+ # reindex runs concurrently with the boot sweep that calls it.
360
+ #
361
+ # @return [Hash{String => String, nil}] +source+ → content
362
+ # hash. Empty when the collection doesn't exist yet.
363
+ # @raise [RuntimeError] on HTTP failure.
364
+ def sources_with_hashes
365
+ return {} if @collection_id.nil? && !collection_exists?
366
+
367
+ result = {}
368
+ cursor = 0
369
+ loop do
370
+ body = post_json("#{collection_path}/get", {
371
+ where: { 'offset' => 0 },
372
+ include: ['metadatas'],
373
+ limit: MANIFEST_PAGE_SIZE,
374
+ offset: cursor
375
+ })
376
+ metas = body.is_a?(Hash) ? (body['metadatas'] || []) : []
377
+ metas.each do |meta|
378
+ next unless meta.is_a?(Hash) && meta['source']
379
+
380
+ result[meta['source']] = meta['hash']
381
+ end
382
+ break if metas.size < MANIFEST_PAGE_SIZE
383
+
384
+ cursor += metas.size
385
+ end
386
+ result
387
+ end
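
The paging loop in `sources_with_hashes` can be exercised without a Chroma server. The sketch below is an illustration under assumptions: `fetch_page` is a hypothetical callable standing in for the `POST .../get` request, and `collect_manifest` is not a method of the gem.

```ruby
# Hypothetical helper mirroring the loop above: keep requesting pages until a
# short page signals the end. `fetch_page` stands in for the HTTP call; it
# must accept limit: and offset: and return an Array of metadata Hashes.
def collect_manifest(page_size:, fetch_page:)
  result = {}
  cursor = 0
  loop do
    metas = fetch_page.call(limit: page_size, offset: cursor) || []
    metas.each do |meta|
      next unless meta.is_a?(Hash) && meta['source']

      result[meta['source']] = meta['hash']
    end
    break if metas.size < page_size # a short (or empty) page is the last one

    cursor += metas.size
  end
  result
end

# Fake "server": five rows, served in slices.
rows = (1..5).map { |i| { 'source' => "notes/#{i}.md", 'hash' => "h#{i}" } }
calls = 0
fake = lambda do |limit:, offset:|
  calls += 1
  rows[offset, limit] || []
end

manifest = collect_manifest(page_size: 2, fetch_page: fake)
# Five rows at two per page take three requests (2 + 2 + 1).
```

One consequence of the short-page stop condition: when the row count is an exact multiple of the page size, the loop issues one extra request that returns an empty page before it stops.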
388
+
389
+ # Is +source+ in the corpus? Scoped existence check for
390
+ # {VectorDb::Tools::Read}'s membership gate: a +where+-filtered
391
+ # +/get+ capped at one row, +include: []+ so the response
392
+ # carries only ids — O(1) transport regardless of corpus
393
+ # size, never the full {#sources_with_hashes} manifest. See
394
+ # the Backend protocol yardoc.
395
+ #
396
+ # @param source [String] the {Chunk#source} to test.
397
+ # @return [Boolean] true if at least one chunk has this source.
398
+ # @raise [RuntimeError] on HTTP failure.
399
+ def source_indexed?(source)
400
+ return false if @collection_id.nil? && !collection_exists?
401
+
402
+ body = post_json("#{collection_path}/get", {
403
+ where: { 'source' => source },
404
+ include: [],
405
+ limit: 1
406
+ })
407
+ ids = body.is_a?(Hash) ? (body['ids'] || []) : []
408
+ !ids.empty?
409
+ end
410
+
274
411
  private
275
412
 
276
413
  def collections_path
@@ -1,12 +1,13 @@
1
1
  # frozen_string_literal: true
2
2
 
3
+ require 'monitor'
4
+
3
5
  module Pikuri
4
6
  module VectorDb
5
7
  module Backend
6
- # Pure-Ruby vector store. The educational default backend;
7
- # IDEAS.md §"Vector DB / RAG" frames it as the "small enough
8
- # to audit" first stop the demo + guide walk through before
9
- # promoting users to +Chroma+ for persistence.
8
+ # Pure-Ruby vector store. The educational default backend:
9
+ # the "small enough to audit" first stop the demo + guide walk
10
+ # through before promoting users to +Chroma+ for persistence.
10
11
  #
11
12
  # == What it does
12
13
  #
@@ -30,9 +31,23 @@ module Pikuri
30
31
  # * **No approximate search.** Exhaustive scan. Approximate
31
32
  # nearest neighbor (HNSW, IVF) adds complexity that doesn't
32
33
  # teach anything additional once the cosine math is clear.
33
- # * **No thread safety.** {Indexer} runs single-threaded
34
- # during a boot or reindex; {Search} calls +#query+ from
35
- # the agent's main thread. No concurrent access today.
34
+ # * **No approximate-search index.** Exhaustive scan only.
35
+ #
36
+ # == Thread safety
37
+ #
38
+ # Every public method runs under a single reentrant
39
+ # +Monitor+. The agent's main thread calls +#query+ while a
40
+ # background {Watcher} thread calls +#replace_source+ /
41
+ # +#delete_by_source+, so concurrent access is real once
42
+ # auto-watch is wired. The lock's load-bearing job is
43
+ # {#replace_source}: it holds the monitor across the
44
+ # delete-then-upsert so a concurrent +#query+ never observes
45
+ # the gap where a source has zero chunks. +Monitor+ (not a
46
+ # bare +Mutex+) because +#replace_source+ re-enters the lock
47
+ # via +#delete_by_source+ + +#upsert+, which a non-reentrant
48
+ # +Mutex+ would deadlock on. +Chroma+ needs no client-side
49
+ # lock — the server serializes — so this is the one backend
50
+ # that locks.
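
The `Monitor`-versus-`Mutex` point can be checked in isolation. This is a standalone sketch using only Ruby's standard library; `TinyStore` is a hypothetical class, not part of the gem.

```ruby
require 'monitor'

# Hypothetical miniature of the locking shape: replace calls two other
# public methods that each take the same lock.
class TinyStore
  def initialize(lock)
    @lock = lock
    @items = {}
  end

  def delete(key)
    @lock.synchronize { @items.delete(key) }
    nil
  end

  def put(key, value)
    @lock.synchronize { @items[key] = value }
    nil
  end

  # Holds the lock across both steps, re-entering it in delete and put.
  def replace(key, value)
    @lock.synchronize do
      delete(key)
      put(key, value)
    end
    nil
  end

  def get(key)
    @lock.synchronize { @items[key] }
  end
end

reentrant = TinyStore.new(Monitor.new)
reentrant.replace(:a, 1) # fine: Monitor lets the owning thread re-enter

plain = TinyStore.new(Mutex.new)
mutex_error = nil
begin
  plain.replace(:a, 1)
rescue ThreadError => e
  # Ruby detects the same thread locking a Mutex twice and raises
  # instead of hanging.
  mutex_error = e
end
```

The `Mutex` variant fails on the inner `delete` call, before anything is written, which is why the reentrant lock is needed for a method composed from other locked methods.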
36
51
  #
37
52
  # == Cosine, not dot product
38
53
  #
@@ -53,6 +68,10 @@ module Pikuri
53
68
  # enforced for every subsequent +#upsert+ + +#query+ — see
54
69
  # the Backend protocol's "Vector-dim contract" yardoc.
55
70
  @dim = nil
71
+ # Reentrant so +#replace_source+ can call +#delete_by_source+
72
+ # + +#upsert+ while holding the lock — see the class yardoc's
73
+ # "Thread safety" section.
74
+ @lock = Monitor.new
56
75
  end
57
76
 
58
77
  # Insert-or-replace by +chunk.id+. Parallel arrays of
@@ -71,15 +90,17 @@ module Pikuri
71
90
  raise ArgumentError, "size mismatch: #{chunks.size} chunks vs #{vectors.size} vectors"
72
91
  end
73
92
 
74
- expected = @dim || vectors.first.size
75
- vectors.each_with_index do |v, i|
76
- next if v.size == expected
93
+ @lock.synchronize do
94
+ expected = @dim || vectors.first.size
95
+ vectors.each_with_index do |v, i|
96
+ next if v.size == expected
77
97
 
78
- raise ArgumentError, "vector #{i} has dim #{v.size}, expected #{expected}"
79
- end
80
- @dim ||= expected
98
+ raise ArgumentError, "vector #{i} has dim #{v.size}, expected #{expected}"
99
+ end
100
+ @dim ||= expected
81
101
 
82
- chunks.zip(vectors).each { |chunk, vector| @entries[chunk.id] = [chunk, vector] }
102
+ chunks.zip(vectors).each { |chunk, vector| @entries[chunk.id] = [chunk, vector] }
103
+ end
83
104
  nil
84
105
  end
85
106
 
@@ -96,16 +117,19 @@ module Pikuri
96
117
  # dim mismatch.
97
118
  def query(vector:, top_k:)
98
119
  raise ArgumentError, "top_k must be positive (got #{top_k})" if top_k <= 0
99
- return [] if @entries.empty?
100
120
 
101
- if vector.size != @dim
102
- raise ArgumentError, "query vector dim #{vector.size}, stored dim #{@dim}"
103
- end
121
+ @lock.synchronize do
122
+ return [] if @entries.empty?
104
123
 
105
- scored = @entries.values.map do |chunk, stored|
106
- Result.new(chunk: chunk, score: cosine(vector, stored))
124
+ if vector.size != @dim
125
+ raise ArgumentError, "query vector dim #{vector.size}, stored dim #{@dim}"
126
+ end
127
+
128
+ scored = @entries.values.map do |chunk, stored|
129
+ Result.new(chunk: chunk, score: cosine(vector, stored))
130
+ end
131
+ scored.sort_by { |r| -r.score }.first(top_k)
107
132
  end
108
- scored.sort_by { |r| -r.score }.first(top_k)
109
133
  end
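
For readers who want the scoring step on its own, here is a minimal sketch of exhaustive cosine top-k over `Array<Float>`. The helper names (`cosine_similarity`, `top_k_by_cosine`) are hypothetical and the zero-vector handling is an assumption; the gem's private `cosine` may differ.

```ruby
# Cosine similarity of two equal-length Float arrays.
# Assumption: an all-zero vector scores 0.0 rather than NaN.
def cosine_similarity(a, b)
  raise ArgumentError, "dim mismatch: #{a.size} vs #{b.size}" unless a.size == b.size

  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |y| y * y })
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

# Exhaustive scan: score every stored vector, sort descending, keep top_k.
# `entries` maps an id to its vector.
def top_k_by_cosine(entries, query, top_k)
  raise ArgumentError, "top_k must be positive (got #{top_k})" if top_k <= 0

  entries
    .map { |id, vector| [id, cosine_similarity(query, vector)] }
    .sort_by { |_id, score| -score }
    .first(top_k)
end

entries = {
  'east' => [1.0, 0.0],
  'north' => [0.0, 1.0],
  'northeast' => [1.0, 1.0]
}
ranked = top_k_by_cosine(entries, [2.0, 0.0], 2)
# ranks 'east' first (same direction as the query), then 'northeast'
```

The query `[2.0, 0.0]` is twice as long as `east` but still scores 1.0 against it, since cosine compares direction and ignores magnitude.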
110
134
 
111
135
  # Drop every stored chunk. Used by the v1 nuke-and-reload
@@ -114,14 +138,89 @@ module Pikuri
114
138
  #
115
139
  # @return [void]
116
140
  def delete_all
117
- @entries.clear
118
- @dim = nil
141
+ @lock.synchronize do
142
+ @entries.clear
143
+ @dim = nil
144
+ end
119
145
  nil
120
146
  end
121
147
 
122
148
  # @return [Integer] current chunk count.
123
149
  def count
124
- @entries.size
150
+ @lock.synchronize { @entries.size }
151
+ end
152
+
153
+ # Remove every chunk whose +source+ matches. The scoped
154
+ # counterpart to {#delete_all} — drops one document's chunks
155
+ # without touching the rest. No-op (and no error) when the
156
+ # source isn't present. The dim lock is left intact: unlike
157
+ # {#delete_all}, a per-source delete doesn't imply an
158
+ # embedder change.
159
+ #
160
+ # @param source [String] the {Chunk#source} to purge, e.g.
161
+ # +"notes/cooking.md"+.
162
+ # @return [void]
163
+ def delete_by_source(source)
164
+ @lock.synchronize do
165
+ @entries.reject! { |_id, (chunk, _vector)| chunk.source == source }
166
+ end
167
+ nil
168
+ end
169
+
170
+ # Atomically replace all chunks for one +source+: delete the
171
+ # old set, then upsert the new one, under a single hold of the
172
+ # monitor. The incremental-reindex unit (see {Indexer#reindex_file!}).
173
+ # Holding the lock across both halves is the point — a
174
+ # concurrent {#query} sees either the old chunks or the new
175
+ # ones, never the empty gap between.
176
+ #
177
+ # @param source [String] the {Chunk#source} being replaced.
178
+ # @param chunks [Array<Chunk>] the new chunk set; every
179
+ # +chunk.source+ should equal +source+.
180
+ # @param vectors [Array<Array<Float>>] parallel to +chunks+.
181
+ # @return [void]
182
+ # @raise [ArgumentError] on empty input, length mismatch, or
183
+ # vector-dim mismatch (from the inner {#upsert}).
184
+ def replace_source(source:, chunks:, vectors:)
185
+ @lock.synchronize do
186
+ delete_by_source(source)
187
+ upsert(chunks: chunks, vectors: vectors)
188
+ end
189
+ nil
190
+ end
191
+
192
+ # The boot-sweep reference: a map from each indexed +source+
193
+ # to the content hash stored on its chunks. {Watcher} (via
194
+ # {Indexer#reconcile_plan}) diffs this against the hashes of
195
+ # the files currently on disk to decide what to reindex.
196
+ # Built from chunk metadata; a chunk indexed before the
197
+ # +hash+ metadata existed maps its source to +nil+, which the
198
+ # diff treats as "changed" and reindexes — self-healing.
199
+ #
200
+ # @return [Hash{String => String, nil}] +source+ → content
201
+ # hash. Empty when nothing is indexed (the InMemory case at
202
+ # every boot, since RAM resets).
203
+ def sources_with_hashes
204
+ @lock.synchronize do
205
+ result = {}
206
+ @entries.each_value do |chunk, _vector|
207
+ result[chunk.source] ||= chunk.metadata[:hash]
208
+ end
209
+ result
210
+ end
211
+ end
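
The diff that `sources_with_hashes` feeds can be sketched as a pure function. This is an assumption about the shape of the comparison, not the gem's `Indexer#reconcile_plan`: `plan_reconcile` and the `:reindex` / `:delete` keys are hypothetical names, and SHA-256 is chosen here only for illustration.

```ruby
require 'digest'

# Hypothetical sketch: compare stored hashes against hashes of files on disk.
# indexed: source => stored hash (nil when the chunk predates hash metadata)
# on_disk: source => current content hash
def plan_reconcile(indexed:, on_disk:)
  # New files, changed files, and nil-hash sources all count as "changed".
  reindex = on_disk.select { |source, hash| indexed[source] != hash }.keys
  # Sources still in the index whose file is gone.
  delete = indexed.keys - on_disk.keys
  { reindex: reindex.sort, delete: delete.sort }
end

h = ->(text) { Digest::SHA256.hexdigest(text) }
indexed = {
  'a.md' => h.call('old'),
  'b.md' => h.call('same'),
  'c.md' => nil,
  'gone.md' => h.call('x')
}
on_disk = {
  'a.md' => h.call('new'),
  'b.md' => h.call('same'),
  'c.md' => h.call('c'),
  'd.md' => h.call('d')
}

plan = plan_reconcile(indexed: indexed, on_disk: on_disk)
# plan[:reindex] holds a.md (changed), c.md (nil hash), d.md (new);
# plan[:delete] holds gone.md.
```

Because a `nil` stored hash never equals a real hash, sources indexed before the `hash` metadata existed fall into the reindex set without special-casing, which matches the self-healing behavior the comment describes.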
212
+
213
+ # Is +source+ in the corpus? The scoped membership test
214
+ # behind {VectorDb::Tools::Read}'s gate — a short-circuiting scan
215
+ # rather than building the whole {#sources_with_hashes} map
216
+ # just to read one key. See the Backend protocol yardoc.
217
+ #
218
+ # @param source [String] the {Chunk#source} to test.
219
+ # @return [Boolean] true if at least one chunk has this source.
220
+ def source_indexed?(source)
221
+ @lock.synchronize do
222
+ @entries.each_value.any? { |chunk, _vector| chunk.source == source }
223
+ end
125
224
  end
126
225
 
127
226
  private