pikuri-vectordb 0.0.4 → 0.0.5
- checksums.yaml +4 -4
- data/README.md +144 -72
- data/lib/pikuri/vector_db/backend/chroma.rb +138 -1
- data/lib/pikuri/vector_db/backend/in_memory.rb +123 -24
- data/lib/pikuri/vector_db/backend/qdrant.rb +446 -0
- data/lib/pikuri/vector_db/backend/result.rb +3 -3
- data/lib/pikuri/vector_db/backend.rb +37 -7
- data/lib/pikuri/vector_db/chunk.rb +15 -10
- data/lib/pikuri/vector_db/extension.rb +59 -57
- data/lib/pikuri/vector_db/indexer.rb +202 -14
- data/lib/pikuri/vector_db/librarian.rb +27 -19
- data/lib/pikuri/vector_db/reranker/llama_server.rb +1 -1
- data/lib/pikuri/vector_db/reranker.rb +3 -4
- data/lib/pikuri/vector_db/server/chroma.rb +184 -0
- data/lib/pikuri/vector_db/server/docker_container.rb +266 -0
- data/lib/pikuri/vector_db/server/in_memory.rb +98 -0
- data/lib/pikuri/vector_db/server/qdrant.rb +177 -0
- data/lib/pikuri/vector_db/server.rb +35 -0
- data/lib/pikuri/vector_db/tools/read.rb +206 -0
- data/lib/pikuri/vector_db/tools/reindex.rb +94 -0
- data/lib/pikuri/vector_db/tools/search.rb +202 -0
- data/lib/pikuri/vector_db/tools.rb +19 -0
- data/lib/pikuri/vector_db/watcher.rb +353 -0
- data/lib/pikuri-vectordb.rb +13 -8
- data/prompts/persona-librarian.txt +5 -4
- data/prompts/pikuri-corpus.txt +21 -0
- metadata +35 -12
- data/lib/pikuri/vector_db/chroma_server.rb +0 -309
- data/lib/pikuri/vector_db/reindex.rb +0 -86
- data/lib/pikuri/vector_db/search.rb +0 -201
- data/prompts/pikuri-librarian.txt +0 -22
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 2560026784f6f00e160c090718c8896835b129fb6d81e4efa1570917aa2414d5
+  data.tar.gz: 22e7054b58ae018b1b4afcde583c70ad979aa4e859348892d3a0640e0541c288
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 9c957518955fa72da1d998f041422a32b78301e488c3a5e93eb13cda4a779a81ae83b3e9ddd3041c5972973b9684c039760f77fd6c7f82f2e94a177e361d66b8
+  data.tar.gz: 420446ae9178260b0cc9e0f4fa2950c9330a74c298e83aad9b1d76e73809f67a99b9cd460a177353f63560871b2bf412fd5d3d3e7f9d22360a458946e538aabb
|
data/README.md
CHANGED
|
@@ -1,80 +1,138 @@
|
|
|
1
1
|
# pikuri-vectordb
|
|
2
2
|
|
|
3
3
|
Local-corpus vector search + agentic RAG for the
|
|
4
|
-
[pikuri](https://codeberg.org/mvysny/pikuri) AI-assistant toolkit
|
|
5
|
-
|
|
6
|
-
|
|
7
|
-
|
|
8
|
-
|
|
9
|
-
> RAG" for the design.
|
|
10
|
-
|
|
11
|
-
Will provide:
|
|
12
|
-
|
|
13
|
-
- `Pikuri::VectorDb::Extension` — wires a `vectordb_search` tool +
|
|
14
|
-
a `vectordb_reindex` tool onto a `Pikuri::Agent` via
|
|
15
|
-
`c.add_extension(...)` inside the `Agent.new` block.
|
|
16
|
-
- `Pikuri::VectorDb::Backend::InMemory` — pure-Ruby cosine over
|
|
17
|
-
`Array<Float>`. The educational default; reads in ~40 lines.
|
|
18
|
-
RAM-only; everything reloads from sources on every boot.
|
|
19
|
-
- `Pikuri::VectorDb::Backend::Chroma` — thin Faraday HTTP client
|
|
20
|
-
against a self-hosted ChromaDB. The persistent option.
|
|
21
|
-
- `Pikuri::VectorDb::Embedder` — thin wrapper over `RubyLLM.embed`
|
|
22
|
-
so tests can inject a fake without monkey-patching ruby_llm.
|
|
23
|
-
- `Pikuri::VectorDb::Reranker::LlamaServer` — optional quality
|
|
24
|
-
knob. Speaks `/v1/rerank` against a cross-encoder model on a
|
|
25
|
-
llama.cpp server. Passing `reranker: nil` to the extension
|
|
26
|
-
skips reranking; retrieval falls back to vector-only top-k.
|
|
27
|
-
- `Pikuri::VectorDb::Chunker::FixedWindow` + `Tokenizer::*` —
|
|
28
|
-
the chunking pipeline. Tokenizer is a duck-typed protocol
|
|
29
|
-
(`count(text) -> Integer`) with two impls in v1:
|
|
30
|
-
`Tokenizer::CharHeuristic` (default, ~4 chars/token rule) and
|
|
31
|
-
`Tokenizer::LlamaServer` (POST `/tokenize` against the
|
|
32
|
-
embedder's endpoint).
|
|
33
|
-
- Text extraction reuses `Pikuri::FileType.read_as_text` from
|
|
34
|
-
pikuri-core — plain text / Markdown / PDF. HTML extraction
|
|
35
|
-
is a deferred follow-up; v1 corpora skew toward Markdown
|
|
36
|
-
notes and PDF docs in practice.
|
|
37
|
-
- `Pikuri::VectorDb::LIBRARIAN` — bundled
|
|
38
|
-
`Pikuri::SubAgent::Persona` constant. Hosts wire it via
|
|
39
|
-
`SubAgent::Extension.new(personas: [..., LIBRARIAN])` — same
|
|
40
|
-
shape `pikuri-code` uses for `GIT_REPO_RESEARCHER`.
|
|
4
|
+
[pikuri](https://codeberg.org/mvysny/pikuri) AI-assistant toolkit:
|
|
5
|
+
semantic recall over a pile of files you point it at — your notes,
|
|
6
|
+
your docs, your contracts — where the *agent* decides when to
|
|
7
|
+
retrieve, same Thought → Tool-call → Observation loop as every other
|
|
8
|
+
tool.
|
|
41
9
|
|
|
42
|
-
|
|
10
|
+
Wire it onto a `pikuri-core` agent the same way as `pikuri-tasks` /
|
|
11
|
+
`pikuri-memory` — `c.add_extension` inside the `Agent.new` block:
|
|
43
12
|
|
|
44
13
|
```ruby
|
|
45
|
-
|
|
46
|
-
|
|
14
|
+
require 'pikuri-vectordb'
|
|
15
|
+
|
|
16
|
+
Pikuri::Agent.new(transport: ..., system_prompt: ...) do |c|
|
|
17
|
+
c.add_extension Pikuri::VectorDb::Extension.new(
|
|
18
|
+
backend: Pikuri::VectorDb::Backend::InMemory.new,
|
|
19
|
+
source: '~/notes'
|
|
20
|
+
)
|
|
21
|
+
end
|
|
47
22
|
```
|
|
48
23
|
|
|
49
|
-
##
|
|
24
|
+
## What you get
|
|
25
|
+
|
|
26
|
+
Three tools, registered by the extension:
|
|
27
|
+
|
|
28
|
+
1. **`vectordb_search`** — embeds the query, pulls the top-k nearest
|
|
29
|
+
chunks from the backend, optionally reranks them with a
|
|
30
|
+
cross-encoder, and hands the agent a numbered list of
|
|
31
|
+
`source (score=…)` snippets as its next observation.
|
|
32
|
+
2. **`vectordb_read`** — parent-document retrieval: when a search
|
|
33
|
+
surfaces a clean hit, the agent reads that whole document by its
|
|
34
|
+
`source` path instead of re-querying for more fragments of it.
|
|
35
|
+
3. **`vectordb_reindex`** — rebuilds the index from the source, on
|
|
36
|
+
request.
|
|
37
|
+
|
|
38
|
+
The extension registers the tools and nothing else — *populating*
|
|
39
|
+
the index is the host's call, never something done behind your
|
|
40
|
+
back. Three equally valid shapes:
|
|
41
|
+
|
|
42
|
+
- Index at boot: `extension.indexer.index_if_empty!`.
|
|
43
|
+
- Keep it live: run a `Pikuri::VectorDb::Watcher` around
|
|
44
|
+
`extension.indexer` — a filesystem-event daemon (the `listen`
|
|
45
|
+
gem) that sweeps once on boot and reindexes files as they change.
|
|
46
|
+
- Leave it empty and let the user drive: the agent calls
|
|
47
|
+
`vectordb_reindex` when asked.
|
|
48
|
+
|
|
49
|
+
## Backends
|
|
50
|
+
|
|
51
|
+
Three implementations of one duck-typed interface
|
|
52
|
+
(`#upsert` / `#query` / `#delete_all` / `#count`) — swapping is a
|
|
53
|
+
one-line change:
|
|
54
|
+
|
|
55
|
+
- `Backend::InMemory` — the educational default. Pure-Ruby cosine
|
|
56
|
+
over `Array<Float>`, ~40 lines, reads in one sitting. RAM-only:
|
|
57
|
+
everything reloads from sources on every boot.
|
|
58
|
+
- `Backend::Qdrant` — thin Faraday HTTP client against a
|
|
59
|
+
self-hosted [Qdrant](https://qdrant.tech). **The recommended
|
|
60
|
+
persistent backend** — [`DESIGN.md`](DESIGN.md) has the engine
|
|
61
|
+
survey behind the pick.
|
|
62
|
+
- `Backend::Chroma` — the supported
|
|
63
|
+
[ChromaDB](https://www.trychroma.com) alternative, identical
|
|
64
|
+
wiring.
|
|
65
|
+
|
|
66
|
+
Each persistent engine pairs with a `Server::*` supervisor that
|
|
67
|
+
runs it as a self-managed docker container: pinned image, a
|
|
68
|
+
container name pikuri owns (`pikuri-internal-qdrant` /
|
|
69
|
+
`pikuri-internal-chroma`), data bind-mounted under
|
|
70
|
+
`~/.cache/pikuri/` so the corpus survives container recreation.
|
|
50
71
|
|
|
51
72
|
```ruby
|
|
52
|
-
|
|
53
|
-
|
|
73
|
+
# Supervised container (needs docker on PATH):
|
|
74
|
+
backend = Pikuri::VectorDb::Server::Qdrant.ensure_running.client(
|
|
75
|
+
collection: 'my-docs'
|
|
76
|
+
)
|
|
54
77
|
|
|
55
|
-
|
|
56
|
-
|
|
57
|
-
|
|
58
|
-
|
|
59
|
-
|
|
60
|
-
|
|
61
|
-
|
|
62
|
-
|
|
63
|
-
|
|
64
|
-
|
|
65
|
-
|
|
66
|
-
|
|
67
|
-
|
|
78
|
+
# Or point at a Qdrant you already run:
|
|
79
|
+
backend = Pikuri::VectorDb::Backend::Qdrant.new(
|
|
80
|
+
host: 'localhost', port: 6333, collection: 'my-docs'
|
|
81
|
+
)
|
|
82
|
+
```
|
|
83
|
+
|
|
84
|
+
Collection naming is engine-specific so it lives on the backend
|
|
85
|
+
constructor, not on the Extension — `Backend::InMemory` has no
|
|
86
|
+
collection concept.
|
|
87
|
+
|
|
88
|
+
## The indexing pipeline
|
|
89
|
+
|
|
90
|
+
What `vectordb_reindex` (and the `Watcher`) actually runs, piece by
|
|
91
|
+
piece — each swappable via the Extension's keyword arguments:
|
|
92
|
+
|
|
93
|
+
- **Chunker** (`Chunker::FixedWindow`) — overlapping windows,
|
|
94
|
+
default 512 tokens with 50 of overlap, so an answer straddling a
|
|
95
|
+
boundary survives in at least one chunk.
|
|
96
|
+
- **Tokenizer** (`Tokenizer::CharHeuristic` default /
|
|
97
|
+
`Tokenizer::LlamaServer`) — counts tokens for the chunker; the
|
|
98
|
+
heuristic is the offline ~4-chars-per-token rule, the
|
|
99
|
+
`LlamaServer` variant asks the embedder's `/tokenize` endpoint
|
|
100
|
+
for an exact count.
|
|
101
|
+
- **Embedder** — thin wrapper over `RubyLLM.embed`; tests inject a
|
|
102
|
+
fake `#embed` without monkey-patching ruby_llm.
|
|
103
|
+
- **Reranker** (`Reranker::LlamaServer`, optional) — cross-encoder
|
|
104
|
+
over `POST /v1/rerank`. Pass `reranker: nil` to skip it;
|
|
105
|
+
retrieval falls back to vector-only top-k — less precision, same
|
|
106
|
+
correctness.
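To make the Chunker and Tokenizer bullets above concrete, here is a minimal sketch of overlapping fixed windows sized with the ~4-chars-per-token heuristic. It is not the gem's `Chunker::FixedWindow`: the method names `estimate_tokens` and `fixed_windows`, the character-based slicing, and the `[offset, text]` return shape are all assumptions made for illustration.

```ruby
# Hypothetical sketch; not the gem's Chunker::FixedWindow or Tokenizer::CharHeuristic.

# Estimate a token count from text length, rounding up.
def estimate_tokens(text, chars_per_token: 4)
  (text.length / chars_per_token.to_f).ceil
end

# Split +text+ into overlapping windows. Sizes are given in estimated
# tokens and converted to characters with the same heuristic.
# Returns an array of [char_offset, window_text] pairs.
def fixed_windows(text, size: 512, overlap: 50, chars_per_token: 4)
  raise ArgumentError, 'overlap must not be negative' if overlap.negative?
  raise ArgumentError, 'overlap must be smaller than size' unless overlap < size

  window = size * chars_per_token
  step = (size - overlap) * chars_per_token
  chunks = []
  offset = 0
  while offset < text.length
    chunks << [offset, text[offset, window]]
    break if offset + window >= text.length # this window reached the end

    offset += step
  end
  chunks
end

text = 'pikuri ' * 800
fixed_windows(text).each do |offset, body|
  puts "offset=#{offset} tokens~#{estimate_tokens(body)}"
end
```

Each window starts `size - overlap` tokens after the previous one, so the last 50 tokens of one window reappear at the start of the next. That repetition is what lets an answer straddling a boundary survive whole in at least one chunk.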
|
|
107
|
+
|
|
108
|
+
Text extraction reuses `Pikuri::FileType.read_as_text` from
|
|
109
|
+
pikuri-core — plain text / Markdown / PDF. HTML extraction is a
|
|
110
|
+
deferred follow-up.
|
|
111
|
+
|
|
112
|
+
## Demo: `pikuri-corpus`
|
|
113
|
+
|
|
114
|
+
From a source checkout (not installed by `gem install`):
|
|
115
|
+
|
|
116
|
+
```sh
|
|
117
|
+
./pikuri-vectordb/bin/pikuri-corpus --qdrant --watch
|
|
68
118
|
```
|
|
69
119
|
|
|
70
|
-
|
|
71
|
-
|
|
72
|
-
`
|
|
120
|
+
A single recall agent over `docs/guide/` (the pikuri guide itself)
|
|
121
|
+
with **no egress** — its tools are the three above plus
|
|
122
|
+
`calculator`; no web search, no fetch, no bash. The corpus stands
|
|
123
|
+
in for private data, and an agent that can read it must not also be
|
|
124
|
+
able to send it out. `--qdrant` / `--chroma` persist the index
|
|
125
|
+
across runs, `--watch` keeps it live, `--no-reranker` drops the
|
|
126
|
+
reranker requirement. The guide's
|
|
127
|
+
[chapter 3](../docs/guide/03-vectordb.md) is the full walkthrough.
|
|
73
128
|
|
|
74
|
-
|
|
75
|
-
|
|
76
|
-
|
|
77
|
-
|
|
129
|
+
## The LIBRARIAN persona
|
|
130
|
+
|
|
131
|
+
For hosts that want recall behind a privilege-separated sub-agent —
|
|
132
|
+
the right shape once the *parent* agent has egress (see
|
|
133
|
+
`SECURITY.md` at the repo root) — the bundled
|
|
134
|
+
`Pikuri::VectorDb::LIBRARIAN` persona is opt-in via
|
|
135
|
+
`pikuri-subagents`:
|
|
78
136
|
|
|
79
137
|
```ruby
|
|
80
138
|
require 'pikuri-subagents'
|
|
@@ -88,13 +146,13 @@ c.add_extension(
|
|
|
88
146
|
|
|
89
147
|
## Three model endpoints
|
|
90
148
|
|
|
91
|
-
A full
|
|
92
|
-
|
|
93
|
-
|
|
94
|
-
|
|
95
|
-
|
|
96
|
-
|
|
97
|
-
|
|
149
|
+
A full setup wants three LLM endpoints: chat (via `ruby_llm`), an
|
|
150
|
+
embedder (via `RubyLLM.embed`), and an optional reranker (HTTP
|
|
151
|
+
`/v1/rerank`). Recommended setup: **one `llama-server` running in
|
|
152
|
+
router mode** — started with no `--model` flag, it serves every
|
|
153
|
+
GGUF in `~/.cache/llama.cpp/` from a single port and loads
|
|
154
|
+
whichever model each request asks for. Requires a recent enough
|
|
155
|
+
`llama.cpp` build to include the
|
|
98
156
|
[model-management feature](https://huggingface.co/blog/ggml-org/model-management-in-llamacpp);
|
|
99
157
|
Ubuntu 26.04+ packages one. The guide's
|
|
100
158
|
[chapter 1](../docs/guide/01-chat.md) walks through the setup;
|
|
@@ -112,10 +170,24 @@ OpenAI-compatible endpoints and would also work, but pikuri's
|
|
|
112
170
|
"small enough to audit" ethos keeps the recommended path on
|
|
113
171
|
`llama.cpp` alone.
|
|
114
172
|
|
|
173
|
+
## Install
|
|
174
|
+
|
|
175
|
+
```ruby
|
|
176
|
+
# Gemfile
|
|
177
|
+
gem 'pikuri-vectordb'
|
|
178
|
+
```
|
|
179
|
+
|
|
180
|
+
Depends on `pikuri-core`, `pikuri-subagents` (the `Persona` value
|
|
181
|
+
type `LIBRARIAN` is an instance of), and `listen` (filesystem
|
|
182
|
+
events for the `Watcher`; loaded only when a `Watcher` starts).
|
|
183
|
+
|
|
115
184
|
## Further reading
|
|
116
185
|
|
|
117
|
-
- **
|
|
118
|
-
|
|
186
|
+
- **Guide chapter:** [Agentic search and the vector
|
|
187
|
+
DB](../docs/guide/03-vectordb.md) — concepts, model setup, the
|
|
188
|
+
no-egress argument, `--qdrant --watch` day-to-day shape.
|
|
189
|
+
- **Design notes:** [`DESIGN.md`](DESIGN.md) — the Chroma-vs-Qdrant
|
|
190
|
+
engine survey.
|
|
119
191
|
- **API reference:** browse the YARD docs at
|
|
120
192
|
<https://rubydoc.info/gems/pikuri-vectordb> (once published),
|
|
121
193
|
or run `bundle exec yard` in this directory for a local copy.
|
data/lib/pikuri/vector_db/backend/chroma.rb
CHANGED
|
@@ -13,6 +13,14 @@ module Pikuri
|
|
|
13
13
|
# on empty input + non-positive +top_k+. Where the two
|
|
14
14
|
# diverge is the vector-dim contract — see below.
|
|
15
15
|
#
|
|
16
|
+
# The client is hand-rolled rather than a dependency on a
|
|
17
|
+
# +chroma-db+ gem: only a handful of v2 endpoints are needed
|
|
18
|
+
# (listed below), Faraday is already in the dependency closure,
|
|
19
|
+
# and a thin first-party client keeps the wire protocol
|
|
20
|
+
# auditable in one readable file — consistent with the
|
|
21
|
+
# read-it-in-an-evening ceiling. The cost is tracking Chroma's
|
|
22
|
+
# v2 API by hand if it changes.
|
|
23
|
+
#
|
|
16
24
|
# == Two ways to get one
|
|
17
25
|
#
|
|
18
26
|
# * **Bring your own.** +Backend::Chroma.new(host:, port:,
|
|
@@ -21,7 +29,7 @@ module Pikuri
|
|
|
21
29
|
# already running on the host for an unrelated project).
|
|
22
30
|
# The host owns the process; this class is purely the
|
|
23
31
|
# HTTP client.
|
|
24
|
-
# * **Let pikuri manage it.** {
|
|
32
|
+
# * **Let pikuri manage it.** {Server::Chroma.ensure_running}
|
|
25
33
|
# spawns and supervises a chroma container under the
|
|
26
34
|
# +pikuri-internal-chroma+ name, against a pinned image,
|
|
27
35
|
# with a bind-mounted volume in the user's cache dir.
|
|
@@ -46,6 +54,11 @@ module Pikuri
|
|
|
46
54
|
# count.
|
|
47
55
|
# * +DELETE /api/v2/.../collections/{id}+ — drop the
|
|
48
56
|
# collection (used by +#delete_all+).
|
|
57
|
+
# * +POST /api/v2/.../collections/{id}/delete+ — metadata-
|
|
58
|
+
# filtered delete (+{where:}+); used by +#delete_by_source+.
|
|
59
|
+
# * +POST /api/v2/.../collections/{id}/get+ — fetch rows by
|
|
60
|
+
# +{where:}+ filter with an +include:+ projection; used by
|
|
61
|
+
# +#sources_with_hashes+.
|
|
49
62
|
#
|
|
50
63
|
# == BYO embeddings (not Chroma's embedder)
|
|
51
64
|
#
|
|
@@ -110,6 +123,15 @@ module Pikuri
|
|
|
110
123
|
# drift. Real-Chroma smoke testing is wired into the demo
|
|
111
124
|
# binary in a later phase. Targets Chroma 0.5.x+ (v2 API).
|
|
112
125
|
class Chroma
|
|
126
|
+
# Rows per +/get+ page in {#sources_with_hashes}. Caps the
|
|
127
|
+
# JSON burst + parse working set of the boot manifest read on
|
|
128
|
+
# a large corpus; small corpora finish in one page. Chunky but
|
|
129
|
+
# not arbitrary — one round trip per this-many *files*, and the
|
|
130
|
+
# manifest is one row per file (the +offset 0+ chunk), so a
|
|
131
|
+
# 50k-file corpus is ~50 localhost round trips instead of one
|
|
132
|
+
# multi-MB response.
|
|
133
|
+
MANIFEST_PAGE_SIZE = 1_000
|
|
134
|
+
|
|
113
135
|
# @param host [String]
|
|
114
136
|
# @param port [Integer]
|
|
115
137
|
# @param collection [String] collection name in Chroma.
|
|
@@ -271,6 +293,121 @@ module Pikuri
|
|
|
271
293
|
raise "Backend::Chroma: count response was not an Integer (got #{body.inspect})"
|
|
272
294
|
end
|
|
273
295
|
|
|
296
|
+
# Remove every chunk whose +source+ matches, via a
|
|
297
|
+
# metadata-filtered +POST .../delete+ (+source+ is the
|
|
298
|
+
# reserved metadata key {#upsert} writes). The scoped
|
|
299
|
+
# counterpart to {#delete_all}. No-op when the collection
|
|
300
|
+
# doesn't exist yet.
|
|
301
|
+
#
|
|
302
|
+
# @param source [String] the {Chunk#source} to purge.
|
|
303
|
+
# @return [void]
|
|
304
|
+
# @raise [RuntimeError] on HTTP failure.
|
|
305
|
+
def delete_by_source(source)
|
|
306
|
+
return nil if @collection_id.nil? && !collection_exists?
|
|
307
|
+
|
|
308
|
+
post_json("#{collection_path}/delete", { where: { 'source' => source } })
|
|
309
|
+
nil
|
|
310
|
+
end
|
|
311
|
+
|
|
312
|
+
# Replace all chunks for one +source+: delete the old set,
|
|
313
|
+
# then upsert the new one. The incremental-reindex unit
|
|
314
|
+
# (see {Indexer#reindex_file!}).
|
|
315
|
+
#
|
|
316
|
+
# == Not transactional (the InMemory divergence)
|
|
317
|
+
#
|
|
318
|
+
# These are two HTTP calls, so a +#query+ landing between
|
|
319
|
+
# them can see the source with zero chunks — a window
|
|
320
|
+
# {InMemory#replace_source} closes with its monitor but
|
|
321
|
+
# Chroma cannot, short of server-side transactions it
|
|
322
|
+
# doesn't expose. The window is small and the
|
|
323
|
+
# {Indexer} mitigates the *common* failure: it embeds
|
|
324
|
+
# before calling here, so an embedder outage never reaches
|
|
325
|
+
# this method and the old chunks stay put. Delete-then-upsert
|
|
326
|
+
# (not the reverse): upserting first then deleting by source
|
|
327
|
+
# would delete the just-written chunks.
|
|
328
|
+
#
|
|
329
|
+
# @param source [String] the {Chunk#source} being replaced.
|
|
330
|
+
# @param chunks [Array<Chunk>] the new chunk set.
|
|
331
|
+
# @param vectors [Array<Array<Float>>] parallel to +chunks+.
|
|
332
|
+
# @return [void]
|
|
333
|
+
# @raise [ArgumentError] on empty input or length mismatch.
|
|
334
|
+
# @raise [RuntimeError] on HTTP failure.
|
|
335
|
+
def replace_source(source:, chunks:, vectors:)
|
|
336
|
+
delete_by_source(source)
|
|
337
|
+
upsert(chunks: chunks, vectors: vectors)
|
|
338
|
+
nil
|
|
339
|
+
end
|
|
340
|
+
|
|
341
|
+
# The boot-sweep reference: +source+ → stored content hash
|
|
342
|
+
# for every indexed document. Reads one metadata row per
|
|
343
|
+
# *file*, not per chunk, via three Chroma +/get+ knobs:
|
|
344
|
+
#
|
|
345
|
+
# * +where: { offset: 0 }+ — every file has exactly one
|
|
346
|
+
# chunk at offset 0, so this returns one row per source.
|
|
347
|
+
# * +include: ['metadatas']+ — drops the heavy +embeddings+
|
|
348
|
+
# and +documents+ from the response; we pull only the
|
|
349
|
+
# metadata projection, never the vectors.
|
|
350
|
+
# * +limit+ / +offset+ — page the read in
|
|
351
|
+
# {MANIFEST_PAGE_SIZE} chunks so a large corpus never
|
|
352
|
+
# materializes one multi-MB response. (Two unrelated
|
|
353
|
+
# +offset+s collide in the wording: the +where+ +offset+ is
|
|
354
|
+
# a *chunk* metadata field; the top-level +offset+ is the
|
|
355
|
+
# *pagination* cursor — different namespaces in the API.)
|
|
356
|
+
#
|
|
357
|
+
# Pagination assumes the manifest isn't mutating mid-read; the
|
|
358
|
+
# {Watcher} drives this from its single worker thread, so no
|
|
359
|
+
# reindex runs concurrently with the boot sweep that calls it.
|
|
360
|
+
#
|
|
361
|
+
# @return [Hash{String => String, nil}] +source+ → content
|
|
362
|
+
# hash. Empty when the collection doesn't exist yet.
|
|
363
|
+
# @raise [RuntimeError] on HTTP failure.
|
|
364
|
+
def sources_with_hashes
|
|
365
|
+
return {} if @collection_id.nil? && !collection_exists?
|
|
366
|
+
|
|
367
|
+
result = {}
|
|
368
|
+
cursor = 0
|
|
369
|
+
loop do
|
|
370
|
+
body = post_json("#{collection_path}/get", {
|
|
371
|
+
where: { 'offset' => 0 },
|
|
372
|
+
include: ['metadatas'],
|
|
373
|
+
limit: MANIFEST_PAGE_SIZE,
|
|
374
|
+
offset: cursor
|
|
375
|
+
})
|
|
376
|
+
metas = body.is_a?(Hash) ? (body['metadatas'] || []) : []
|
|
377
|
+
metas.each do |meta|
|
|
378
|
+
next unless meta.is_a?(Hash) && meta['source']
|
|
379
|
+
|
|
380
|
+
result[meta['source']] = meta['hash']
|
|
381
|
+
end
|
|
382
|
+
break if metas.size < MANIFEST_PAGE_SIZE
|
|
383
|
+
|
|
384
|
+
cursor += metas.size
|
|
385
|
+
end
|
|
386
|
+
result
|
|
387
|
+
end
|
|
388
|
+
|
|
389
|
+
# Is +source+ in the corpus? Scoped existence check for
|
|
390
|
+
# {VectorDb::Tools::Read}'s membership gate: a +where+-filtered
|
|
391
|
+
# +/get+ capped at one row, +include: []+ so the response
|
|
392
|
+
# carries only ids — O(1) transport regardless of corpus
|
|
393
|
+
# size, never the full {#sources_with_hashes} manifest. See
|
|
394
|
+
# the Backend protocol yardoc.
|
|
395
|
+
#
|
|
396
|
+
# @param source [String] the {Chunk#source} to test.
|
|
397
|
+
# @return [Boolean] true if at least one chunk has this source.
|
|
398
|
+
# @raise [RuntimeError] on HTTP failure.
|
|
399
|
+
def source_indexed?(source)
|
|
400
|
+
return false if @collection_id.nil? && !collection_exists?
|
|
401
|
+
|
|
402
|
+
body = post_json("#{collection_path}/get", {
|
|
403
|
+
where: { 'source' => source },
|
|
404
|
+
include: [],
|
|
405
|
+
limit: 1
|
|
406
|
+
})
|
|
407
|
+
ids = body.is_a?(Hash) ? (body['ids'] || []) : []
|
|
408
|
+
!ids.empty?
|
|
409
|
+
end
|
|
410
|
+
|
|
274
411
|
private
|
|
275
412
|
|
|
276
413
|
def collections_path
|
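The paging loop in `#sources_with_hashes` above can be exercised without a Chroma server. The sketch below keeps the same "stop on a short page" rule but swaps the HTTP `/get` call for an in-process lambda; `read_manifest`, the `fetch` callable, and the sample rows are hypothetical names introduced only for this illustration.

```ruby
# Hypothetical sketch of the limit/offset paging pattern; the real method
# posts to Chroma's /get endpoint instead of calling a lambda.

# Reads every row through +fetch+, a callable taking limit: and offset:.
# Returns [manifest_hash, number_of_requests].
def read_manifest(fetch, page_size:)
  result = {}
  cursor = 0
  requests = 0
  loop do
    page = fetch.call(limit: page_size, offset: cursor)
    requests += 1
    page.each { |meta| result[meta['source']] = meta['hash'] }
    break if page.size < page_size # a short page means the data is exhausted

    cursor += page.size
  end
  [result, requests]
end

# Stand-in data: one metadata row per file, as in the manifest read.
rows = (1..7).map { |i| { 'source' => "notes/#{i}.md", 'hash' => "h#{i}" } }
fetch = ->(limit:, offset:) { rows[offset, limit] || [] }

manifest, requests = read_manifest(fetch, page_size: 3)
puts "#{manifest.size} sources in #{requests} requests"
```

One consequence of this termination rule is that a corpus whose size is an exact multiple of the page size costs one extra request that returns an empty page. That is usually an acceptable price for not needing a separate count call.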
data/lib/pikuri/vector_db/backend/in_memory.rb
CHANGED
|
@@ -1,12 +1,13 @@
|
|
|
1
1
|
# frozen_string_literal: true
|
|
2
2
|
|
|
3
|
+
require 'monitor'
|
|
4
|
+
|
|
3
5
|
module Pikuri
|
|
4
6
|
module VectorDb
|
|
5
7
|
module Backend
|
|
6
|
-
# Pure-Ruby vector store. The educational default backend
|
|
7
|
-
#
|
|
8
|
-
#
|
|
9
|
-
# promoting users to +Chroma+ for persistence.
|
|
8
|
+
# Pure-Ruby vector store. The educational default backend —
|
|
9
|
+
# the "small enough to audit" first stop the demo + guide walk
|
|
10
|
+
# through before promoting users to +Chroma+ for persistence.
|
|
10
11
|
#
|
|
11
12
|
# == What it does
|
|
12
13
|
#
|
|
@@ -30,9 +31,23 @@ module Pikuri
|
|
|
30
31
|
# * **No approximate search.** Exhaustive scan. Approximate
|
|
31
32
|
# nearest neighbor (HNSW, IVF) adds complexity that doesn't
|
|
32
33
|
# teach anything additional once the cosine math is clear.
|
|
33
|
-
# * **No
|
|
34
|
-
#
|
|
35
|
-
#
|
|
34
|
+
# * **No approximate-search index.** Exhaustive scan only.
|
|
35
|
+
#
|
|
36
|
+
# == Thread safety
|
|
37
|
+
#
|
|
38
|
+
# Every public method runs under a single reentrant
|
|
39
|
+
# +Monitor+. The agent's main thread calls +#query+ while a
|
|
40
|
+
# background {Watcher} thread calls +#replace_source+ /
|
|
41
|
+
# +#delete_by_source+, so concurrent access is real once
|
|
42
|
+
# auto-watch is wired. The lock's load-bearing job is
|
|
43
|
+
# {#replace_source}: it holds the monitor across the
|
|
44
|
+
# delete-then-upsert so a concurrent +#query+ never observes
|
|
45
|
+
# the gap where a source has zero chunks. +Monitor+ (not a
|
|
46
|
+
# bare +Mutex+) because +#replace_source+ re-enters the lock
|
|
47
|
+
# via +#delete_by_source+ + +#upsert+, which a non-reentrant
|
|
48
|
+
# +Mutex+ would deadlock on. +Chroma+ needs no client-side
|
|
49
|
+
# lock — the server serializes — so this is the one backend
|
|
50
|
+
# that locks.
|
|
36
51
|
#
|
|
37
52
|
# == Cosine, not dot product
|
|
38
53
|
#
|
|
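The "Thread safety" comment above turns on the difference between a reentrant `Monitor` and a plain `Mutex`. The sketch below reproduces that shape with a toy `Store` class (a hypothetical stand-in, not `Backend::InMemory`): `replace` holds the lock and then calls two methods that take the same lock again.

```ruby
require 'monitor'

# Hypothetical toy store; only the locking shape mirrors the backend.
class Store
  def initialize(lock)
    @lock = lock
    @items = {}
  end

  def delete(key)
    @lock.synchronize { @items.delete(key) }
  end

  def put(key, value)
    @lock.synchronize { @items[key] = value }
  end

  # Holds the lock across both halves, re-entering it inside the callees.
  def replace(key, value)
    @lock.synchronize do
      delete(key)
      put(key, value)
    end
  end

  def fetch(key)
    @lock.synchronize { @items[key] }
  end
end

reentrant = Store.new(Monitor.new)
reentrant.replace(:doc, 'v2')
puts "Monitor: replace succeeded, value=#{reentrant.fetch(:doc)}"

begin
  Store.new(Mutex.new).replace(:doc, 'v2')
rescue ThreadError => e
  puts "Mutex: #{e.message}"
end
```

In current Ruby the `Mutex` case does not hang: the interpreter detects the same thread locking twice and raises `ThreadError`. Either way the nested call cannot proceed, which is why the reentrant `Monitor` is the lock that fits here.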
@@ -53,6 +68,10 @@ module Pikuri
|
|
|
53
68
|
# enforced for every subsequent +#upsert+ + +#query+ — see
|
|
54
69
|
# the Backend protocol's "Vector-dim contract" yardoc.
|
|
55
70
|
@dim = nil
|
|
71
|
+
# Reentrant so +#replace_source+ can call +#delete_by_source+
|
|
72
|
+
# + +#upsert+ while holding the lock — see the class yardoc's
|
|
73
|
+
# "Thread safety" section.
|
|
74
|
+
@lock = Monitor.new
|
|
56
75
|
end
|
|
57
76
|
|
|
58
77
|
# Insert-or-replace by +chunk.id+. Parallel arrays of
|
|
@@ -71,15 +90,17 @@ module Pikuri
|
|
|
71
90
|
raise ArgumentError, "size mismatch: #{chunks.size} chunks vs #{vectors.size} vectors"
|
|
72
91
|
end
|
|
73
92
|
|
|
74
|
-
|
|
75
|
-
|
|
76
|
-
|
|
93
|
+
@lock.synchronize do
|
|
94
|
+
expected = @dim || vectors.first.size
|
|
95
|
+
vectors.each_with_index do |v, i|
|
|
96
|
+
next if v.size == expected
|
|
77
97
|
|
|
78
|
-
|
|
79
|
-
|
|
80
|
-
|
|
98
|
+
raise ArgumentError, "vector #{i} has dim #{v.size}, expected #{expected}"
|
|
99
|
+
end
|
|
100
|
+
@dim ||= expected
|
|
81
101
|
|
|
82
|
-
|
|
102
|
+
chunks.zip(vectors).each { |chunk, vector| @entries[chunk.id] = [chunk, vector] }
|
|
103
|
+
end
|
|
83
104
|
nil
|
|
84
105
|
end
|
|
85
106
|
|
|
@@ -96,16 +117,19 @@ module Pikuri
|
|
|
96
117
|
# dim mismatch.
|
|
97
118
|
def query(vector:, top_k:)
|
|
98
119
|
raise ArgumentError, "top_k must be positive (got #{top_k})" if top_k <= 0
|
|
99
|
-
return [] if @entries.empty?
|
|
100
120
|
|
|
101
|
-
|
|
102
|
-
|
|
103
|
-
end
|
|
121
|
+
@lock.synchronize do
|
|
122
|
+
return [] if @entries.empty?
|
|
104
123
|
|
|
105
|
-
|
|
106
|
-
|
|
124
|
+
if vector.size != @dim
|
|
125
|
+
raise ArgumentError, "query vector dim #{vector.size}, stored dim #{@dim}"
|
|
126
|
+
end
|
|
127
|
+
|
|
128
|
+
scored = @entries.values.map do |chunk, stored|
|
|
129
|
+
Result.new(chunk: chunk, score: cosine(vector, stored))
|
|
130
|
+
end
|
|
131
|
+
scored.sort_by { |r| -r.score }.first(top_k)
|
|
107
132
|
end
|
|
108
|
-
scored.sort_by { |r| -r.score }.first(top_k)
|
|
109
133
|
end
|
|
110
134
|
|
|
111
135
|
# Drop every stored chunk. Used by the v1 nuke-and-reload
|
|
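The `#query` hunk above calls a `cosine` helper whose body is not part of this diff. As a rough sketch of what an exhaustive cosine top-k scan looks like, under the assumption that vectors are plain `Array<Float>` of equal length, the following is self-contained. The names `cosine` and `top_k`, and the choice to score a zero-length vector as `0.0`, are assumptions of this example rather than the gem's code.

```ruby
# Hypothetical sketch of exhaustive cosine top-k; not the gem's implementation.

# Cosine similarity of two equal-length numeric arrays.
def cosine(a, b)
  raise ArgumentError, "dim mismatch: #{a.size} vs #{b.size}" unless a.size == b.size

  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  return 0.0 if norm_a.zero? || norm_b.zero? # assumption: zero vectors score 0

  dot / (norm_a * norm_b)
end

# Score every stored vector against +query+ and keep the best +k+.
# +entries+ maps an id to its vector. Returns [id, score] pairs.
def top_k(query, entries, k)
  entries.map { |id, vector| [id, cosine(query, vector)] }
         .sort_by { |_id, score| -score }
         .first(k)
end

entries = {
  'east' => [1.0, 0.0],
  'north' => [0.0, 1.0],
  'north_east' => [1.0, 1.0]
}
top_k([1.0, 0.1], entries, 2).each do |id, score|
  puts format('%-10s %.3f', id, score)
end
```

Dividing by both norms is what separates cosine from a raw dot product: a vector pointing the same way scores the same regardless of its length, so `[2.0, 0.0]` matches `[1.0, 0.0]` exactly.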
@@ -114,14 +138,89 @@ module Pikuri
|
|
|
114
138
|
#
|
|
115
139
|
# @return [void]
|
|
116
140
|
def delete_all
|
|
117
|
-
@
|
|
118
|
-
|
|
141
|
+
@lock.synchronize do
|
|
142
|
+
@entries.clear
|
|
143
|
+
@dim = nil
|
|
144
|
+
end
|
|
119
145
|
nil
|
|
120
146
|
end
|
|
121
147
|
|
|
122
148
|
# @return [Integer] current chunk count.
|
|
123
149
|
def count
|
|
124
|
-
@entries.size
|
|
150
|
+
@lock.synchronize { @entries.size }
|
|
151
|
+
end
|
|
152
|
+
|
|
153
|
+
# Remove every chunk whose +source+ matches. The scoped
|
|
154
|
+
# counterpart to {#delete_all} — drops one document's chunks
|
|
155
|
+
# without touching the rest. No-op (and no error) when the
|
|
156
|
+
# source isn't present. The dim lock is left intact: unlike
|
|
157
|
+
# {#delete_all}, a per-source delete doesn't imply an
|
|
158
|
+
# embedder change.
|
|
159
|
+
#
|
|
160
|
+
# @param source [String] the {Chunk#source} to purge, e.g.
|
|
161
|
+
# +"notes/cooking.md"+.
|
|
162
|
+
# @return [void]
|
|
163
|
+
def delete_by_source(source)
|
|
164
|
+
@lock.synchronize do
|
|
165
|
+
@entries.reject! { |_id, (chunk, _vector)| chunk.source == source }
|
|
166
|
+
end
|
|
167
|
+
nil
|
|
168
|
+
end
|
|
169
|
+
|
|
170
|
+
# Atomically replace all chunks for one +source+: delete the
|
|
171
|
+
# old set, then upsert the new one, under a single hold of the
|
|
172
|
+
# monitor. The incremental-reindex unit (see {Indexer#reindex_file!}).
|
|
173
|
+
# Holding the lock across both halves is the point — a
|
|
174
|
+
# concurrent {#query} sees either the old chunks or the new
|
|
175
|
+
# ones, never the empty gap between.
|
|
176
|
+
#
|
|
177
|
+
# @param source [String] the {Chunk#source} being replaced.
|
|
178
|
+
# @param chunks [Array<Chunk>] the new chunk set; every
|
|
179
|
+
# +chunk.source+ should equal +source+.
|
|
180
|
+
# @param vectors [Array<Array<Float>>] parallel to +chunks+.
|
|
181
|
+
# @return [void]
|
|
182
|
+
# @raise [ArgumentError] on empty input, length mismatch, or
|
|
183
|
+
# vector-dim mismatch (from the inner {#upsert}).
|
|
184
|
+
def replace_source(source:, chunks:, vectors:)
|
|
185
|
+
@lock.synchronize do
|
|
186
|
+
delete_by_source(source)
|
|
187
|
+
upsert(chunks: chunks, vectors: vectors)
|
|
188
|
+
end
|
|
189
|
+
nil
|
|
190
|
+
end
|
|
191
|
+
|
|
192
|
+
# The boot-sweep reference: a map from each indexed +source+
|
|
193
|
+
# to the content hash stored on its chunks. {Watcher} (via
|
|
194
|
+
# {Indexer#reconcile_plan}) diffs this against the hashes of
|
|
195
|
+
# the files currently on disk to decide what to reindex.
|
|
196
|
+
# Built from chunk metadata; a chunk indexed before the
|
|
197
|
+
# +hash+ metadata existed maps its source to +nil+, which the
|
|
198
|
+
# diff treats as "changed" and reindexes — self-healing.
|
|
199
|
+
#
|
|
200
|
+
# @return [Hash{String => String, nil}] +source+ → content
|
|
201
|
+
# hash. Empty when nothing is indexed (the InMemory case at
|
|
202
|
+
# every boot, since RAM resets).
|
|
203
|
+
def sources_with_hashes
|
|
204
|
+
@lock.synchronize do
|
|
205
|
+
result = {}
|
|
206
|
+
@entries.each_value do |chunk, _vector|
|
|
207
|
+
result[chunk.source] ||= chunk.metadata[:hash]
|
|
208
|
+
end
|
|
209
|
+
result
|
|
210
|
+
end
|
|
211
|
+
end
|
|
212
|
+
|
|
213
|
+
# Is +source+ in the corpus? The scoped membership test
|
|
214
|
+
# behind {VectorDb::Tools::Read}'s gate — a short-circuiting scan
|
|
215
|
+
# rather than building the whole {#sources_with_hashes} map
|
|
216
|
+
# just to read one key. See the Backend protocol yardoc.
|
|
217
|
+
#
|
|
218
|
+
# @param source [String] the {Chunk#source} to test.
|
|
219
|
+
# @return [Boolean] true if at least one chunk has this source.
|
|
220
|
+
def source_indexed?(source)
|
|
221
|
+
@lock.synchronize do
|
|
222
|
+
@entries.each_value.any? { |chunk, _vector| chunk.source == source }
|
|
223
|
+
end
|
|
125
224
|
end
|
|
126
225
|
|
|
127
226
|
private
|
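The `#sources_with_hashes` comment above says the boot sweep diffs stored hashes against the files currently on disk, and that a `nil` stored hash counts as "changed". The body of `Indexer#reconcile_plan` is not part of this excerpt, so the sketch below is only a guess at the shape of such a diff: the `reconcile` helper, the SHA-256 digest, and the decision to also list vanished sources for deletion are all assumptions.

```ruby
require 'digest'

# Hypothetical sketch; not the gem's Indexer#reconcile_plan.
# +stored+  maps source => hash recorded in the index (may be nil).
# +on_disk+ maps source => hash of the file as it is now.
def reconcile(stored, on_disk)
  # Anything new, changed, or stored without a hash gets reindexed.
  reindex = on_disk.reject { |source, hash| stored[source] == hash }.keys
  # Anything indexed but no longer on disk gets dropped (an assumption).
  delete = stored.keys - on_disk.keys
  { reindex: reindex, delete: delete }
end

sha = ->(content) { Digest::SHA256.hexdigest(content) }

stored = {
  'a.md' => sha.call('alpha'),  # unchanged
  'b.md' => nil,                # indexed before hashes were recorded
  'gone.md' => sha.call('old')  # file has since been removed
}
on_disk = {
  'a.md' => sha.call('alpha'),
  'b.md' => sha.call('beta'),
  'new.md' => sha.call('new')
}

plan = reconcile(stored, on_disk)
puts "reindex: #{plan[:reindex].inspect}"
puts "delete:  #{plan[:delete].inspect}"
```

Because a missing or `nil` stored hash can never equal a real digest, older entries fall into the reindex set without any special casing. That is the self-healing behaviour the comment describes.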