pikuri-vectordb 0.0.4

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: 3a64900380a7c48712b1ea58053f141d1c6f5da57e723925ca24319899e91607
+   data.tar.gz: 5ed708f037e0cd979e1a2018b9fabf0d996679befbcca69def2d58ce8aac8fd4
+ SHA512:
+   metadata.gz: ada1509afe42590bae569b032cc88fe4ae2ed724a152388fc710b3fe5275f256fee2af6e2db29bb82469d11d651698a653addf10a7679df4b57acadaaab76720
+   data.tar.gz: d25268e97f43a4c73fb4eea1e957a3c9129b4d543e3c78e767318cc784eb0898ec3ba647210b6d969521a2214c6934c5f2aadb387e32f98ca655ff5eee0bb92c
data/README.md ADDED
@@ -0,0 +1,121 @@
+ # pikuri-vectordb
+
+ Local-corpus vector search + agentic RAG for the
+ [pikuri](https://codeberg.org/mvysny/pikuri) AI-assistant toolkit.
+
+ > **Status:** skeleton — gem scaffolding only. The
+ > `Pikuri::VectorDb::Extension` and `vectordb_search` tool are
+ > being built in subsequent commits. See `IDEAS.md` §"Vector DB /
+ > RAG" for the design.
+
+ Will provide:
+
+ - `Pikuri::VectorDb::Extension` — wires a `vectordb_search` tool +
+   a `vectordb_reindex` tool onto a `Pikuri::Agent` via
+   `c.add_extension(...)` inside the `Agent.new` block.
+ - `Pikuri::VectorDb::Backend::InMemory` — pure-Ruby cosine over
+   `Array<Float>`. The educational default; reads in ~40 lines.
+   RAM-only; everything reloads from sources on every boot.
+ - `Pikuri::VectorDb::Backend::Chroma` — thin Faraday HTTP client
+   against a self-hosted ChromaDB. The persistent option.
+ - `Pikuri::VectorDb::Embedder` — thin wrapper over `RubyLLM.embed`
+   so tests can inject a fake without monkey-patching ruby_llm.
+ - `Pikuri::VectorDb::Reranker::LlamaServer` — optional quality
+   knob. Speaks `/v1/rerank` against a cross-encoder model on a
+   llama.cpp server. Passing `reranker: nil` to the extension
+   skips reranking; retrieval falls back to vector-only top-k.
+ - `Pikuri::VectorDb::Chunker::FixedWindow` + `Tokenizer::*` —
+   the chunking pipeline. Tokenizer is a duck-typed protocol
+   (`count(text) -> Integer`) with two impls in v1:
+   `Tokenizer::CharHeuristic` (default, ~4 chars/token rule) and
+   `Tokenizer::LlamaServer` (POST `/tokenize` against the
+   embedder's endpoint).
+ - Text extraction reuses `Pikuri::FileType.read_as_text` from
+   pikuri-core — plain text / Markdown / PDF. HTML extraction
+   is a deferred follow-up; v1 corpora skew toward Markdown
+   notes and PDF docs in practice.
+ - `Pikuri::VectorDb::LIBRARIAN` — bundled
+   `Pikuri::SubAgent::Persona` constant. Hosts wire it via
+   `SubAgent::Extension.new(personas: [..., LIBRARIAN])` — same
+   shape `pikuri-code` uses for `GIT_REPO_RESEARCHER`.
+
+ ## Install
+
+ ```ruby
+ # Gemfile
+ gem 'pikuri-vectordb'
+ ```
+
+ ## Usage (preview — not yet wired)
+
+ ```ruby
+ require 'pikuri-core'
+ require 'pikuri-vectordb'
+
+ backend = Pikuri::VectorDb::Backend::InMemory.new
+ # Or for persistent storage:
+ # backend = Pikuri::VectorDb::Backend::Chroma.new(
+ #   host: 'localhost', port: 8000, collection: 'my-docs',
+ # )
+ agent = Pikuri::Agent.new(transport: ..., system_prompt: ...) do |c|
+   c.add_extension(
+     Pikuri::VectorDb::Extension.new(
+       backend: backend,
+       source: '~/notes',
+     )
+   )
+ end
+ ```
+
+ Collection naming is Chroma-specific so it lives on
+ `Backend::Chroma.new(collection:)`, not on the Extension —
+ `Backend::InMemory` has no collection concept.
+
+ For hosts that want recall behind a privilege-separated sub-agent
+ (the trifecta-defense pattern — see `SECURITY.md` and `IDEAS.md`
+ §"Vector DB / RAG"), additionally wire the `LIBRARIAN` persona
+ via `pikuri-subagents`:
+
+ ```ruby
+ require 'pikuri-subagents'
+
+ c.add_extension(
+   Pikuri::SubAgent::Extension.new(
+     personas: [Pikuri::VectorDb::LIBRARIAN]
+   )
+ )
+ ```
+
+ ## Three model endpoints
+
+ A full assistant setup wants three LLM endpoints: chat (via
+ `ruby_llm`), an embedder (via `RubyLLM.embed`), and an optional
+ reranker (HTTP `/v1/rerank`). Recommended setup: **one
+ `llama-server` running in router mode** — started with no
+ `--model` flag, it serves every GGUF in `~/.cache/llama.cpp/`
+ from a single port and loads whichever model each request asks
+ for. Requires a recent enough `llama.cpp` build to include the
+ [model-management feature](https://huggingface.co/blog/ggml-org/model-management-in-llamacpp);
+ Ubuntu 26.04+ packages one. The guide's
+ [chapter 1](../docs/guide/01-chat.md) walks through the setup;
+ [chapter 3](../docs/guide/03-vectordb.md) adds the embedder and
+ reranker on top.
+
+ If you'd rather pin the reranker in its own process — to avoid
+ paying the router's unload/reload cost on rerank requests —
+ `Reranker::LlamaServer` takes its own `endpoint:` argument and
+ can point at a separate `llama-server`. Otherwise pikuri stays
+ agnostic: it just needs URLs.
+
+ Larger multi-model runtimes (Ollama, LM Studio, ...) expose
+ OpenAI-compatible endpoints and would also work, but pikuri's
+ "small enough to audit" ethos keeps the recommended path on
+ `llama.cpp` alone.
+
+ ## Further reading
+
+ - **Design notes:** `IDEAS.md` §"Vector DB / RAG" at the repo
+   root.
+ - **API reference:** browse the YARD docs at
+   <https://rubydoc.info/gems/pikuri-vectordb> (once published),
+   or run `bundle exec yard` in this directory for a local copy.
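The README describes the tokenizer as a duck-typed protocol (`count(text) -> Integer`) with a ~4 chars/token default. The sketch below illustrates that idea only; it is not the gem's code. `CharHeuristicSketch`, `WordTokenizerSketch`, and `fixed_windows` are hypothetical names, and rounding up as well as splitting on whitespace are assumptions made for the example.

```ruby
# Hypothetical stand-in for the ~4 chars/token rule the README mentions.
# Rounding up is an assumption; the gem may round differently.
class CharHeuristicSketch
  CHARS_PER_TOKEN = 4

  # @param text [String]
  # @return [Integer] estimated token count
  def count(text)
    (text.length.to_f / CHARS_PER_TOKEN).ceil
  end
end

# Any object that responds to #count(text) satisfies the protocol,
# so a test double needs no shared base class.
class WordTokenizerSketch
  def count(text)
    text.split.size
  end
end

# Greedy window builder (illustrative only): keeps appending words until
# the tokenizer reports that the window would exceed max_tokens.
def fixed_windows(text, tokenizer:, max_tokens:)
  windows = []
  current = ''
  text.split.each do |word|
    candidate = current.empty? ? word : "#{current} #{word}"
    if tokenizer.count(candidate) > max_tokens && !current.empty?
      windows << current
      current = word
    else
      current = candidate
    end
  end
  windows << current unless current.empty?
  windows
end

p fixed_windows('one two three four', tokenizer: WordTokenizerSketch.new, max_tokens: 2)
# => ["one two", "three four"]
```

Because the window builder only ever calls `#count`, swapping the heuristic for a server-backed tokenizer would not change the chunking code, which is presumably the point of keeping the protocol duck-typed.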
@@ -0,0 +1,350 @@
+ # frozen_string_literal: true
+
+ require 'faraday'
+ require 'json'
+
+ module Pikuri
+   module VectorDb
+     module Backend
+       # Thin Faraday HTTP client against a self-hosted Chroma
+       # server (v2 API). The persistent backend, behind the same
+       # duck-typed {Backend} protocol as {InMemory}: same method
+       # names, same return shapes, same +ArgumentError+ contract
+       # on empty input + non-positive +top_k+. Where the two
+       # diverge is the vector-dim contract — see below.
+       #
+       # == Two ways to get one
+       #
+       # * **Bring your own.** +Backend::Chroma.new(host:, port:,
+       #   collection:)+ against an existing chroma deployment
+       #   (production cluster, docker-compose stack, a chroma
+       #   already running on the host for an unrelated project).
+       #   The host owns the process; this class is purely the
+       #   HTTP client.
+       # * **Let pikuri manage it.** {ChromaServer.ensure_running}
+       #   spawns and supervises a chroma container under the
+       #   +pikuri-internal-chroma+ name, against a pinned image,
+       #   with a bind-mounted volume in the user's cache dir.
+       #   Its +#client(collection:)+ returns a +Backend::Chroma+
+       #   pre-pointed at the supervised container. The split is
+       #   deliberate: docker lifecycle and HTTP wire protocol
+       #   have nothing in common, so each lives in its own class.
+       #
+       # == Chroma v2 API
+       #
+       # Endpoints used:
+       #
+       # * +POST /api/v2/tenants/{tenant}/databases/{db}/collections+
+       #   with +get_or_create: true+ — idempotent collection
+       #   creation. Returns +{id, name, ...}+.
+       # * +POST /api/v2/.../collections/{id}/upsert+ — insert or
+       #   replace by id. Body carries parallel arrays of +ids+,
+       #   +embeddings+, +documents+, +metadatas+.
+       # * +POST /api/v2/.../collections/{id}/query+ — k-NN
+       #   search. Body: +{query_embeddings, n_results, include}+.
+       # * +GET /api/v2/.../collections/{id}/count+ — integer
+       #   count.
+       # * +DELETE /api/v2/.../collections/{id}+ — drop the
+       #   collection (used by +#delete_all+).
+       #
+       # == BYO embeddings (not Chroma's embedder)
+       #
+       # Chroma collections can carry an embedding function in
+       # their metadata — Chroma's term for what pikuri calls an
+       # {Embedder}. When configured, +add+ / +query+ accept raw
+       # text via +documents+ / +query_texts+ and Chroma embeds
+       # server-side. We deliberately don't use this: pikuri's
+       # +Embedder+ is the one source of truth for embedder
+       # choice, the provider-cliff visibility lives in pikuri's
+       # config, and a parallel Chroma-side embedder config would
+       # split the truth without pikuri noticing (e.g. local
+       # embedder in pikuri + +OpenAIEmbeddingFunction+ in Chroma
+       # — every indexed document silently lands at OpenAI). We
+       # always send pre-computed +embeddings+; Chroma's
+       # collection embedder is never invoked.
+       #
+       # == Vector-dim contract diverges from InMemory
+       #
+       # +InMemory+ enforces vector-dim consistency client-side
+       # (locks on first upsert, raises +ArgumentError+ on
+       # mismatch). +Chroma+ enforces server-side — first upsert
+       # to a collection establishes the dim; mismatched
+       # subsequent upserts produce HTTP 4xx which propagates
+       # as +RuntimeError+. Different exception class, same
+       # loud-failure shape. Documented divergence; not worth
+       # parsing Chroma's error envelope to coerce to +ArgumentError+.
+       #
+       # == Lazy collection resolution
+       #
+       # +Backend::Chroma.new+ doesn't talk to the server. The
+       # first +#upsert+ / +#query+ / +#count+ call resolves
+       # (and creates if missing) the collection by name, caches
+       # the id, and uses it thereafter. +#delete_all+ drops the
+       # collection and clears the cached id; the next +#upsert+
+       # re-creates from scratch.
+       #
+       # == Cosine distance (matches InMemory)
+       #
+       # Collection is created with +hnsw.space: 'cosine'+.
+       # Chroma returns cosine *distance* (range +[0, 2]+ where
+       # +0+ = identical, +1+ = orthogonal); +#query+ converts
+       # to similarity via +1 - distance+ so the {Backend::Result}
+       # score has the same meaning across backends.
+       #
+       # == Metadata key normalization
+       #
+       # Chroma serializes through JSON, so Symbol metadata keys
+       # become Strings on round-trip. +#upsert+ converts the
+       # incoming {Chunk}'s +metadata+ keys to Strings before
+       # sending; +#query+ converts them back to Symbols on the
+       # way out, so the {Chunk} a caller pulls from a query
+       # looks identical to one stored in InMemory. +source+
+       # rides as a special metadata key (Chroma has no native
+       # +source+ concept).
+       #
+       # == Testing posture
+       #
+       # Specs use +Faraday::Adapter::Test+ stubs only — they
+       # verify "we send what we think we're sending" against
+       # the v2 API shape but don't catch real-Chroma protocol
+       # drift. Real-Chroma smoke testing is wired into the demo
+       # binary in a later phase. Targets Chroma 0.5.x+ (v2 API).
+       class Chroma
+         # @param host [String]
+         # @param port [Integer]
+         # @param collection [String] collection name in Chroma.
+         #   This is a Chroma-specific identifier, so it lives
+         #   here rather than on +VectorDb::Extension+ (where
+         #   it'd be a no-op for +Backend::InMemory+).
+         # @param tenant [String] Chroma v2 tenant; defaults to
+         #   Chroma's own default.
+         # @param database [String] Chroma v2 database; defaults
+         #   to Chroma's own default.
+         # @param connection [Faraday::Connection, nil] optional
+         #   dependency-injection point for tests.
+         # @return [Chroma]
+         # @raise [ArgumentError] on empty +host+ or empty
+         #   +collection+.
+         def initialize(host:, port:, collection:,
+                        tenant: 'default_tenant',
+                        database: 'default_database',
+                        connection: nil)
+           raise ArgumentError, 'host must be non-empty' if host.nil? || host.to_s.empty?
+           raise ArgumentError, 'collection must be non-empty' if collection.nil? || collection.to_s.empty?
+
+           @host = host
+           @port = port
+           @collection_name = collection
+           @tenant = tenant
+           @database = database
+           @collection_id = nil
+           @connection = connection || Faraday.new(url: "http://#{host}:#{port}") do |f|
+             f.request :json
+             f.response :json
+             f.adapter Faraday.default_adapter
+           end
+         end
+
+         # Insert-or-replace by +chunk.id+. Parallel arrays of
+         # equal length; raises on empty input or length mismatch
+         # (same contract as {InMemory}). Chroma server enforces
+         # vector-dim consistency; mismatched dims surface as
+         # +RuntimeError+ from a 4xx response (the InMemory
+         # backend raises +ArgumentError+ for the same case —
+         # documented divergence).
+         #
+         # @param chunks [Array<Chunk>]
+         # @param vectors [Array<Array<Float>>]
+         # @return [void]
+         # @raise [ArgumentError] on empty input or length mismatch.
+         # @raise [RuntimeError] on HTTP failure.
+         def upsert(chunks:, vectors:)
+           raise ArgumentError, 'upsert called with empty chunks/vectors' if chunks.empty?
+           if chunks.size != vectors.size
+             raise ArgumentError, "size mismatch: #{chunks.size} chunks vs #{vectors.size} vectors"
+           end
+
+           ensure_collection!
+
+           metadatas = chunks.map do |c|
+             # Serialize +source+ as a reserved key in Chroma's
+             # +metadata+; merge in the user's metadata Hash with
+             # keys stringified for JSON round-trip stability.
+             base = { 'source' => c.source }
+             c.metadata.each { |k, v| base[k.to_s] = v }
+             base
+           end
+
+           body = {
+             ids: chunks.map(&:id),
+             embeddings: vectors,
+             documents: chunks.map(&:text),
+             metadatas: metadatas
+           }
+
+           post_json("#{collection_path}/upsert", body)
+           nil
+         end
+
+         # k-NN query by cosine similarity. Returns at most
+         # +top_k+ {Backend::Result}s descending by score.
+         # +score+ is +1 - cosine_distance+ so the value matches
+         # {InMemory}'s cosine-similarity scale.
+         #
+         # @param vector [Array<Float>]
+         # @param top_k [Integer]
+         # @return [Array<Backend::Result>]
+         # @raise [ArgumentError] on non-positive +top_k+.
+         # @raise [RuntimeError] on HTTP failure.
+         def query(vector:, top_k:)
+           raise ArgumentError, "top_k must be positive (got #{top_k})" if top_k <= 0
+
+           # If we've never upserted, the collection doesn't
+           # exist yet — semantic answer is "no hits."
+           return [] if @collection_id.nil? && !collection_exists?
+
+           response_body = post_json("#{collection_path}/query", {
+             query_embeddings: [vector],
+             n_results: top_k,
+             include: %w[documents metadatas distances]
+           })
+
+           ids = (response_body['ids'] || [[]]).first || []
+           docs = (response_body['documents'] || [[]]).first || []
+           metas = (response_body['metadatas'] || [[]]).first || []
+           dists = (response_body['distances'] || [[]]).first || []
+
+           ids.each_with_index.map do |id, i|
+             meta = metas[i] || {}
+             # Pull +source+ back out of the metadata blob;
+             # symbolize the remaining keys for round-trip
+             # consistency with InMemory.
+             source = meta['source'] || ''
+             chunk_meta = {}
+             meta.each do |k, v|
+               next if k == 'source'
+
+               chunk_meta[k.to_sym] = v
+             end
+
+             chunk = Chunk.new(id: id, source: source, text: docs[i] || '', metadata: chunk_meta)
+             Result.new(chunk: chunk, score: 1.0 - dists[i].to_f)
+           end
+         end
+
+         # Drop the collection. Next +#upsert+ re-creates from
+         # scratch — that's the v1 nuke-and-reload reindex path
+         # the {Indexer} drives. No-op if no collection was ever
+         # created (consistent with {InMemory}'s clear-on-empty
+         # behaviour). 404 on the DELETE is treated as "already
+         # gone" — idempotent.
+         #
+         # @return [void]
+         def delete_all
+           return nil if @collection_id.nil? && !collection_exists?
+
+           response = @connection.delete(collection_path)
+           unless [200, 204, 404].include?(response.status)
+             raise "Backend::Chroma: DELETE #{collection_path} returned " \
+                   "HTTP #{response.status}: #{response.body.inspect}"
+           end
+           @collection_id = nil
+           nil
+         end
+
+         # @return [Integer] current chunk count. Zero before the
+         #   first +#upsert+.
+         def count
+           return 0 if @collection_id.nil? && !collection_exists?
+
+           response = @connection.get("#{collection_path}/count")
+           unless response.status == 200
+             raise "Backend::Chroma: GET #{collection_path}/count returned " \
+                   "HTTP #{response.status}: #{response.body.inspect}"
+           end
+
+           body = response.body
+           # Chroma v2 returns the count as a bare integer.
+           return body if body.is_a?(Integer)
+           return body['count'] if body.is_a?(Hash) && body['count'].is_a?(Integer)
+
+           raise "Backend::Chroma: count response was not an Integer (got #{body.inspect})"
+         end
+
+         private
+
+         def collections_path
+           "/api/v2/tenants/#{@tenant}/databases/#{@database}/collections"
+         end
+
+         def collection_path
+           raise 'Backend::Chroma: collection_id not yet resolved' unless @collection_id
+
+           "#{collections_path}/#{@collection_id}"
+         end
+
+         # Idempotent get-or-create against Chroma. Sets
+         # +@collection_id+ on success. Returns the id String.
+         def ensure_collection!
+           return @collection_id if @collection_id
+
+           body = post_json(collections_path, {
+             name: @collection_name,
+             configuration: { hnsw: { space: 'cosine' } },
+             get_or_create: true
+           })
+
+           id = body.is_a?(Hash) ? body['id'] : nil
+           raise "Backend::Chroma: collection-create response missing 'id' (got #{body.inspect})" unless id
+
+           @collection_id = id
+         end
+
+         # Probe whether the collection exists without creating
+         # it — used by +#query+ / +#count+ / +#delete_all+ to
+         # short-circuit when the user queries before any upsert
+         # has happened. Side-effect: caches +@collection_id+
+         # when found.
+         def collection_exists?
+           response = @connection.get(collections_path)
+           unless response.status == 200
+             raise "Backend::Chroma: GET #{collections_path} returned " \
+                   "HTTP #{response.status}: #{response.body.inspect}"
+           end
+
+           list = response.body
+           return false unless list.is_a?(Array)
+
+           match = list.find { |c| c.is_a?(Hash) && c['name'] == @collection_name }
+           return false unless match
+
+           @collection_id = match['id']
+           true
+         rescue Faraday::Error => e
+           raise "Backend::Chroma: #{e.class.name.split('::').last} " \
+                 "calling #{collections_path}: #{e.message}"
+         end
+
+         # Tiny JSON-POST helper shared by upsert / query /
+         # ensure_collection. Centralises the error-shape and
+         # Faraday::Error wrapping.
+         def post_json(path, body)
+           response = @connection.post(path) do |req|
+             req.headers['Content-Type'] = 'application/json'
+             req.body = body
+           end
+
+           unless [200, 201].include?(response.status)
+             raise "Backend::Chroma: POST #{path} returned " \
+                   "HTTP #{response.status}: #{response.body.inspect}"
+           end
+
+           response.body
+         rescue Faraday::Error => e
+           raise "Backend::Chroma: #{e.class.name.split('::').last} " \
+                 "calling #{path}: #{e.message}"
+         end
+       end
+     end
+   end
+ end
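The metadata key normalization and the distance-to-similarity conversion described in the Chroma backend's comments can be tried without a server. The sketch below is a standalone approximation under stated assumptions: `ChunkSketch`, `to_chroma_metadata`, and `from_chroma` are hypothetical names, a `Struct` stands in for the gem's `Chunk`, and a JSON encode/decode stands in for the HTTP round trip.

```ruby
require 'json'

# Hypothetical stand-in for the gem's Chunk value object.
ChunkSketch = Struct.new(:id, :source, :text, :metadata, keyword_init: true)

# Outbound: +source+ rides as a reserved metadata key, user keys are stringified.
def to_chroma_metadata(chunk)
  base = { 'source' => chunk.source }
  chunk.metadata.each { |k, v| base[k.to_s] = v }
  base
end

# Inbound: pull +source+ back out, symbolize the rest, and convert
# cosine distance (0 = identical) to similarity via 1 - distance.
def from_chroma(id:, document:, metadata:, distance:)
  meta = metadata.dup
  source = meta.delete('source') || ''
  chunk = ChunkSketch.new(id: id, source: source, text: document,
                          metadata: meta.transform_keys(&:to_sym))
  [chunk, 1.0 - distance.to_f]
end

original = ChunkSketch.new(id: 'c1', source: 'notes/a.md', text: 'hello',
                           metadata: { page: 3 })
# JSON encode/decode simulates what the wire does to Symbol keys.
wire = JSON.parse(JSON.generate(to_chroma_metadata(original)))
restored, score = from_chroma(id: 'c1', document: 'hello', metadata: wire, distance: 0.25)

puts restored == original # => true
puts score                # => 0.75
```

One consequence of this shape, visible in the `#upsert` code as well: a user metadata key named `source` (or `:source`) would overwrite the reserved key on the way in and be dropped from the metadata Hash on the way out, so callers would probably want to avoid that key name.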
@@ -0,0 +1,149 @@
+ # frozen_string_literal: true
+
+ module Pikuri
+   module VectorDb
+     module Backend
+       # Pure-Ruby vector store. The educational default backend;
+       # IDEAS.md §"Vector DB / RAG" frames it as the "small enough
+       # to audit" first stop the demo + guide walk through before
+       # promoting users to +Chroma+ for persistence.
+       #
+       # == What it does
+       #
+       # Holds an in-memory Hash from chunk id to +[Chunk, vector]+;
+       # +#query+ computes cosine similarity against every stored
+       # vector, sorts descending, returns the top-k as
+       # +Backend::Result+ instances. O(n) per query, where n is
+       # the number of stored chunks. Fine for thousands of chunks
+       # (a personal notes folder, a single product's docs); slow
+       # for millions (a full corporate knowledge base — that's the
+       # +Chroma+ use case).
+       #
+       # == What it deliberately doesn't do
+       #
+       # * **No persistence.** RAM-only, intentional — the user who
+       #   wants persistence picks +Chroma+. Reloads from sources on
+       #   every boot, which makes the in-memory backend the natural
+       #   teaching shape: the same code path the demo binary walks
+       #   on startup is the one the user inspects when they're
+       #   learning what "indexing" actually means.
+       # * **No approximate search.** Exhaustive scan. Approximate
+       #   nearest neighbor (HNSW, IVF) adds complexity that doesn't
+       #   teach anything additional once the cosine math is clear.
+       # * **No thread safety.** {Indexer} runs single-threaded
+       #   during a boot or reindex; {Search} calls +#query+ from
+       #   the agent's main thread. No concurrent access today.
+       #
+       # == Cosine, not dot product
+       #
+       # Some embedders return pre-normalized vectors (text-embedding-3,
+       # most sentence-transformers); others don't. Cosine normalizes
+       # at compute time, so the backend works regardless of whether
+       # the embedder did. The readable two-pass form below (compute
+       # dot + magnitudes separately) is intentional over the
+       # single-loop micro-optimization — this is the file the
+       # newcomer reads to understand what's happening.
+       class InMemory
+         # @return [InMemory]
+         def initialize
+           # id (String) → [Chunk, vector (Array<Float>)]
+           @entries = {}
+           # Dimension of every stored vector. +nil+ before the first
+           # +#upsert+; locked to the dim of the first vector seen and
+           # enforced for every subsequent +#upsert+ + +#query+ — see
+           # the Backend protocol's "Vector-dim contract" yardoc.
+           @dim = nil
+         end
+
+         # Insert-or-replace by +chunk.id+. Parallel arrays of
+         # equal length; raises on empty input or length mismatch.
+         # Vector dimension is locked at first upsert; raises on
+         # any subsequent vector of a different dim.
+         #
+         # @param chunks [Array<Chunk>]
+         # @param vectors [Array<Array<Float>>]
+         # @return [void]
+         # @raise [ArgumentError] on empty input, length mismatch,
+         #   or vector-dim mismatch.
+         def upsert(chunks:, vectors:)
+           raise ArgumentError, 'upsert called with empty chunks/vectors' if chunks.empty?
+           if chunks.size != vectors.size
+             raise ArgumentError, "size mismatch: #{chunks.size} chunks vs #{vectors.size} vectors"
+           end
+
+           expected = @dim || vectors.first.size
+           vectors.each_with_index do |v, i|
+             next if v.size == expected
+
+             raise ArgumentError, "vector #{i} has dim #{v.size}, expected #{expected}"
+           end
+           @dim ||= expected
+
+           chunks.zip(vectors).each { |chunk, vector| @entries[chunk.id] = [chunk, vector] }
+           nil
+         end
+
+         # Cosine-similarity nearest neighbor search. Returns the
+         # top-k {Backend::Result}s in descending score order;
+         # empty array when the store has no entries.
+         #
+         # @param vector [Array<Float>] query vector; must match
+         #   the stored vector dim.
+         # @param top_k [Integer] number of results to return;
+         #   must be positive.
+         # @return [Array<Backend::Result>]
+         # @raise [ArgumentError] on +top_k+ <= 0 or query-vector
+         #   dim mismatch.
+         def query(vector:, top_k:)
+           raise ArgumentError, "top_k must be positive (got #{top_k})" if top_k <= 0
+           return [] if @entries.empty?
+
+           if vector.size != @dim
+             raise ArgumentError, "query vector dim #{vector.size}, stored dim #{@dim}"
+           end
+
+           scored = @entries.values.map do |chunk, stored|
+             Result.new(chunk: chunk, score: cosine(vector, stored))
+           end
+           scored.sort_by { |r| -r.score }.first(top_k)
+         end
+
+         # Drop every stored chunk. Used by the v1 nuke-and-reload
+         # reindex flow; the embedder dim lock is also released so
+         # a reindex with a different embedder model starts clean.
+         #
+         # @return [void]
+         def delete_all
+           @entries.clear
+           @dim = nil
+           nil
+         end
+
+         # @return [Integer] current chunk count.
+         def count
+           @entries.size
+         end
+
+         private
+
+         # Cosine similarity. Two-pass form is the readable shape;
+         # micro-optimizing into a single loop saves an array
+         # traversal but obscures what's happening for the reader
+         # who's here to learn how a vector store works.
+         #
+         # @param a [Array<Float>]
+         # @param b [Array<Float>]
+         # @return [Float] cosine in +[-1.0, 1.0]+, or +0.0+ if
+         #   either vector is zero (degenerate but valid input).
+         def cosine(a, b)
+           dot = a.zip(b).sum { |x, y| x * y }
+           mag_a = Math.sqrt(a.sum { |x| x * x })
+           mag_b = Math.sqrt(b.sum { |x| x * x })
+           return 0.0 if mag_a.zero? || mag_b.zero?
+
+           dot / (mag_a * mag_b)
+         end
+       end
+     end
+   end
+ end
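To make the exhaustive cosine scan concrete, here is a small standalone sketch that copies the `cosine` helper from the file above and runs a top-k query over four hand-picked 2-D vectors. The `top_k` helper, the plain `Hash` store, and the vector names are illustrative assumptions; the real backend stores `Chunk` objects and returns `Backend::Result` instances.

```ruby
# Same arithmetic as InMemory#cosine: dot product divided by the
# product of the two magnitudes, with 0.0 for a zero vector.
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  mag_a = Math.sqrt(a.sum { |x| x * x })
  mag_b = Math.sqrt(b.sum { |x| x * x })
  return 0.0 if mag_a.zero? || mag_b.zero?

  dot / (mag_a * mag_b)
end

# Exhaustive O(n) scan: score every stored vector, sort descending, keep k.
# (Hypothetical helper; ids map straight to vectors to keep the sketch short.)
def top_k(query, stored, k)
  stored.map { |id, vec| [id, cosine(query, vec)] }
        .sort_by { |_id, score| -score }
        .first(k)
end

stored = {
  'parallel'   => [2.0, 0.0],  # same direction, different length -> 1.0
  'diagonal'   => [1.0, 1.0],  # 45 degrees away -> about 0.707
  'orthogonal' => [0.0, 3.0],  # 90 degrees away -> 0.0
  'opposite'   => [-1.0, 0.0]  # 180 degrees away -> -1.0
}

top_k([1.0, 0.0], stored, 2).each do |id, score|
  puts "#{id}: #{score.round(3)}"
end
```

The `parallel` vector scores 1.0 even though it is twice as long as the query, which is the property the "Cosine, not dot product" section relies on: magnitude is divided out at compute time, so unnormalized embeddings rank the same as normalized ones.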
@@ -0,0 +1,46 @@
+ # frozen_string_literal: true
+
+ module Pikuri
+   module VectorDb
+     module Backend
+       # A single query hit: the {Chunk} that matched + its cosine
+       # similarity score. Returned in descending-score order from
+       # any backend's +#query+ method.
+       #
+       # == Fields
+       #
+       # * +chunk+ — the stored {Chunk} (id, source, text, metadata).
+       # * +score+ — cosine similarity as +Float+, range
+       #   +[-1.0, 1.0]+. For typical sentence-embedder vectors,
+       #   scores between natural-language texts usually stay in
+       #   +[0.0, 1.0]+ in practice, although negative values are
+       #   possible. 1.0 means the two vectors point in the same
+       #   direction; 0.0 is orthogonal.
+       #
+       # == Citation
+       #
+       # {Search} formats hits for the LLM as
+       # +result.chunk.source+ (the citation — relative path, URL,
+       # or doc ID — see {Chunk}'s +source+ field) +
+       # +result.chunk.text+ (the snippet) + the score. The +id+
+       # field is opaque and never surfaced; the +metadata+ Hash
+       # carries optional extras like offset / page / anchor for
+       # callers that want to deep-link.
+       #
+       # The +score+ is cosine *as returned by a backend's
+       # +#query+*. {Search} substitutes the reranker's relevance
+       # score here (via +Data#with+) before formatting when a
+       # reranker reorders the candidates, so a surfaced score is
+       # cosine only in vector-only mode — see {Search}'s
+       # "Which score is shown" section.
+       #
+       # == Why a Data.define
+       #
+       # Same convention as {Chunk}. A tuple +[chunk, score]+ or
+       # a +{chunk:, score:}+ Hash would work at runtime but
+       # +result.chunk+ / +result.score+ reads cleaner at call
+       # sites, and value-equality is occasionally useful in specs.
+       Result = Data.define(:chunk, :score)
+     end
+   end
+ end
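The comment above mentions that `Search` swaps in the reranker's relevance score via `Data#with`. A minimal sketch of that pattern follows, assuming Ruby 3.2 or newer (where `Data` is available). `CitedChunk`, `Hit`, and `apply_rerank` are hypothetical names for illustration, and the relevance scores are made-up inputs; how `Search` actually obtains and applies them is not shown in this diff.

```ruby
# Hypothetical stand-ins for the gem's Chunk and Backend::Result.
CitedChunk = Data.define(:id, :source, :text, :metadata)
Hit = Data.define(:chunk, :score)

# Replace each hit's cosine score with a reranker relevance score and
# re-sort. Data instances are immutable, so #with returns a new Hit and
# the original candidates are left untouched.
def apply_rerank(hits, relevance_scores)
  hits.zip(relevance_scores)
      .map { |hit, relevance| hit.with(score: relevance) }
      .sort_by { |hit| -hit.score }
end

a = CitedChunk.new(id: 'a', source: 'notes/a.md', text: 'alpha', metadata: {})
b = CitedChunk.new(id: 'b', source: 'notes/b.md', text: 'beta', metadata: {})
vector_hits = [Hit.new(chunk: a, score: 0.80), Hit.new(chunk: b, score: 0.75)]

reranked = apply_rerank(vector_hits, [0.10, 0.95])
reranked.each { |hit| puts "#{hit.chunk.source}: #{hit.score}" }

# Value equality, the other property the comment calls out:
puts Hit.new(chunk: a, score: 0.80) == vector_hits.first # => true
```

Because the substituted score lives in the same `score` field, downstream formatting code does not need to know which scorer produced it, which matches the "cosine only in vector-only mode" caveat above.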