pikuri-vectordb 0.0.4
- checksums.yaml +7 -0
- data/README.md +121 -0
- data/lib/pikuri/vector_db/backend/chroma.rb +350 -0
- data/lib/pikuri/vector_db/backend/in_memory.rb +149 -0
- data/lib/pikuri/vector_db/backend/result.rb +46 -0
- data/lib/pikuri/vector_db/backend.rb +47 -0
- data/lib/pikuri/vector_db/chroma_server.rb +309 -0
- data/lib/pikuri/vector_db/chunk.rb +60 -0
- data/lib/pikuri/vector_db/chunker/fixed_window.rb +150 -0
- data/lib/pikuri/vector_db/chunker.rb +48 -0
- data/lib/pikuri/vector_db/embedder.rb +120 -0
- data/lib/pikuri/vector_db/extension.rb +151 -0
- data/lib/pikuri/vector_db/indexer.rb +257 -0
- data/lib/pikuri/vector_db/librarian.rb +66 -0
- data/lib/pikuri/vector_db/reindex.rb +86 -0
- data/lib/pikuri/vector_db/reranker/hit.rb +35 -0
- data/lib/pikuri/vector_db/reranker/llama_server.rb +146 -0
- data/lib/pikuri/vector_db/reranker.rb +55 -0
- data/lib/pikuri/vector_db/search.rb +201 -0
- data/lib/pikuri/vector_db/tokenizer/char_heuristic.rb +60 -0
- data/lib/pikuri/vector_db/tokenizer/llama_server.rb +104 -0
- data/lib/pikuri/vector_db/tokenizer.rb +57 -0
- data/lib/pikuri-vectordb.rb +91 -0
- data/prompts/persona-librarian.txt +12 -0
- data/prompts/pikuri-librarian.txt +22 -0
- metadata +122 -0
checksums.yaml
ADDED
|
@@ -0,0 +1,7 @@
|
|
|
1
|
+
---
|
|
2
|
+
SHA256:
|
|
3
|
+
metadata.gz: 3a64900380a7c48712b1ea58053f141d1c6f5da57e723925ca24319899e91607
|
|
4
|
+
data.tar.gz: 5ed708f037e0cd979e1a2018b9fabf0d996679befbcca69def2d58ce8aac8fd4
|
|
5
|
+
SHA512:
|
|
6
|
+
metadata.gz: ada1509afe42590bae569b032cc88fe4ae2ed724a152388fc710b3fe5275f256fee2af6e2db29bb82469d11d651698a653addf10a7679df4b57acadaaab76720
|
|
7
|
+
data.tar.gz: d25268e97f43a4c73fb4eea1e957a3c9129b4d543e3c78e767318cc784eb0898ec3ba647210b6d969521a2214c6934c5f2aadb387e32f98ca655ff5eee0bb92c
|
data/README.md
ADDED
|
@@ -0,0 +1,121 @@
|
|
|
1
|
+
# pikuri-vectordb
|
|
2
|
+
|
|
3
|
+
Local-corpus vector search + agentic RAG for the
|
|
4
|
+
[pikuri](https://codeberg.org/mvysny/pikuri) AI-assistant toolkit.
|
|
5
|
+
|
|
6
|
+
> **Status:** skeleton — gem scaffolding only. The
|
|
7
|
+
> `Pikuri::VectorDb::Extension` and `vectordb_search` tool are
|
|
8
|
+
> being built in subsequent commits. See `IDEAS.md` §"Vector DB /
|
|
9
|
+
> RAG" for the design.
|
|
10
|
+
|
|
11
|
+
Will provide:
|
|
12
|
+
|
|
13
|
+
- `Pikuri::VectorDb::Extension` — wires a `vectordb_search` tool +
|
|
14
|
+
a `vectordb_reindex` tool onto a `Pikuri::Agent` via
|
|
15
|
+
`c.add_extension(...)` inside the `Agent.new` block.
|
|
16
|
+
- `Pikuri::VectorDb::Backend::InMemory` — pure-Ruby cosine over
|
|
17
|
+
`Array<Float>`. The educational default; reads in ~40 lines.
|
|
18
|
+
RAM-only; everything reloads from sources on every boot.
|
|
19
|
+
- `Pikuri::VectorDb::Backend::Chroma` — thin Faraday HTTP client
|
|
20
|
+
against a self-hosted ChromaDB. The persistent option.
|
|
21
|
+
- `Pikuri::VectorDb::Embedder` — thin wrapper over `RubyLLM.embed`
|
|
22
|
+
so tests can inject a fake without monkey-patching ruby_llm.
|
|
23
|
+
- `Pikuri::VectorDb::Reranker::LlamaServer` — optional quality
|
|
24
|
+
knob. Speaks `/v1/rerank` against a cross-encoder model on a
|
|
25
|
+
llama.cpp server. Passing `reranker: nil` to the extension
|
|
26
|
+
skips reranking; retrieval falls back to vector-only top-k.
|
|
27
|
+
- `Pikuri::VectorDb::Chunker::FixedWindow` + `Tokenizer::*` —
|
|
28
|
+
the chunking pipeline. Tokenizer is a duck-typed protocol
|
|
29
|
+
(`count(text) -> Integer`) with two impls in v1:
|
|
30
|
+
`Tokenizer::CharHeuristic` (default, ~4 chars/token rule) and
|
|
31
|
+
`Tokenizer::LlamaServer` (POST `/tokenize` against the
|
|
32
|
+
embedder's endpoint).
|
|
33
|
+
- Text extraction reuses `Pikuri::FileType.read_as_text` from
|
|
34
|
+
pikuri-core — plain text / Markdown / PDF. HTML extraction
|
|
35
|
+
is a deferred follow-up; v1 corpora skew toward Markdown
|
|
36
|
+
notes and PDF docs in practice.
|
|
37
|
+
- `Pikuri::VectorDb::LIBRARIAN` — bundled
|
|
38
|
+
`Pikuri::SubAgent::Persona` constant. Hosts wire it via
|
|
39
|
+
`SubAgent::Extension.new(personas: [..., LIBRARIAN])` — same
|
|
40
|
+
shape `pikuri-code` uses for `GIT_REPO_RESEARCHER`.
|
|
41
|
+
|
|
42
|
+
## Install
|
|
43
|
+
|
|
44
|
+
```ruby
|
|
45
|
+
# Gemfile
|
|
46
|
+
gem 'pikuri-vectordb'
|
|
47
|
+
```
|
|
48
|
+
|
|
49
|
+
## Usage (preview — not yet wired)
|
|
50
|
+
|
|
51
|
+
```ruby
|
|
52
|
+
require 'pikuri-core'
|
|
53
|
+
require 'pikuri-vectordb'
|
|
54
|
+
|
|
55
|
+
backend = Pikuri::VectorDb::Backend::InMemory.new
|
|
56
|
+
# Or for persistent storage:
|
|
57
|
+
# backend = Pikuri::VectorDb::Backend::Chroma.new(
|
|
58
|
+
# host: 'localhost', port: 8000, collection: 'my-docs',
|
|
59
|
+
# )
|
|
60
|
+
agent = Pikuri::Agent.new(transport: ..., system_prompt: ...) do |c|
|
|
61
|
+
c.add_extension(
|
|
62
|
+
Pikuri::VectorDb::Extension.new(
|
|
63
|
+
backend: backend,
|
|
64
|
+
source: '~/notes',
|
|
65
|
+
)
|
|
66
|
+
)
|
|
67
|
+
end
|
|
68
|
+
```
|
|
69
|
+
|
|
70
|
+
Collection naming is Chroma-specific so it lives on
|
|
71
|
+
`Backend::Chroma.new(collection:)`, not on the Extension —
|
|
72
|
+
`Backend::InMemory` has no collection concept.
|
|
73
|
+
|
|
74
|
+
For hosts that want recall behind a privilege-separated sub-agent
|
|
75
|
+
(the trifecta-defense pattern — see `SECURITY.md` and `IDEAS.md`
|
|
76
|
+
§"Vector DB / RAG"), additionally wire the `LIBRARIAN` persona
|
|
77
|
+
via `pikuri-subagents`:
|
|
78
|
+
|
|
79
|
+
```ruby
|
|
80
|
+
require 'pikuri-subagents'
|
|
81
|
+
|
|
82
|
+
c.add_extension(
|
|
83
|
+
Pikuri::SubAgent::Extension.new(
|
|
84
|
+
personas: [Pikuri::VectorDb::LIBRARIAN]
|
|
85
|
+
)
|
|
86
|
+
)
|
|
87
|
+
```
|
|
88
|
+
|
|
89
|
+
## Three model endpoints
|
|
90
|
+
|
|
91
|
+
A full assistant setup wants three LLM endpoints: chat (via
|
|
92
|
+
`ruby_llm`), an embedder (via `RubyLLM.embed`), and an optional
|
|
93
|
+
reranker (HTTP `/v1/rerank`). Recommended setup: **one
|
|
94
|
+
`llama-server` running in router mode** — started with no
|
|
95
|
+
`--model` flag, it serves every GGUF in `~/.cache/llama.cpp/`
|
|
96
|
+
from a single port and loads whichever model each request asks
|
|
97
|
+
for. Requires a recent enough `llama.cpp` build to include the
|
|
98
|
+
[model-management feature](https://huggingface.co/blog/ggml-org/model-management-in-llamacpp);
|
|
99
|
+
Ubuntu 26.04+ packages one. The guide's
|
|
100
|
+
[chapter 1](../docs/guide/01-chat.md) walks through the setup;
|
|
101
|
+
[chapter 3](../docs/guide/03-vectordb.md) adds the embedder and
|
|
102
|
+
reranker on top.
|
|
103
|
+
|
|
104
|
+
If you'd rather pin the reranker in its own process — to avoid
|
|
105
|
+
paying the router's unload/reload cost on rerank requests —
|
|
106
|
+
`Reranker::LlamaServer` takes its own `endpoint:` argument and
|
|
107
|
+
can point at a separate `llama-server`. Otherwise pikuri stays
|
|
108
|
+
agnostic: it just needs URLs.
|
|
109
|
+
|
|
110
|
+
Larger multi-model runtimes (Ollama, LM Studio, ...) expose
|
|
111
|
+
OpenAI-compatible endpoints and would also work, but pikuri's
|
|
112
|
+
"small enough to audit" ethos keeps the recommended path on
|
|
113
|
+
`llama.cpp` alone.
|
|
114
|
+
|
|
115
|
+
## Further reading
|
|
116
|
+
|
|
117
|
+
- **Design notes:** `IDEAS.md` §"Vector DB / RAG" at the repo
|
|
118
|
+
root.
|
|
119
|
+
- **API reference:** browse the YARD docs at
|
|
120
|
+
<https://rubydoc.info/gems/pikuri-vectordb> (once published),
|
|
121
|
+
or run `bundle exec yard` in this directory for a local copy.
|
|
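Reviewer sketch (not part of the package): the README above describes the tokenizer as a duck-typed protocol, `count(text) -> Integer`, with a default based on a rule of roughly 4 characters per token. The block below illustrates that shape only. The class name `SketchCharTokenizer`, the helper `fits_budget?`, and the round-up rule are assumptions made for illustration; the README does not say how `Tokenizer::CharHeuristic` rounds.

```ruby
# Hypothetical stand-in for a tokenizer that satisfies the duck-typed
# protocol from the README: any object responding to count(text) -> Integer.
# The class name and the rounding rule are assumptions for illustration,
# not the gem's Tokenizer::CharHeuristic implementation.
class SketchCharTokenizer
  CHARS_PER_TOKEN = 4 # the "~4 chars/token" rule of thumb

  # @param text [String]
  # @return [Integer] estimated token count, rounded up
  def count(text)
    (text.length.to_f / CHARS_PER_TOKEN).ceil
  end
end

# A caller that relies only on #count, so any tokenizer can be injected.
def fits_budget?(tokenizer, text, max_tokens)
  tokenizer.count(text) <= max_tokens
end

tokenizer = SketchCharTokenizer.new
puts tokenizer.count('hello world')            # 11 chars -> 3
puts fits_budget?(tokenizer, 'hello world', 2) # false
```

Because callers depend only on `#count`, a server-backed tokenizer could presumably be swapped in without touching the chunking code, which appears to be the point of keeping the protocol duck-typed.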
data/lib/pikuri/vector_db/backend/chroma.rb
ADDED
|
@@ -0,0 +1,350 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
require 'faraday'
|
|
4
|
+
require 'json'
|
|
5
|
+
|
|
6
|
+
module Pikuri
|
|
7
|
+
module VectorDb
|
|
8
|
+
module Backend
|
|
9
|
+
# Thin Faraday HTTP client against a self-hosted Chroma
|
|
10
|
+
# server (v2 API). The persistent backend, behind the same
|
|
11
|
+
# duck-typed {Backend} protocol as {InMemory}: same method
|
|
12
|
+
# names, same return shapes, same +ArgumentError+ contract
|
|
13
|
+
# on empty input + non-positive +top_k+. Where the two
|
|
14
|
+
# diverge is the vector-dim contract — see below.
|
|
15
|
+
#
|
|
16
|
+
# == Two ways to get one
|
|
17
|
+
#
|
|
18
|
+
# * **Bring your own.** +Backend::Chroma.new(host:, port:,
|
|
19
|
+
# collection:)+ against an existing chroma deployment
|
|
20
|
+
# (production cluster, docker-compose stack, a chroma
|
|
21
|
+
# already running on the host for an unrelated project).
|
|
22
|
+
# The host owns the process; this class is purely the
|
|
23
|
+
# HTTP client.
|
|
24
|
+
# * **Let pikuri manage it.** {ChromaServer.ensure_running}
|
|
25
|
+
# spawns and supervises a chroma container under the
|
|
26
|
+
# +pikuri-internal-chroma+ name, against a pinned image,
|
|
27
|
+
# with a bind-mounted volume in the user's cache dir.
|
|
28
|
+
# Its +#client(collection:)+ returns a +Backend::Chroma+
|
|
29
|
+
# pre-pointed at the supervised container. The split is
|
|
30
|
+
# deliberate: docker lifecycle and HTTP wire protocol
|
|
31
|
+
# have nothing in common, so each lives in its own class.
|
|
32
|
+
#
|
|
33
|
+
# == Chroma v2 API
|
|
34
|
+
#
|
|
35
|
+
# Endpoints used:
|
|
36
|
+
#
|
|
37
|
+
# * +POST /api/v2/tenants/{tenant}/databases/{db}/collections+
|
|
38
|
+
# with +get_or_create: true+ — idempotent collection
|
|
39
|
+
# creation. Returns +{id, name, ...}+.
|
|
40
|
+
# * +POST /api/v2/.../collections/{id}/upsert+ — insert or
|
|
41
|
+
# replace by id. Body carries parallel arrays of +ids+,
|
|
42
|
+
# +embeddings+, +documents+, +metadatas+.
|
|
43
|
+
# * +POST /api/v2/.../collections/{id}/query+ — k-NN
|
|
44
|
+
# search. Body: +{query_embeddings, n_results, include}+.
|
|
45
|
+
# * +GET /api/v2/.../collections/{id}/count+ — integer
|
|
46
|
+
# count.
|
|
47
|
+
# * +DELETE /api/v2/.../collections/{id}+ — drop the
|
|
48
|
+
# collection (used by +#delete_all+).
|
|
49
|
+
#
|
|
50
|
+
# == BYO embeddings (not Chroma's embedder)
|
|
51
|
+
#
|
|
52
|
+
# Chroma collections can carry an embedding function in
|
|
53
|
+
# their metadata — Chroma's term for what pikuri calls an
|
|
54
|
+
# {Embedder}. When configured, +add+ / +query+ accept raw
|
|
55
|
+
# text via +documents+ / +query_texts+ and Chroma embeds
|
|
56
|
+
# server-side. We deliberately don't use this: pikuri's
|
|
57
|
+
# +Embedder+ is the one source of truth for embedder
|
|
58
|
+
# choice, the provider-cliff visibility lives in pikuri's
|
|
59
|
+
# config, and a parallel Chroma-side embedder config would
|
|
60
|
+
# split the truth without pikuri noticing (e.g. local
|
|
61
|
+
# embedder in pikuri + +OpenAIEmbeddingFunction+ in Chroma
|
|
62
|
+
# — every indexed document silently lands at OpenAI). We
|
|
63
|
+
# always send pre-computed +embeddings+; Chroma's
|
|
64
|
+
# collection embedder is never invoked.
|
|
65
|
+
#
|
|
66
|
+
# == Vector-dim contract diverges from InMemory
|
|
67
|
+
#
|
|
68
|
+
# +InMemory+ enforces vector-dim consistency client-side
|
|
69
|
+
# (locks on first upsert, raises +ArgumentError+ on
|
|
70
|
+
# mismatch). +Chroma+ enforces server-side — first upsert
|
|
71
|
+
# to a collection establishes the dim; mismatched
|
|
72
|
+
# subsequent upserts produce HTTP 4xx which propagates
|
|
73
|
+
# as +RuntimeError+. Different exception class, same
|
|
74
|
+
# loud-failure shape. Documented divergence; not worth
|
|
75
|
+
# parsing Chroma's error envelope to coerce to +ArgumentError+.
|
|
76
|
+
#
|
|
77
|
+
# == Lazy collection resolution
|
|
78
|
+
#
|
|
79
|
+
# +Backend::Chroma.new+ doesn't talk to the server. The
|
|
80
|
+
# first +#upsert+ / +#query+ / +#count+ call resolves
|
|
81
|
+
# (and creates if missing) the collection by name, caches
|
|
82
|
+
# the id, and uses it thereafter. +#delete_all+ drops the
|
|
83
|
+
# collection and clears the cached id; the next +#upsert+
|
|
84
|
+
# re-creates from scratch.
|
|
85
|
+
#
|
|
86
|
+
# == Cosine distance (matches InMemory)
|
|
87
|
+
#
|
|
88
|
+
# Collection is created with +hnsw.space: 'cosine'+.
|
|
89
|
+
# Chroma returns cosine *distance* (range +[0, 2]+ where
|
|
90
|
+
# +0+ = identical, +1+ = orthogonal); +#query+ converts
|
|
91
|
+
# to similarity via +1 - distance+ so the {Backend::Result}
|
|
92
|
+
# score has the same meaning across backends.
|
|
93
|
+
#
|
|
94
|
+
# == Metadata key normalization
|
|
95
|
+
#
|
|
96
|
+
# Chroma serializes through JSON, so Symbol metadata keys
|
|
97
|
+
# become Strings on round-trip. +#upsert+ converts the
|
|
98
|
+
# incoming {Chunk}'s +metadata+ keys to Strings before
|
|
99
|
+
# sending; +#query+ converts them back to Symbols on the
|
|
100
|
+
# way out, so the {Chunk} a caller pulls from a query
|
|
101
|
+
# looks identical to one stored in InMemory. +source+
|
|
102
|
+
# rides as a special metadata key (Chroma has no native
|
|
103
|
+
# +source+ concept).
|
|
104
|
+
#
|
|
105
|
+
# == Testing posture
|
|
106
|
+
#
|
|
107
|
+
# Specs use +Faraday::Adapter::Test+ stubs only — they
|
|
108
|
+
# verify "we send what we think we're sending" against
|
|
109
|
+
# the v2 API shape but don't catch real-Chroma protocol
|
|
110
|
+
# drift. Real-Chroma smoke testing is wired into the demo
|
|
111
|
+
# binary in a later phase. Targets Chroma 0.5.x+ (v2 API).
|
|
112
|
+
class Chroma
|
|
113
|
+
# @param host [String]
|
|
114
|
+
# @param port [Integer]
|
|
115
|
+
# @param collection [String] collection name in Chroma.
|
|
116
|
+
# This is a Chroma-specific identifier, so it lives
|
|
117
|
+
# here rather than on +VectorDb::Extension+ (where
|
|
118
|
+
# it'd be a no-op for +Backend::InMemory+).
|
|
119
|
+
# @param tenant [String] Chroma v2 tenant; defaults to
|
|
120
|
+
# Chroma's own default.
|
|
121
|
+
# @param database [String] Chroma v2 database; defaults
|
|
122
|
+
# to Chroma's own default.
|
|
123
|
+
# @param connection [Faraday::Connection, nil] optional
|
|
124
|
+
# dependency-injection point for tests.
|
|
125
|
+
# @return [Chroma]
|
|
126
|
+
# @raise [ArgumentError] on empty +host+ or empty
|
|
127
|
+
# +collection+.
|
|
128
|
+
def initialize(host:, port:, collection:,
|
|
129
|
+
tenant: 'default_tenant',
|
|
130
|
+
database: 'default_database',
|
|
131
|
+
connection: nil)
|
|
132
|
+
raise ArgumentError, 'host must be non-empty' if host.nil? || host.to_s.empty?
|
|
133
|
+
raise ArgumentError, 'collection must be non-empty' if collection.nil? || collection.to_s.empty?
|
|
134
|
+
|
|
135
|
+
@host = host
|
|
136
|
+
@port = port
|
|
137
|
+
@collection_name = collection
|
|
138
|
+
@tenant = tenant
|
|
139
|
+
@database = database
|
|
140
|
+
@collection_id = nil
|
|
141
|
+
@connection = connection || Faraday.new(url: "http://#{host}:#{port}") do |f|
|
|
142
|
+
f.request :json
|
|
143
|
+
f.response :json
|
|
144
|
+
f.adapter Faraday.default_adapter
|
|
145
|
+
end
|
|
146
|
+
end
|
|
147
|
+
|
|
148
|
+
# Insert-or-replace by +chunk.id+. Parallel arrays of
|
|
149
|
+
# equal length; raises on empty input or length mismatch
|
|
150
|
+
# (same contract as {InMemory}). Chroma server enforces
|
|
151
|
+
# vector-dim consistency; mismatched dims surface as
|
|
152
|
+
# +RuntimeError+ from a 4xx response (the InMemory
|
|
153
|
+
# backend raises +ArgumentError+ for the same case —
|
|
154
|
+
# documented divergence).
|
|
155
|
+
#
|
|
156
|
+
# @param chunks [Array<Chunk>]
|
|
157
|
+
# @param vectors [Array<Array<Float>>]
|
|
158
|
+
# @return [void]
|
|
159
|
+
# @raise [ArgumentError] on empty input or length mismatch.
|
|
160
|
+
# @raise [RuntimeError] on HTTP failure.
|
|
161
|
+
def upsert(chunks:, vectors:)
|
|
162
|
+
raise ArgumentError, 'upsert called with empty chunks/vectors' if chunks.empty?
|
|
163
|
+
if chunks.size != vectors.size
|
|
164
|
+
raise ArgumentError, "size mismatch: #{chunks.size} chunks vs #{vectors.size} vectors"
|
|
165
|
+
end
|
|
166
|
+
|
|
167
|
+
ensure_collection!
|
|
168
|
+
|
|
169
|
+
metadatas = chunks.map do |c|
|
|
170
|
+
# Serialize +source+ as a reserved key in Chroma's
|
|
171
|
+
# +metadata+; merge in the user's metadata Hash with
|
|
172
|
+
# keys stringified for JSON round-trip stability.
|
|
173
|
+
base = { 'source' => c.source }
|
|
174
|
+
c.metadata.each { |k, v| base[k.to_s] = v }
|
|
175
|
+
base
|
|
176
|
+
end
|
|
177
|
+
|
|
178
|
+
body = {
|
|
179
|
+
ids: chunks.map(&:id),
|
|
180
|
+
embeddings: vectors,
|
|
181
|
+
documents: chunks.map(&:text),
|
|
182
|
+
metadatas: metadatas
|
|
183
|
+
}
|
|
184
|
+
|
|
185
|
+
post_json("#{collection_path}/upsert", body)
|
|
186
|
+
nil
|
|
187
|
+
end
|
|
188
|
+
|
|
189
|
+
# k-NN query by cosine similarity. Returns at most
|
|
190
|
+
# +top_k+ {Backend::Result}s descending by score.
|
|
191
|
+
# +score+ is +1 - cosine_distance+ so the value matches
|
|
192
|
+
# {InMemory}'s cosine-similarity scale.
|
|
193
|
+
#
|
|
194
|
+
# @param vector [Array<Float>]
|
|
195
|
+
# @param top_k [Integer]
|
|
196
|
+
# @return [Array<Backend::Result>]
|
|
197
|
+
# @raise [ArgumentError] on non-positive +top_k+.
|
|
198
|
+
# @raise [RuntimeError] on HTTP failure.
|
|
199
|
+
def query(vector:, top_k:)
|
|
200
|
+
raise ArgumentError, "top_k must be positive (got #{top_k})" if top_k <= 0
|
|
201
|
+
|
|
202
|
+
# If we've never upserted, the collection doesn't
|
|
203
|
+
# exist yet — semantic answer is "no hits."
|
|
204
|
+
return [] if @collection_id.nil? && !collection_exists?
|
|
205
|
+
|
|
206
|
+
response_body = post_json("#{collection_path}/query", {
|
|
207
|
+
query_embeddings: [vector],
|
|
208
|
+
n_results: top_k,
|
|
209
|
+
include: %w[documents metadatas distances]
|
|
210
|
+
})
|
|
211
|
+
|
|
212
|
+
ids = (response_body['ids'] || [[]]).first || []
|
|
213
|
+
docs = (response_body['documents'] || [[]]).first || []
|
|
214
|
+
metas = (response_body['metadatas'] || [[]]).first || []
|
|
215
|
+
dists = (response_body['distances'] || [[]]).first || []
|
|
216
|
+
|
|
217
|
+
ids.each_with_index.map do |id, i|
|
|
218
|
+
meta = metas[i] || {}
|
|
219
|
+
# Pull +source+ back out of the metadata blob;
|
|
220
|
+
# symbolize the remaining keys for round-trip
|
|
221
|
+
# consistency with InMemory.
|
|
222
|
+
source = meta['source'] || ''
|
|
223
|
+
chunk_meta = {}
|
|
224
|
+
meta.each do |k, v|
|
|
225
|
+
next if k == 'source'
|
|
226
|
+
|
|
227
|
+
chunk_meta[k.to_sym] = v
|
|
228
|
+
end
|
|
229
|
+
|
|
230
|
+
chunk = Chunk.new(id: id, source: source, text: docs[i] || '', metadata: chunk_meta)
|
|
231
|
+
Result.new(chunk: chunk, score: 1.0 - dists[i].to_f)
|
|
232
|
+
end
|
|
233
|
+
end
|
|
234
|
+
|
|
235
|
+
# Drop the collection. Next +#upsert+ re-creates from
|
|
236
|
+
# scratch — that's the v1 nuke-and-reload reindex path
|
|
237
|
+
# the {Indexer} drives. No-op if no collection was ever
|
|
238
|
+
# created (consistent with {InMemory}'s clear-on-empty
|
|
239
|
+
# behaviour). 404 on the DELETE is treated as "already
|
|
240
|
+
# gone" — idempotent.
|
|
241
|
+
#
|
|
242
|
+
# @return [void]
|
|
243
|
+
def delete_all
|
|
244
|
+
return nil if @collection_id.nil? && !collection_exists?
|
|
245
|
+
|
|
246
|
+
response = @connection.delete(collection_path)
|
|
247
|
+
unless [200, 204, 404].include?(response.status)
|
|
248
|
+
raise "Backend::Chroma: DELETE #{collection_path} returned " \
|
|
249
|
+
"HTTP #{response.status}: #{response.body.inspect}"
|
|
250
|
+
end
|
|
251
|
+
@collection_id = nil
|
|
252
|
+
nil
|
|
253
|
+
end
|
|
254
|
+
|
|
255
|
+
# @return [Integer] current chunk count. Zero before the
|
|
256
|
+
# first +#upsert+.
|
|
257
|
+
def count
|
|
258
|
+
return 0 if @collection_id.nil? && !collection_exists?
|
|
259
|
+
|
|
260
|
+
response = @connection.get("#{collection_path}/count")
|
|
261
|
+
unless response.status == 200
|
|
262
|
+
raise "Backend::Chroma: GET #{collection_path}/count returned " \
|
|
263
|
+
"HTTP #{response.status}: #{response.body.inspect}"
|
|
264
|
+
end
|
|
265
|
+
|
|
266
|
+
body = response.body
|
|
267
|
+
# Chroma v2 returns the count as a bare integer.
|
|
268
|
+
return body if body.is_a?(Integer)
|
|
269
|
+
return body['count'] if body.is_a?(Hash) && body['count'].is_a?(Integer)
|
|
270
|
+
|
|
271
|
+
raise "Backend::Chroma: count response was not an Integer (got #{body.inspect})"
|
|
272
|
+
end
|
|
273
|
+
|
|
274
|
+
private
|
|
275
|
+
|
|
276
|
+
def collections_path
|
|
277
|
+
"/api/v2/tenants/#{@tenant}/databases/#{@database}/collections"
|
|
278
|
+
end
|
|
279
|
+
|
|
280
|
+
def collection_path
|
|
281
|
+
raise 'Backend::Chroma: collection_id not yet resolved' unless @collection_id
|
|
282
|
+
|
|
283
|
+
"#{collections_path}/#{@collection_id}"
|
|
284
|
+
end
|
|
285
|
+
|
|
286
|
+
# Idempotent get-or-create against Chroma. Sets
|
|
287
|
+
# +@collection_id+ on success. Returns the id String.
|
|
288
|
+
def ensure_collection!
|
|
289
|
+
return @collection_id if @collection_id
|
|
290
|
+
|
|
291
|
+
body = post_json(collections_path, {
|
|
292
|
+
name: @collection_name,
|
|
293
|
+
configuration: { hnsw: { space: 'cosine' } },
|
|
294
|
+
get_or_create: true
|
|
295
|
+
})
|
|
296
|
+
|
|
297
|
+
id = body.is_a?(Hash) ? body['id'] : nil
|
|
298
|
+
raise "Backend::Chroma: collection-create response missing 'id' (got #{body.inspect})" unless id
|
|
299
|
+
|
|
300
|
+
@collection_id = id
|
|
301
|
+
end
|
|
302
|
+
|
|
303
|
+
# Probe whether the collection exists without creating
|
|
304
|
+
# it — used by +#query+ / +#count+ / +#delete_all+ to
|
|
305
|
+
# short-circuit when the user queries before any upsert
|
|
306
|
+
# has happened. Side-effect: caches +@collection_id+
|
|
307
|
+
# when found.
|
|
308
|
+
def collection_exists?
|
|
309
|
+
response = @connection.get(collections_path)
|
|
310
|
+
unless response.status == 200
|
|
311
|
+
raise "Backend::Chroma: GET #{collections_path} returned " \
|
|
312
|
+
"HTTP #{response.status}: #{response.body.inspect}"
|
|
313
|
+
end
|
|
314
|
+
|
|
315
|
+
list = response.body
|
|
316
|
+
return false unless list.is_a?(Array)
|
|
317
|
+
|
|
318
|
+
match = list.find { |c| c.is_a?(Hash) && c['name'] == @collection_name }
|
|
319
|
+
return false unless match
|
|
320
|
+
|
|
321
|
+
@collection_id = match['id']
|
|
322
|
+
true
|
|
323
|
+
rescue Faraday::Error => e
|
|
324
|
+
raise "Backend::Chroma: #{e.class.name.split('::').last} " \
|
|
325
|
+
"calling #{collections_path}: #{e.message}"
|
|
326
|
+
end
|
|
327
|
+
|
|
328
|
+
# Tiny JSON-POST helper shared by upsert / query /
|
|
329
|
+
# ensure_collection. Centralises the error-shape and
|
|
330
|
+
# Faraday::Error wrapping.
|
|
331
|
+
def post_json(path, body)
|
|
332
|
+
response = @connection.post(path) do |req|
|
|
333
|
+
req.headers['Content-Type'] = 'application/json'
|
|
334
|
+
req.body = body
|
|
335
|
+
end
|
|
336
|
+
|
|
337
|
+
unless [200, 201].include?(response.status)
|
|
338
|
+
raise "Backend::Chroma: POST #{path} returned " \
|
|
339
|
+
"HTTP #{response.status}: #{response.body.inspect}"
|
|
340
|
+
end
|
|
341
|
+
|
|
342
|
+
response.body
|
|
343
|
+
rescue Faraday::Error => e
|
|
344
|
+
raise "Backend::Chroma: #{e.class.name.split('::').last} " \
|
|
345
|
+
"calling #{path}: #{e.message}"
|
|
346
|
+
end
|
|
347
|
+
end
|
|
348
|
+
end
|
|
349
|
+
end
|
|
350
|
+
end
|
|
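Reviewer sketch (not part of the package): the class comment above describes two conversions at the Chroma boundary, metadata keys stringified on the way in and symbolized on the way out, and cosine distance turned into similarity via `1 - distance`. The standalone block below mimics both with plain Ruby and a simulated JSON round trip. The helper names `to_wire_metadata`, `from_wire_metadata`, and `distance_to_similarity` are hypothetical and do not exist in the gem.

```ruby
require 'json'

# Hypothetical helpers mirroring the behaviour described in the class
# comment; the names are illustrative, not the gem's API.

# Flatten +source+ plus user metadata into one String-keyed Hash,
# the shape a JSON request body can carry.
def to_wire_metadata(source, metadata)
  wire = { 'source' => source }
  metadata.each { |key, value| wire[key.to_s] = value }
  wire
end

# Split a String-keyed Hash back into [source, Symbol-keyed metadata].
def from_wire_metadata(wire)
  source = wire['source'] || ''
  metadata = wire.reject { |key, _| key == 'source' }.transform_keys(&:to_sym)
  [source, metadata]
end

# Cosine distance (0 = identical, 1 = orthogonal, 2 = opposite)
# converted to cosine similarity in [-1, 1].
def distance_to_similarity(distance)
  1.0 - distance.to_f
end

# Simulate the JSON round trip: Symbol keys come back as Strings.
sent = to_wire_metadata('notes/todo.md', { page: 3, heading: 'Intro' })
received = JSON.parse(JSON.generate(sent))
source, metadata = from_wire_metadata(received)

puts source                       # notes/todo.md
puts metadata.keys.inspect        # [:page, :heading]
puts distance_to_similarity(0.25) # 0.75
```

One detail worth noticing in this sketch, and in the diffed `#upsert` it imitates: a user metadata key named `source` would overwrite the reserved entry, since user keys are merged after the reserved one.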
data/lib/pikuri/vector_db/backend/in_memory.rb
ADDED
|
@@ -0,0 +1,149 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
module Pikuri
|
|
4
|
+
module VectorDb
|
|
5
|
+
module Backend
|
|
6
|
+
# Pure-Ruby vector store. The educational default backend;
|
|
7
|
+
# IDEAS.md §"Vector DB / RAG" frames it as the "small enough
|
|
8
|
+
# to audit" first stop the demo + guide walk through before
|
|
9
|
+
# promoting users to +Chroma+ for persistence.
|
|
10
|
+
#
|
|
11
|
+
# == What it does
|
|
12
|
+
#
|
|
13
|
+
# Holds an in-memory Hash from chunk id to +[Chunk, vector]+;
|
|
14
|
+
# +#query+ computes cosine similarity against every stored
|
|
15
|
+
# vector, sorts descending, returns the top-k as
|
|
16
|
+
# +Backend::Result+ instances. O(n) per query, where n is
|
|
17
|
+
# the number of stored chunks. Fine for thousands of chunks
|
|
18
|
+
# (a personal notes folder, a single product's docs); slow
|
|
19
|
+
# for millions (a full corporate knowledge base — that's the
|
|
20
|
+
# +Chroma+ use case).
|
|
21
|
+
#
|
|
22
|
+
# == What it deliberately doesn't do
|
|
23
|
+
#
|
|
24
|
+
# * **No persistence.** RAM-only, intentional — the user who
|
|
25
|
+
# wants persistence picks +Chroma+. Reloads from sources on
|
|
26
|
+
# every boot, which makes the in-memory backend the natural
|
|
27
|
+
# teaching shape: the same code path the demo binary walks
|
|
28
|
+
# on startup is the one the user inspects when they're
|
|
29
|
+
# learning what "indexing" actually means.
|
|
30
|
+
# * **No approximate search.** Exhaustive scan. Approximate
|
|
31
|
+
# nearest neighbor (HNSW, IVF) adds complexity that doesn't
|
|
32
|
+
# teach anything additional once the cosine math is clear.
|
|
33
|
+
# * **No thread safety.** {Indexer} runs single-threaded
|
|
34
|
+
# during a boot or reindex; {Search} calls +#query+ from
|
|
35
|
+
# the agent's main thread. No concurrent access today.
|
|
36
|
+
#
|
|
37
|
+
# == Cosine, not dot product
|
|
38
|
+
#
|
|
39
|
+
# Some embedders return pre-normalized vectors (text-embedding-3,
|
|
40
|
+
# most sentence-transformers); others don't. Cosine normalizes
|
|
41
|
+
# at compute time, so the backend works regardless of whether
|
|
42
|
+
# the embedder did. The readable two-pass form below (compute
|
|
43
|
+
# dot + magnitudes separately) is intentional over the
|
|
44
|
+
# single-loop micro-optimization — this is the file the
|
|
45
|
+
# newcomer reads to understand what's happening.
|
|
46
|
+
class InMemory
|
|
47
|
+
# @return [InMemory]
|
|
48
|
+
def initialize
|
|
49
|
+
# id (String) → [Chunk, vector (Array<Float>)]
|
|
50
|
+
@entries = {}
|
|
51
|
+
# Dimension of every stored vector. +nil+ before the first
|
|
52
|
+
# +#upsert+; locked to the dim of the first vector seen and
|
|
53
|
+
# enforced for every subsequent +#upsert+ + +#query+ — see
|
|
54
|
+
# the Backend protocol's "Vector-dim contract" yardoc.
|
|
55
|
+
@dim = nil
|
|
56
|
+
end
|
|
57
|
+
|
|
58
|
+
# Insert-or-replace by +chunk.id+. Parallel arrays of
|
|
59
|
+
# equal length; raises on empty input or length mismatch.
|
|
60
|
+
# Vector dimension is locked at first upsert; raises on
|
|
61
|
+
# any subsequent vector of a different dim.
|
|
62
|
+
#
|
|
63
|
+
# @param chunks [Array<Chunk>]
|
|
64
|
+
# @param vectors [Array<Array<Float>>]
|
|
65
|
+
# @return [void]
|
|
66
|
+
# @raise [ArgumentError] on empty input, length mismatch,
|
|
67
|
+
# or vector-dim mismatch.
|
|
68
|
+
def upsert(chunks:, vectors:)
|
|
69
|
+
raise ArgumentError, 'upsert called with empty chunks/vectors' if chunks.empty?
|
|
70
|
+
if chunks.size != vectors.size
|
|
71
|
+
raise ArgumentError, "size mismatch: #{chunks.size} chunks vs #{vectors.size} vectors"
|
|
72
|
+
end
|
|
73
|
+
|
|
74
|
+
expected = @dim || vectors.first.size
|
|
75
|
+
vectors.each_with_index do |v, i|
|
|
76
|
+
next if v.size == expected
|
|
77
|
+
|
|
78
|
+
raise ArgumentError, "vector #{i} has dim #{v.size}, expected #{expected}"
|
|
79
|
+
end
|
|
80
|
+
@dim ||= expected
|
|
81
|
+
|
|
82
|
+
chunks.zip(vectors).each { |chunk, vector| @entries[chunk.id] = [chunk, vector] }
|
|
83
|
+
nil
|
|
84
|
+
end
|
|
85
|
+
|
|
86
|
+
# Cosine-similarity nearest neighbor search. Returns the
|
|
87
|
+
# top-k {Backend::Result}s in descending score order;
|
|
88
|
+
# empty array when the store has no entries.
|
|
89
|
+
#
|
|
90
|
+
# @param vector [Array<Float>] query vector; must match
|
|
91
|
+
# the stored vector dim.
|
|
92
|
+
# @param top_k [Integer] number of results to return;
|
|
93
|
+
# must be positive.
|
|
94
|
+
# @return [Array<Backend::Result>]
|
|
95
|
+
# @raise [ArgumentError] on +top_k+ <= 0 or query-vector
|
|
96
|
+
# dim mismatch.
|
|
97
|
+
def query(vector:, top_k:)
|
|
98
|
+
raise ArgumentError, "top_k must be positive (got #{top_k})" if top_k <= 0
|
|
99
|
+
return [] if @entries.empty?
|
|
100
|
+
|
|
101
|
+
if vector.size != @dim
|
|
102
|
+
raise ArgumentError, "query vector dim #{vector.size}, stored dim #{@dim}"
|
|
103
|
+
end
|
|
104
|
+
|
|
105
|
+
scored = @entries.values.map do |chunk, stored|
|
|
106
|
+
Result.new(chunk: chunk, score: cosine(vector, stored))
|
|
107
|
+
end
|
|
108
|
+
scored.sort_by { |r| -r.score }.first(top_k)
|
|
109
|
+
end
|
|
110
|
+
|
|
111
|
+
# Drop every stored chunk. Used by the v1 nuke-and-reload
|
|
112
|
+
# reindex flow; the embedder dim lock is also released so
|
|
113
|
+
# a reindex with a different embedder model starts clean.
|
|
114
|
+
#
|
|
115
|
+
# @return [void]
|
|
116
|
+
def delete_all
|
|
117
|
+
@entries.clear
|
|
118
|
+
@dim = nil
|
|
119
|
+
nil
|
|
120
|
+
end
|
|
121
|
+
|
|
122
|
+
# @return [Integer] current chunk count.
|
|
123
|
+
def count
|
|
124
|
+
@entries.size
|
|
125
|
+
end
|
|
126
|
+
|
|
127
|
+
private
|
|
128
|
+
|
|
129
|
+
# Cosine similarity. Two-pass form is the readable shape;
|
|
130
|
+
# micro-optimizing into a single loop saves an array
|
|
131
|
+
# traversal but obscures what's happening for the reader
|
|
132
|
+
# who's here to learn how a vector store works.
|
|
133
|
+
#
|
|
134
|
+
# @param a [Array<Float>]
|
|
135
|
+
# @param b [Array<Float>]
|
|
136
|
+
# @return [Float] cosine in +[-1.0, 1.0]+, or +0.0+ if
|
|
137
|
+
# either vector is zero (degenerate but valid input).
|
|
138
|
+
def cosine(a, b)
|
|
139
|
+
dot = a.zip(b).sum { |x, y| x * y }
|
|
140
|
+
mag_a = Math.sqrt(a.sum { |x| x * x })
|
|
141
|
+
mag_b = Math.sqrt(b.sum { |x| x * x })
|
|
142
|
+
return 0.0 if mag_a.zero? || mag_b.zero?
|
|
143
|
+
|
|
144
|
+
dot / (mag_a * mag_b)
|
|
145
|
+
end
|
|
146
|
+
end
|
|
147
|
+
end
|
|
148
|
+
end
|
|
149
|
+
end
|
|
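Reviewer sketch (not part of the package): the file above combines three ideas, a dimension lock set by the first upsert, an exhaustive cosine scan, and a descending sort truncated to `top_k`. The condensed standalone version below shows the same technique. `TinyStore` is a hypothetical name, it stores bare ids instead of the gem's `Chunk` objects, and it returns `[id, score]` pairs rather than `Backend::Result`.

```ruby
# Standalone sketch; TinyStore is a hypothetical name, and plain String
# ids stand in for the gem's Chunk objects.
class TinyStore
  def initialize
    @entries = {} # id => vector
    @dim = nil    # locked to the size of the first vector seen
  end

  def upsert(id, vector)
    @dim ||= vector.size
    raise ArgumentError, "dim #{vector.size}, expected #{@dim}" if vector.size != @dim

    @entries[id] = vector
  end

  # Exhaustive scan: score every stored vector, sort descending, keep top_k.
  def query(vector, top_k)
    raise ArgumentError, 'top_k must be positive' if top_k <= 0
    return [] if @entries.empty?
    raise ArgumentError, 'query dim mismatch' if vector.size != @dim

    @entries.map { |id, stored| [id, cosine(vector, stored)] }
            .sort_by { |_, score| -score }
            .first(top_k)
  end

  private

  # Two-pass cosine: dot product first, then the two magnitudes.
  def cosine(a, b)
    dot = a.zip(b).sum { |x, y| x * y }
    mag_a = Math.sqrt(a.sum { |x| x * x })
    mag_b = Math.sqrt(b.sum { |x| x * x })
    return 0.0 if mag_a.zero? || mag_b.zero?

    dot / (mag_a * mag_b)
  end
end

store = TinyStore.new
store.upsert('a', [1.0, 0.0])
store.upsert('b', [0.0, 1.0])
store.upsert('c', [1.0, 1.0])

hits = store.query([1.0, 0.0], 2)
puts hits.map(&:first).inspect # ["a", "c"]
```

The query vector points along the first axis, so `a` scores 1.0, `c` scores about 0.707, and `b` scores 0.0 and falls outside the top two.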
data/lib/pikuri/vector_db/backend/result.rb
ADDED
|
@@ -0,0 +1,46 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
module Pikuri
|
|
4
|
+
module VectorDb
|
|
5
|
+
module Backend
|
|
6
|
+
# A single query hit: the {Chunk} that matched + its cosine
|
|
7
|
+
# similarity score. Returned in descending-score order from
|
|
8
|
+
# any backend's +#query+ method.
|
|
9
|
+
#
|
|
10
|
+
# == Fields
|
|
11
|
+
#
|
|
12
|
+
# * +chunk+: the stored {Chunk} (id, source, text, metadata).
|
|
13
|
+
# * +score+: cosine similarity as +Float+, range
|
|
14
|
+
# +[-1.0, 1.0]+. For typical sentence-embedder vectors,
|
|
15
|
+
# scores between natural-language texts tend to stay in
|
|
16
|
+
# +[0.0, 1.0]+ in practice, though negative values are
|
|
17
|
+
# possible. 1.0 means identical direction; 0.0 is
|
|
18
|
+
# orthogonal.
|
|
19
|
+
#
|
|
20
|
+
# == Citation
|
|
21
|
+
#
|
|
22
|
+
# {Search} formats hits for the LLM as
|
|
23
|
+
# +result.chunk.source+ (the citation — relative path, URL,
|
|
24
|
+
# or doc ID — see {Chunk}'s +source+ field) +
|
|
25
|
+
# +result.chunk.text+ (the snippet) + the score. The +id+
|
|
26
|
+
# field is opaque and never surfaced; the +metadata+ Hash
|
|
27
|
+
# carries optional extras like offset / page / anchor for
|
|
28
|
+
# callers that want to deep-link.
|
|
29
|
+
#
|
|
30
|
+
# The +score+ is cosine *as returned by a backend's
|
|
31
|
+
# +#query+*. {Search} substitutes the reranker's relevance
|
|
32
|
+
# score here (via +Data#with+) before formatting when a
|
|
33
|
+
# reranker reorders the candidates, so a surfaced score is
|
|
34
|
+
# cosine only in vector-only mode — see {Search}'s
|
|
35
|
+
# "Which score is shown" section.
|
|
36
|
+
#
|
|
37
|
+
# == Why a Data.define
|
|
38
|
+
#
|
|
39
|
+
# Same convention as {Chunk}. A tuple +[chunk, score]+ or
|
|
40
|
+
# a +{chunk:, score:}+ Hash would work at runtime but
|
|
41
|
+
# +result.chunk+ / +result.score+ reads cleaner at call
|
|
42
|
+
# sites, and value-equality is occasionally useful in specs.
|
|
43
|
+
Result = Data.define(:chunk, :score)
|
|
44
|
+
end
|
|
45
|
+
end
|
|
46
|
+
end
|
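Reviewer sketch (not part of the package): the comment above mentions value equality and substituting a reranker score via `Data#with`. The block below shows both behaviours on stand-in types. `SketchChunk` and `SketchResult` are hypothetical names whose fields are inferred from the `Chunk.new(id:, source:, text:, metadata:)` call in the Chroma backend, and the scores are made-up values. `Data.define` requires Ruby 3.2 or newer.

```ruby
# Requires Ruby 3.2+ (Data.define). SketchChunk and SketchResult are
# hypothetical stand-ins for the gem's Chunk and Backend::Result.
SketchChunk = Data.define(:id, :source, :text, :metadata)
SketchResult = Data.define(:chunk, :score)

chunk = SketchChunk.new(id: 'c1', source: 'notes/a.md', text: 'hello', metadata: {})
by_cosine = SketchResult.new(chunk: chunk, score: 0.82)

# Data instances are immutable; #with returns a copy with one member
# replaced, which is how a reranker score could stand in for cosine.
reranked = by_cosine.with(score: 0.97)

puts by_cosine.score                        # 0.82 (original is untouched)
puts reranked.score                         # 0.97
puts reranked.chunk.equal?(by_cosine.chunk) # true: the chunk is shared, not copied
puts by_cosine == SketchResult.new(chunk: chunk, score: 0.82) # true: value equality
```

The copy-on-change style means a formatted hit can carry a different score without mutating what the backend returned, at the cost of one small allocation per hit.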