robot_lab-document_store 0.1.0 → 0.1.1
- checksums.yaml +4 -4
- data/.rubocop.yml +173 -0
- data/CHANGELOG.md +19 -0
- data/README.md +1 -1
- data/Rakefile +111 -3
- data/docs/api_reference.md +186 -0
- data/docs/assets/architecture.svg +140 -0
- data/docs/getting_started.md +106 -0
- data/docs/how_it_works.md +141 -0
- data/docs/index.md +24 -41
- data/docs/pluggable_backends_design.md +66 -0
- data/docs/rag_patterns.md +198 -0
- data/examples/{26_document_store.rb → 01_basic_usage.rb} +13 -9
- data/lib/robot_lab/document_store/version.rb +1 -1
- data/lib/robot_lab/document_store.rb +111 -18
- data/mkdocs.yml +5 -0
- metadata +14 -7
- /data/examples/{26_document_store → 01_basic_usage}/api_versioning_adr.md +0 -0
- /data/examples/{26_document_store → 01_basic_usage}/incident_postmortem.md +0 -0
- /data/examples/{26_document_store → 01_basic_usage}/postgres_runbook.md +0 -0
- /data/examples/{26_document_store → 01_basic_usage}/redis_caching_guide.md +0 -0
- /data/examples/{26_document_store → 01_basic_usage}/sidekiq_guide.md +0 -0
|
@@ -0,0 +1,106 @@
|
|
|
1
|
+
# Getting Started
|
|
2
|
+
|
|
3
|
+
## Prerequisites
|
|
4
|
+
|
|
5
|
+
- Ruby 3.1+
|
|
6
|
+
- **fastembed** (recommended) — requires a platform that can run ONNX Runtime (x86_64 and ARM64 macOS/Linux). On first use the ~23 MB `BAAI/bge-small-en-v1.5` model file is downloaded and cached in `~/.cache/fastembed`.
|
|
7
|
+
- Without fastembed the store still works using the built-in TF-IDF fallback (see [Fallback Mode](#fallback-mode) below).
|
|
8
|
+
|
|
9
|
+
## Installation
|
|
10
|
+
|
|
11
|
+
Add to your `Gemfile`:
|
|
12
|
+
|
|
13
|
+
```ruby
|
|
14
|
+
gem "robot_lab-document_store"
|
|
15
|
+
```
|
|
16
|
+
|
|
17
|
+
Then:
|
|
18
|
+
|
|
19
|
+
```sh
|
|
20
|
+
bundle install
|
|
21
|
+
```
|
|
22
|
+
|
|
23
|
+
Or install directly:
|
|
24
|
+
|
|
25
|
+
```sh
|
|
26
|
+
gem install robot_lab-document_store
|
|
27
|
+
```
|
|
28
|
+
|
|
29
|
+
### Installing fastembed
|
|
30
|
+
|
|
31
|
+
fastembed is an optional dependency. Install it to get semantic search instead of the lexical fallback:
|
|
32
|
+
|
|
33
|
+
```ruby
|
|
34
|
+
# Gemfile
|
|
35
|
+
gem "fastembed"
|
|
36
|
+
```
|
|
37
|
+
|
|
38
|
+
```sh
|
|
39
|
+
bundle install
|
|
40
|
+
```
|
|
41
|
+
|
|
42
|
+
The ONNX model is downloaded on the first embed call and cached locally. Subsequent starts reuse the cache — no download needed.
|
|
43
|
+
|
|
44
|
+
## First Run
|
|
45
|
+
|
|
46
|
+
```ruby
|
|
47
|
+
require "robot_lab/document_store"
|
|
48
|
+
|
|
49
|
+
store = RobotLab::DocumentStore.new
|
|
50
|
+
|
|
51
|
+
# Store documents — embedding happens here (model downloads if needed)
|
|
52
|
+
store.store(:ruby_intro, "Ruby is a dynamic, open source programming language.")
|
|
53
|
+
store.store(:python_intro, "Python is widely used in data science and AI.")
|
|
54
|
+
store.store(:js_intro, "JavaScript runs in the browser and powers Node.js.")
|
|
55
|
+
|
|
56
|
+
# Search — returns Array of { key:, text:, score: } hashes
|
|
57
|
+
results = store.search("Which language is used for machine learning?", limit: 2)
|
|
58
|
+
|
|
59
|
+
results.each do |r|
|
|
60
|
+
puts "#{r[:key]} (#{r[:score].round(3)})"
|
|
61
|
+
end
|
|
62
|
+
# => python_intro (0.871)
|
|
63
|
+
# => ruby_intro (0.634)
|
|
64
|
+
```
|
|
65
|
+
|
|
66
|
+
!!! note "Model download on first run"
|
|
67
|
+
The first call to `store` or `search` triggers a ~23 MB ONNX model download.
|
|
68
|
+
All subsequent runs reuse the cached model. Set `FASTEMBED_CACHE_PATH` to control
|
|
69
|
+
the cache location.
|
|
70
|
+
|
|
71
|
+
## Fallback Mode
|
|
72
|
+
|
|
73
|
+
When `fastembed` is not installed, `DocumentStore` automatically switches to a
|
|
74
|
+
TF-IDF word-frequency embedder. No configuration needed — the switch is silent.
|
|
75
|
+
|
|
76
|
+
```text
|
|
77
|
+
fastembed installed?
|
|
78
|
+
YES → dense vector embeddings via BAAI/bge-small-en-v1.5
|
|
79
|
+
NO → sparse TF-IDF bag-of-words with Porter-style stemming
|
|
80
|
+
```
|
|
81
|
+
|
|
82
|
+
The fallback is lower quality — it relies on lexical overlap rather than semantic
|
|
83
|
+
understanding — but it works offline with no model downloads, making it well-suited
|
|
84
|
+
for development, CI, and test environments.
|
|
85
|
+
|
|
86
|
+
```ruby
|
|
87
|
+
# Works identically regardless of whether fastembed is installed
|
|
88
|
+
store = RobotLab::DocumentStore.new
|
|
89
|
+
store.store(:doc, "Ruby programming language")
|
|
90
|
+
store.search("Ruby development", limit: 1)
|
|
91
|
+
```
|
|
92
|
+
|
|
93
|
+
## Running the Example
|
|
94
|
+
|
|
95
|
+
The gem ships with a self-contained example that demonstrates all core features:
|
|
96
|
+
|
|
97
|
+
```sh
|
|
98
|
+
bundle exec ruby examples/01_basic_usage.rb
|
|
99
|
+
```
|
|
100
|
+
|
|
101
|
+
This loads five engineering runbook documents, runs several semantic queries,
|
|
102
|
+
demonstrates deletion, and shows the RobotLab Memory integration.
|
|
103
|
+
|
|
104
|
+
!!! note
|
|
105
|
+
The example requires `robot_lab` core for the Memory integration section.
|
|
106
|
+
The standalone `DocumentStore` section runs without it.
|
|
@@ -0,0 +1,141 @@
|
|
|
1
|
+
# How It Works
|
|
2
|
+
|
|
3
|
+
## Architecture Overview
|
|
4
|
+
|
|
5
|
+

|
|
6
|
+
|
|
7
|
+
Two operations drive the store: **`store`** (embed and save) and **`search`** (embed and rank). Both paths share the same embedding backend — fastembed when available, TF-IDF otherwise.
|
|
8
|
+
|
|
9
|
+
---
|
|
10
|
+
|
|
11
|
+
## Embedding Backend Selection
|
|
12
|
+
|
|
13
|
+
On load, `DocumentStore` attempts to `require 'fastembed'`. The result is captured in `FASTEMBED_AVAILABLE` and drives every subsequent embed call. There is no runtime switching: because the result is stored in a constant, the backend is fixed for the lifetime of the process.
|
|
14
|
+
|
|
15
|
+
```
|
|
16
|
+
require 'fastembed'
|
|
17
|
+
→ success : FASTEMBED_AVAILABLE = true → dense vector path
|
|
18
|
+
→ LoadError: FASTEMBED_AVAILABLE = false → TF-IDF fallback path
|
|
19
|
+
```
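The probe above can be sketched in a few lines of Ruby. This is a minimal illustration, not the gem's source: the `BackendProbe` module and its `backend` helper are hypothetical names, and only the `require`/`rescue LoadError` pattern is taken from the description.

```ruby
# Hypothetical module; the gem's real namespace and constant location may differ.
module BackendProbe
  # Evaluated once, when the file is loaded.
  FASTEMBED_AVAILABLE =
    begin
      require "fastembed"
      true
    rescue LoadError
      false
    end

  # Reports which path later embed calls would take.
  def self.backend
    FASTEMBED_AVAILABLE ? :fastembed : :tfidf
  end
end

puts BackendProbe.backend
```

Because the result lives in a constant, installing fastembed later has no effect until the process restarts.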
|
|
20
|
+
|
|
21
|
+
---
|
|
22
|
+
|
|
23
|
+
## Fastembed Path — Dense Vectors
|
|
24
|
+
|
|
25
|
+
When fastembed is available, the store uses [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) by default — a 33M parameter English bi-encoder that produces 384-dimensional float vectors.
|
|
26
|
+
|
|
27
|
+
### Asymmetric Embedding
|
|
28
|
+
|
|
29
|
+
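The store embeds passages (stored documents) and queries (search terms) through **separate calls**, `passage_embed` and `query_embed`. BGE is a single shared encoder, so the asymmetry comes from how each input is prepared (for example, a retrieval instruction or prefix added to queries), not from two different networks. The intent is deliberate: the passage side should capture what a document *contains*, while the query side should capture what a user *wants*. Treating both sides identically can degrade recall on semantically paraphrased queries.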
|
|
30
|
+
|
|
31
|
+
| Call | Embedding call | Purpose |
|
|
32
|
+
|------|-------------|---------|
|
|
33
|
+
| `store(key, text)` | `passage_embed` | Captures document content |
|
|
34
|
+
| `search(query)` | `query_embed` | Captures search intent |
|
|
35
|
+
|
|
36
|
+
### Cosine Similarity
|
|
37
|
+
|
|
38
|
+
Given a query vector **q** and a stored passage vector **p**, similarity is:
|
|
39
|
+
|
|
40
|
+
```
|
|
41
|
+
similarity = dot(q, p) / (‖q‖ · ‖p‖)
|
|
42
|
+
```
|
|
43
|
+
|
|
44
|
+
This is the cosine of the angle between the vectors. A value of `1.0` means identical
|
|
45
|
+
direction (maximum similarity); `0.0` means orthogonal (no relationship). The
|
|
46
|
+
BGE model normalises its output to unit length, so the division reduces to a simple
|
|
47
|
+
dot product — but `DocumentStore` performs explicit normalisation anyway to remain
|
|
48
|
+
correct regardless of the model used.
|
|
49
|
+
|
|
50
|
+
**Defensive guards** return `0.0` for nil vectors, empty vectors, or length mismatches.
|
|
51
|
+
These protect against partial or corrupted embed results without raising exceptions.
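A minimal sketch of the formula and the guards described above, for dense vectors held as Arrays of Floats. The method name `cosine_similarity` and the extra zero-norm guard are assumptions for illustration; the gem's own implementation may be organised differently.

```ruby
# Cosine similarity for dense vectors (Arrays of Floats).
# Returns 0.0 instead of raising when the inputs are unusable.
def cosine_similarity(q, p)
  return 0.0 if q.nil? || p.nil?
  return 0.0 if q.empty? || p.empty?
  return 0.0 unless q.length == p.length

  dot    = q.zip(p).sum { |a, b| a * b }
  norm_q = Math.sqrt(q.sum { |a| a * a })
  norm_p = Math.sqrt(p.sum { |b| b * b })
  # Assumed extra guard: an all-zero vector has no direction.
  return 0.0 if norm_q.zero? || norm_p.zero?

  dot / (norm_q * norm_p)
end

cosine_similarity([1.0, 0.0], [1.0, 0.0]) # => 1.0 (identical direction)
cosine_similarity([1.0, 0.0], [0.0, 1.0]) # => 0.0 (orthogonal)
cosine_similarity([1.0, 0.0], [1.0])      # => 0.0 (length mismatch)
```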
|
|
52
|
+
|
|
53
|
+
---
|
|
54
|
+
|
|
55
|
+
## TF-IDF Fallback Path — Sparse Vectors
|
|
56
|
+
|
|
57
|
+
When fastembed is unavailable, each document is converted to a sparse
|
|
58
|
+
`Hash{String => Float}` where keys are stemmed terms and values are L2-normalised
|
|
59
|
+
term frequencies.
|
|
60
|
+
|
|
61
|
+
### Processing Pipeline
|
|
62
|
+
|
|
63
|
+
```
|
|
64
|
+
raw text
|
|
65
|
+
→ downcase
|
|
66
|
+
→ tokenise /[a-z]+/ (ASCII only — Unicode letters are dropped)
|
|
67
|
+
→ remove STOP_WORDS (38 common English words: a, an, the, is, are, …)
|
|
68
|
+
→ Porter-style stem (strips: -ies, -ness, -ment, -tion, -ing, -ed, -er, -ly, -s)
|
|
69
|
+
→ count term frequencies
|
|
70
|
+
→ L2-normalise counts (divide each count by the Euclidean norm of the count vector)
|
|
71
|
+
```
|
|
72
|
+
|
|
73
|
+
The result is a unit-length sparse vector. Similarity between two sparse vectors
|
|
74
|
+
uses the same cosine formula, but computed efficiently by iterating only over the
|
|
75
|
+
keys present in the smaller vector.
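The pipeline and the sparse similarity can be sketched as follows. This is an illustration under stated assumptions: `STOP_WORDS_SKETCH` is a five-word subset rather than the gem's 38-word list, the stemming step is left out to keep the example short, and the method names are hypothetical.

```ruby
# Illustrative subset only; the real STOP_WORDS list has 38 entries.
STOP_WORDS_SKETCH = %w[a an the is are].freeze

# Text -> unit-length sparse vector (Hash{String => Float}). Stemming omitted.
def sparse_vector(text)
  terms  = text.downcase.scan(/[a-z]+/) - STOP_WORDS_SKETCH
  counts = terms.tally
  norm   = Math.sqrt(counts.values.sum { |c| c * c })
  return {} if norm.zero?

  counts.transform_values { |c| c / norm }
end

# Cosine of two unit-length sparse vectors: iterate the smaller hash only,
# because terms missing from either side contribute nothing to the dot product.
def sparse_cosine(a, b)
  small, large = a.size <= b.size ? [a, b] : [b, a]
  small.sum(0.0) { |term, weight| weight * large.fetch(term, 0.0) }
end

a = sparse_vector("The pool is exhausted")
b = sparse_vector("exhausted pool")
puts sparse_cosine(a, b).round(3)
```

The two sample strings contain the same terms in a different order, so they score as effectively identical, which is the order-insensitivity of a bag-of-words representation.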
|
|
76
|
+
|
|
77
|
+
### Stemmer
|
|
78
|
+
|
|
79
|
+
The stemmer applies suffix rules in priority order and stops at the first match:
|
|
80
|
+
|
|
81
|
+
| Rule | Example |
|
|
82
|
+
|------|---------|
|
|
83
|
+
| `-ies` → `-y` | `activities` → `activiti` |
|
|
84
|
+
| `-ness` → `` | `darkness` → `dark` |
|
|
85
|
+
| `-ment` → `` | `development` → `develop` |
|
|
86
|
+
| `-tion` → `` | `configuration` → `configura` |
|
|
87
|
+
| `-ing` → `` | `running` → `runn` |
|
|
88
|
+
| `-ed` → `` | `deployed` → `deploy` |
|
|
89
|
+
| `-er` → `` | `server` → `serv` |
|
|
90
|
+
| `-ly` → `` | `quickly` → `quick` |
|
|
91
|
+
| `-s` → `` | `robots` → `robot` |
|
|
92
|
+
|
|
93
|
+
This is intentionally simple — not a full Porter stemmer. Its purpose is to improve
|
|
94
|
+
lexical recall during development and testing, not to rival production semantic search.
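A first-match suffix stemmer of this kind fits in a few lines. The sketch below follows the rule column of the table in the stated priority order; `STEM_RULES` and `stem` are hypothetical names, and any extra safeguards the gem may apply (such as a minimum stem length) are not modelled.

```ruby
# Suffix rules in priority order; the first matching rule wins.
STEM_RULES = [
  [/ies\z/,  "y"],
  [/ness\z/, ""],
  [/ment\z/, ""],
  [/tion\z/, ""],
  [/ing\z/,  ""],
  [/ed\z/,   ""],
  [/er\z/,   ""],
  [/ly\z/,   ""],
  [/s\z/,    ""]
].freeze

def stem(word)
  STEM_RULES.each do |pattern, replacement|
    return word.sub(pattern, replacement) if word.match?(pattern)
  end
  word # no rule matched; return the word unchanged
end

stem("darkness") # => "dark"
stem("running")  # => "runn"
stem("robots")   # => "robot"
stem("ruby")     # => "ruby"
```

Stopping at the first match keeps the behaviour predictable: `darkness` loses `-ness` and is never passed on to the `-s` rule.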
|
|
95
|
+
|
|
96
|
+
### Limitations of the Fallback
|
|
97
|
+
|
|
98
|
+
- **No semantic understanding** — synonyms, paraphrases, and cross-lingual queries will not match
|
|
99
|
+
- **ASCII only** — non-ASCII characters (accented letters, CJK, emoji) are silently dropped
|
|
100
|
+
- **Order-insensitive** — "connection pool exhausted" and "exhausted pool connection" score identically
|
|
101
|
+
|
|
102
|
+
For any production use case, install fastembed.
|
|
103
|
+
|
|
104
|
+
---
|
|
105
|
+
|
|
106
|
+
## Thread Safety
|
|
107
|
+
|
|
108
|
+
All public methods acquire an internal `Mutex` before touching `@documents`.
|
|
109
|
+
Embedding (which can be slow — tens of milliseconds for the first call) happens
|
|
110
|
+
**outside** the lock to avoid blocking concurrent readers.
|
|
111
|
+
|
|
112
|
+
```ruby
|
|
113
|
+
def store(key, text)
|
|
114
|
+
key = key.to_sym
|
|
115
|
+
vector = passage_vector(text) # ← compute outside the lock
|
|
116
|
+
@mutex.synchronize do
|
|
117
|
+
@documents[key] = { text:, vector: } # ← write inside the lock
|
|
118
|
+
end
|
|
119
|
+
self
|
|
120
|
+
end
|
|
121
|
+
```
|
|
122
|
+
|
|
123
|
+
This means multiple threads can embed documents concurrently. The lock only
|
|
124
|
+
serialises the final hash write and all reads.
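The effect is easiest to see with several writer threads. The sketch below uses a stand-in class so that it runs without the gem; `TinyStore` and `slow_embed` are hypothetical, and the sleep merely simulates embedding latency.

```ruby
# Stand-in for the locking pattern; not the gem's class.
class TinyStore
  def initialize
    @mutex     = Mutex.new
    @documents = {}
  end

  def store(key, text)
    vector = slow_embed(text) # computed outside the lock
    @mutex.synchronize { @documents[key.to_sym] = { text: text, vector: vector } }
    self
  end

  def size
    @mutex.synchronize { @documents.size }
  end

  private

  # Pretend embedding: sleeps briefly, then returns a one-element vector.
  def slow_embed(text)
    sleep 0.01
    [text.length.to_f]
  end
end

store   = TinyStore.new
threads = 8.times.map { |i| Thread.new { store.store("doc_#{i}", "text #{i}") } }
threads.each(&:join)
puts store.size # 8 documents, one per thread
```

Each thread holds the lock only for the hash assignment, so the simulated embedding delays overlap instead of queueing behind one another.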
|
|
125
|
+
|
|
126
|
+
---
|
|
127
|
+
|
|
128
|
+
## RobotLab Extension Registration
|
|
129
|
+
|
|
130
|
+
The bottom of the library file contains:
|
|
131
|
+
|
|
132
|
+
```ruby
|
|
133
|
+
if defined?(RobotLab) && RobotLab.respond_to?(:register_extension)
|
|
134
|
+
RobotLab.register_extension(:document_store, RobotLab::DocumentStore)
|
|
135
|
+
end
|
|
136
|
+
```
|
|
137
|
+
|
|
138
|
+
This registers the class with robot_lab core when both gems are loaded together,
|
|
139
|
+
enabling `Memory#store_document`, `Memory#search_documents`, and `Memory#document_keys`
|
|
140
|
+
on any `RobotLab::Memory` instance. The guard makes the file safe to `require`
|
|
141
|
+
standalone without robot_lab present.
|
data/docs/index.md
CHANGED
|
@@ -2,57 +2,40 @@
|
|
|
2
2
|
|
|
3
3
|
Embedding-based semantic document search for the [RobotLab](https://github.com/MadBomber/robot_lab) LLM agent framework.
|
|
4
4
|
|
|
5
|
-
|
|
6
|
-
|
|
5
|
+
!!! warning "Under active development"
|
|
6
|
+
APIs may change without notice until v1.0.
|
|
7
7
|
|
|
8
|
-
|
|
9
|
-
|
|
10
|
-
`RobotLab::DocumentStore` is a thread-safe, in-memory vector store backed by [fastembed](https://github.com/Anush008/fastembed-ruby) embeddings and cosine similarity search. It supports:
|
|
11
|
-
|
|
12
|
-
- **`store(key, text)`** — embed and store a document under a symbol key
|
|
13
|
-
- **`search(query, limit:)`** — return the top-N most similar documents by cosine similarity
|
|
14
|
-
- **`delete(key)`** / **`clear`** — remove individual entries or wipe the store
|
|
15
|
-
- **Asymmetric embedding** — passage embeddings for storage, query embeddings for retrieval
|
|
16
|
-
|
|
17
|
-
## Installation
|
|
18
|
-
|
|
19
|
-
Add to your Gemfile:
|
|
20
|
-
|
|
21
|
-
```ruby
|
|
22
|
-
gem "robot_lab-document_store"
|
|
23
|
-
```
|
|
24
|
-
|
|
25
|
-
## Quick Example
|
|
8
|
+
`RobotLab::DocumentStore` is a thread-safe, in-memory vector store. Store arbitrary text documents, then retrieve the most relevant ones by natural-language query; with fastembed installed, no keyword overlap is required.
|
|
26
9
|
|
|
27
10
|
```ruby
|
|
28
11
|
require "robot_lab/document_store"
|
|
29
12
|
|
|
30
13
|
store = RobotLab::DocumentStore.new
|
|
31
14
|
|
|
32
|
-
store.store(:
|
|
33
|
-
store.store(:
|
|
34
|
-
store.store(:
|
|
15
|
+
store.store(:sidekiq_guide, File.read("docs/sidekiq.md"))
|
|
16
|
+
store.store(:postgres_guide, File.read("docs/postgres.md"))
|
|
17
|
+
store.store(:incident_report, File.read("docs/outage_2024.md"))
|
|
35
18
|
|
|
36
|
-
|
|
37
|
-
|
|
38
|
-
|
|
39
|
-
|
|
40
|
-
# => beta (score: 0.872)
|
|
41
|
-
# => alpha (score: 0.641)
|
|
19
|
+
hits = store.search("Jobs keep piling up when Stripe is down", limit: 2)
|
|
20
|
+
hits.each { |r| puts "#{r[:key]} score=#{r[:score].round(3)}" }
|
|
21
|
+
# => sidekiq_guide score=0.847
|
|
22
|
+
# => incident_report score=0.612
|
|
42
23
|
```
|
|
43
24
|
|
|
44
|
-
##
|
|
45
|
-
|
|
46
|
-
```ruby
|
|
47
|
-
store = RobotLab::DocumentStore.new(
|
|
48
|
-
model_name: "BAAI/bge-small-en-v1.5"
|
|
49
|
-
)
|
|
50
|
-
```
|
|
25
|
+
## Features
|
|
51
26
|
|
|
52
|
-
|
|
27
|
+
| Feature | Detail |
|
|
28
|
+
|---------|--------|
|
|
29
|
+
| **Semantic search** | Cosine similarity over dense vector embeddings |
|
|
30
|
+
| **Asymmetric embedding** | Separate passage/query embeddings for higher recall |
|
|
31
|
+
| **Thread-safe** | Internal `Mutex`; safe for multi-threaded servers such as Puma and Sidekiq |
|
|
32
|
+
| **Zero-config fallback** | TF-IDF word-frequency search when fastembed is unavailable |
|
|
33
|
+
| **Lazy model init** | ONNX model downloads on first use, cached locally |
|
|
34
|
+
| **RobotLab integration** | Drop-in via `Memory#store_document` / `Memory#search_documents` |
|
|
53
35
|
|
|
54
|
-
##
|
|
36
|
+
## Navigation
|
|
55
37
|
|
|
56
|
-
- [
|
|
57
|
-
- [
|
|
58
|
-
- [
|
|
38
|
+
- [Getting Started](getting_started.md) — install, first run, fallback mode
|
|
39
|
+
- [API Reference](api_reference.md) — every public method documented
|
|
40
|
+
- [How It Works](how_it_works.md) — embeddings, cosine similarity, TF-IDF fallback
|
|
41
|
+
- [RAG Patterns](rag_patterns.md) — retrieval-augmented generation recipes
|
|
@@ -0,0 +1,66 @@
|
|
|
1
|
+
# Pluggable Backend Architecture — Design Discussion
|
|
2
|
+
|
|
3
|
+
**Date:** 2026-05-14
|
|
4
|
+
**Status:** Parked — resume when time allows
|
|
5
|
+
|
|
6
|
+
## The Vision
|
|
7
|
+
|
|
8
|
+
`DocumentStore` should become an **abstract interface** that defines the storage contract but does not implement any backend itself. The gem ships with two concrete backends; applications supply their own.
|
|
9
|
+
|
|
10
|
+
### Backends shipping with the gem
|
|
11
|
+
|
|
12
|
+
| Class | Description |
|
|
13
|
+
|---|---|
|
|
14
|
+
| `DocumentStore::Memory` | In-memory store — essentially the current implementation (fastembed/TF-IDF). No persistence. |
|
|
15
|
+
| `DocumentStore::FileSystem` | File-backed store — YAML persistence, similar to `robot_lab-durable`'s current `Store` class. |
|
|
16
|
+
|
|
17
|
+
### Backends supplied by applications
|
|
18
|
+
|
|
19
|
+
Applications implement their own adapters (same interface) for:
|
|
20
|
+
- `DocumentStore::Redis`
|
|
21
|
+
- `DocumentStore::Database`
|
|
22
|
+
- Any other backend
|
|
23
|
+
|
|
24
|
+
All backend code lives in the application, not in this gem.
|
|
25
|
+
|
|
26
|
+
## Motivation
|
|
27
|
+
|
|
28
|
+
`robot_lab-durable` currently maintains its own `Store` class (YAML file-backed with file locking). Once `DocumentStore::FileSystem` exists, durable can drop its custom storage layer and delegate to it — keeping only its durable-specific concerns: `Entry` (confidence scoring), `Reflector` (session reflection), `Learning` (Robot mixin), and the LLM tools.
|
|
29
|
+
|
|
30
|
+
## Interface Contract
|
|
31
|
+
|
|
32
|
+
All backends must implement:
|
|
33
|
+
|
|
34
|
+
```ruby
|
|
35
|
+
store(key, text) # embed and persist a document under key
|
|
36
|
+
search(query, limit:) # return Array<Hash> sorted by score descending
|
|
37
|
+
# each Hash: { key:, text:, score: }
|
|
38
|
+
delete(key) # remove document by key; return self
|
|
39
|
+
clear # remove all documents; return self
|
|
40
|
+
size # Integer
|
|
41
|
+
keys # Array<Symbol>
|
|
42
|
+
empty? # Boolean
|
|
43
|
+
```
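One way the contract could be expressed is an abstract base class whose methods raise until a backend overrides them. Everything in this sketch is hypothetical: the design is parked, `AbstractDocumentStore` and `HashBackend` are illustrative names, and `semantic?` is only one of the options raised under Open Questions.

```ruby
# Hypothetical base class; nothing here ships with the gem today.
class AbstractDocumentStore
  INTERFACE = %i[store search delete clear size keys empty?].freeze

  # Generate one stub per interface method.
  INTERFACE.each do |name|
    define_method(name) do |*_args, **_opts|
      raise NotImplementedError, "#{self.class} must implement ##{name}"
    end
  end

  # Possible capability predicate; backends with embeddings would return true.
  def semantic?
    false
  end
end

# Toy backend that overrides only part of the contract.
class HashBackend < AbstractDocumentStore
  def initialize
    super
    @docs = {}
  end

  def store(key, text)
    @docs[key.to_sym] = text
    self
  end

  def keys
    @docs.keys
  end

  def size
    @docs.size
  end
end

backend = HashBackend.new.store(:runbook, "Restart the worker pool.")
puts backend.keys.inspect
```

Methods the backend has not overridden, such as `search`, still raise. `NotImplementedError` descends from `ScriptError` rather than `StandardError`, so a bare `rescue` will not catch it; callers need to rescue it by name.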
|
|
44
|
+
|
|
45
|
+
## Open Questions
|
|
46
|
+
|
|
47
|
+
1. **Search semantics differ across backends.**
|
|
48
|
+
`Memory` uses embedding-based cosine similarity. `FileSystem` would use keyword matching (tokenize + stem, like durable's current approach). Should the interface treat these as equivalent, or should backends declare their search capability? Options:
|
|
49
|
+
- Accept the difference — callers get whatever the backend can do.
|
|
50
|
+
- Add a `#search_strategy` or `#semantic?` predicate to the base class.
|
|
51
|
+
|
|
52
|
+
2. **Structured vs raw text storage.**
|
|
53
|
+
`DocumentStore` today stores raw text by key. `robot_lab-durable` stores structured `Entry` objects (confidence, category, domain, use_count). Two options:
|
|
54
|
+
- `FileSystem` stores raw text; durable serializes/deserializes `Entry` fields into the text before storing.
|
|
55
|
+
- `FileSystem` supports structured metadata alongside text (a `meta:` hash), which durable populates.
|
|
56
|
+
|
|
57
|
+
3. **Breaking change.** Refactoring `DocumentStore` from a concrete class to an abstract base is a breaking change — this is v0.2.0 territory.
|
|
58
|
+
|
|
59
|
+
## Rough Implementation Plan
|
|
60
|
+
|
|
61
|
+
1. Extract the current `DocumentStore` implementation into `DocumentStore::Memory`.
|
|
62
|
+
2. Define `DocumentStore` as an abstract base class with `NotImplementedError` stubs for each interface method.
|
|
63
|
+
3. Implement `DocumentStore::FileSystem` — port durable's `Store` (YAML, file locking, keyword search).
|
|
64
|
+
4. Update `robot_lab-durable` gemspec to add `robot_lab-document_store` as a dependency.
|
|
65
|
+
5. Replace `RobotLab::Durable::Store` with `DocumentStore::FileSystem` in durable's internals.
|
|
66
|
+
6. Bump `robot_lab-document_store` to v0.2.0; bump `robot_lab-durable` to v0.2.0.
|
|
@@ -0,0 +1,198 @@
|
|
|
1
|
+
# RAG Patterns
|
|
2
|
+
|
|
3
|
+
Retrieval-Augmented Generation (RAG) is the practice of retrieving relevant
|
|
4
|
+
documents at query time and injecting them into an LLM prompt as context. This
|
|
5
|
+
gives the model access to private or up-to-date information without fine-tuning.
|
|
6
|
+
|
|
7
|
+
`DocumentStore` is the retrieval layer. The LLM call is your responsibility —
|
|
8
|
+
typically via `RobotLab` robots or the `ruby_llm` gem.
|
|
9
|
+
|
|
10
|
+
---
|
|
11
|
+
|
|
12
|
+
## Pattern 1 — Standalone Retrieval
|
|
13
|
+
|
|
14
|
+
The simplest pattern: build the store at startup, query it per-request.
|
|
15
|
+
|
|
16
|
+
```ruby
|
|
17
|
+
require "robot_lab/document_store"
|
|
18
|
+
|
|
19
|
+
# ── Build once at startup ────────────────────────────────────────────────────
|
|
20
|
+
store = RobotLab::DocumentStore.new
|
|
21
|
+
|
|
22
|
+
Dir["docs/**/*.md"].each do |path|
|
|
23
|
+
key = File.basename(path, ".md").to_sym
|
|
24
|
+
store.store(key, File.read(path))
|
|
25
|
+
end
|
|
26
|
+
|
|
27
|
+
puts "Loaded #{store.size} documents"
|
|
28
|
+
|
|
29
|
+
# ── Query per-request ────────────────────────────────────────────────────────
|
|
30
|
+
query = "How do I investigate a slow Postgres query?"
|
|
31
|
+
hits = store.search(query, limit: 3)
|
|
32
|
+
|
|
33
|
+
context = hits.map.with_index(1) do |h, i|
|
|
34
|
+
"## Document #{i}: #{h[:key]}\n\n#{h[:text]}"
|
|
35
|
+
end.join("\n\n---\n\n")
|
|
36
|
+
|
|
37
|
+
prompt = <<~PROMPT
|
|
38
|
+
Answer the following question using only the provided documents.
|
|
39
|
+
|
|
40
|
+
#{context}
|
|
41
|
+
|
|
42
|
+
---
|
|
43
|
+
|
|
44
|
+
Question: #{query}
|
|
45
|
+
PROMPT
|
|
46
|
+
|
|
47
|
+
# Pass `prompt` to your LLM of choice
|
|
48
|
+
```
|
|
49
|
+
|
|
50
|
+
---
|
|
51
|
+
|
|
52
|
+
## Pattern 2 — RobotLab Memory Integration
|
|
53
|
+
|
|
54
|
+
When `robot_lab` core is loaded alongside this gem, `RobotLab::Memory` gains
|
|
55
|
+
four document-store methods automatically via `register_extension`.
|
|
56
|
+
|
|
57
|
+
```ruby
|
|
58
|
+
require "robot_lab"
|
|
59
|
+
require "robot_lab/document_store"
|
|
60
|
+
|
|
61
|
+
memory = RobotLab::Memory.new
|
|
62
|
+
|
|
63
|
+
# Store documents
|
|
64
|
+
memory.store_document(:runbook, File.read("ops/runbook.md"))
|
|
65
|
+
memory.store_document(:postmortem, File.read("ops/postmortem.md"))
|
|
66
|
+
|
|
67
|
+
# List keys
|
|
68
|
+
memory.document_keys # => [:runbook, :postmortem]
|
|
69
|
+
|
|
70
|
+
# Search
|
|
71
|
+
hits = memory.search_documents("database outage lock contention", limit: 2)
|
|
72
|
+
hits.each { |h| puts "#{h[:key]} #{h[:score].round(3)}" }
|
|
73
|
+
|
|
74
|
+
# Remove
|
|
75
|
+
memory.delete_document(:postmortem)
|
|
76
|
+
```
|
|
77
|
+
|
|
78
|
+
!!! note
|
|
79
|
+
`Memory#store_document` / `#search_documents` / `#delete_document` /
|
|
80
|
+
`#document_keys` are only available when `robot_lab-document_store` is loaded
|
|
81
|
+
before calling these methods.
|
|
82
|
+
|
|
83
|
+
---
|
|
84
|
+
|
|
85
|
+
## Pattern 3 — RobotLab Robot with RAG Context
|
|
86
|
+
|
|
87
|
+
Inject retrieved context directly into a robot's prompt:
|
|
88
|
+
|
|
89
|
+
```ruby
|
|
90
|
+
require "robot_lab"
|
|
91
|
+
require "robot_lab/document_store"
|
|
92
|
+
|
|
93
|
+
# ── Prepare store ────────────────────────────────────────────────────────────
|
|
94
|
+
store = RobotLab::DocumentStore.new
|
|
95
|
+
store.store(:api_guide, File.read("docs/api.md"))
|
|
96
|
+
store.store(:error_codes, File.read("docs/errors.md"))
|
|
97
|
+
store.store(:changelog, File.read("CHANGELOG.md"))
|
|
98
|
+
|
|
99
|
+
# ── Build robot ──────────────────────────────────────────────────────────────
|
|
100
|
+
robot = RobotLab.build(
|
|
101
|
+
name: "support_agent",
|
|
102
|
+
system_prompt: "You are a helpful support agent. Answer questions using only the context provided."
|
|
103
|
+
)
|
|
104
|
+
|
|
105
|
+
# ── Per-request RAG ──────────────────────────────────────────────────────────
|
|
106
|
+
def answer(robot, store, question)
|
|
107
|
+
hits = store.search(question, limit: 3)
|
|
108
|
+
context = hits.map { |h| h[:text] }.join("\n\n---\n\n")
|
|
109
|
+
|
|
110
|
+
robot.run(<<~PROMPT)
|
|
111
|
+
Context documents:
|
|
112
|
+
|
|
113
|
+
#{context}
|
|
114
|
+
|
|
115
|
+
---
|
|
116
|
+
|
|
117
|
+
User question: #{question}
|
|
118
|
+
PROMPT
|
|
119
|
+
end
|
|
120
|
+
|
|
121
|
+
result = answer(robot, store, "What changed in the last release?")
|
|
122
|
+
puts result.last_text_content
|
|
123
|
+
```
|
|
124
|
+
|
|
125
|
+
---
|
|
126
|
+
|
|
127
|
+
## Pattern 4 — Chunked Documents
|
|
128
|
+
|
|
129
|
+
For long documents, split into chunks before storing. Smaller chunks improve
|
|
130
|
+
retrieval precision because a single chunk covers a narrower topic.
|
|
131
|
+
|
|
132
|
+
```ruby
|
|
133
|
+
# Simple paragraph chunker
|
|
134
|
+
def chunk(text, max_words: 150)
|
|
135
|
+
paragraphs = text.split(/\n{2,}/).map(&:strip).reject(&:empty?)
|
|
136
|
+
chunks = []
|
|
137
|
+
buffer = []
|
|
138
|
+
word_count = 0
|
|
139
|
+
|
|
140
|
+
paragraphs.each do |para|
|
|
141
|
+
words = para.split.size
|
|
142
|
+
if word_count + words > max_words && buffer.any?
|
|
143
|
+
chunks << buffer.join("\n\n")
|
|
144
|
+
buffer = []
|
|
145
|
+
word_count = 0
|
|
146
|
+
end
|
|
147
|
+
buffer << para
|
|
148
|
+
word_count += words
|
|
149
|
+
end
|
|
150
|
+
chunks << buffer.join("\n\n") if buffer.any?
|
|
151
|
+
chunks
|
|
152
|
+
end
|
|
153
|
+
|
|
154
|
+
# Store with indexed chunk keys
|
|
155
|
+
store = RobotLab::DocumentStore.new
|
|
156
|
+
|
|
157
|
+
chunk(File.read("docs/runbook.md")).each_with_index do |text, i|
|
|
158
|
+
store.store(:"runbook_#{i}", text)
|
|
159
|
+
end
|
|
160
|
+
```
|
|
161
|
+
|
|
162
|
+
---
|
|
163
|
+
|
|
164
|
+
## Pattern 5 — Hybrid Key Filtering
|
|
165
|
+
|
|
166
|
+
Use `search` results alongside `keys` to build filtered views or verify coverage:
|
|
167
|
+
|
|
168
|
+
```ruby
|
|
169
|
+
# Find documents never retrieved by a set of test queries (`questions` is an Array of query strings you supply)
|
|
170
|
+
all_keys = store.keys.to_set
|
|
171
|
+
retrieved = questions.flat_map { |q| store.search(q, limit: 3).map { |r| r[:key] } }.to_set
|
|
172
|
+
never_hit = all_keys - retrieved
|
|
173
|
+
|
|
174
|
+
puts "Documents never retrieved: #{never_hit.to_a}"
|
|
175
|
+
```
|
|
176
|
+
|
|
177
|
+
---
|
|
178
|
+
|
|
179
|
+
## Tips
|
|
180
|
+
|
|
181
|
+
**Chunk size matters.** Chunks of 100–250 words typically give the best
|
|
182
|
+
recall/precision trade-off. Very long documents dilute the embedding signal;
|
|
183
|
+
very short chunks lose context.
|
|
184
|
+
|
|
185
|
+
**Limit and threshold.** Retrieve more than you need (`limit: 5`) then drop
|
|
186
|
+
results below a quality threshold (e.g., `score >= 0.4`) before building the
|
|
187
|
+
context string. This avoids injecting unrelated documents.
|
|
188
|
+
|
|
189
|
+
```ruby
|
|
190
|
+
hits = store.search(query, limit: 5).select { |r| r[:score] >= 0.4 }
|
|
191
|
+
```
|
|
192
|
+
|
|
193
|
+
**Re-embed on document update.** `store` replaces an existing key — calling it
|
|
194
|
+
again with updated text re-embeds and replaces the stored vector automatically.
|
|
195
|
+
|
|
196
|
+
**Persistent corpus.** `DocumentStore` is in-memory only. For a persistent
|
|
197
|
+
corpus, re-load documents from disk at startup. For production use cases that
|
|
198
|
+
need persistence, consider a dedicated vector database (pgvector, Qdrant, Weaviate).
|
|
@@ -1,23 +1,27 @@
|
|
|
1
1
|
#!/usr/bin/env ruby
|
|
2
2
|
# frozen_string_literal: true
|
|
3
3
|
|
|
4
|
-
#
|
|
4
|
+
# Embedding-Based Document Store Demo
|
|
5
5
|
#
|
|
6
|
-
# Demonstrates
|
|
7
|
-
# lightweight RAG store backed by fastembed
|
|
6
|
+
# Demonstrates RobotLab::DocumentStore standalone and via Memory#store_document /
|
|
7
|
+
# Memory#search_documents — a lightweight RAG store backed by fastembed
|
|
8
|
+
# (BAAI/bge-small-en-v1.5).
|
|
8
9
|
#
|
|
9
10
|
# Documents are multi-paragraph engineering guides stored as Markdown files in:
|
|
10
|
-
# examples/
|
|
11
|
+
# examples/01_basic_usage/
|
|
11
12
|
#
|
|
12
13
|
# Usage:
|
|
13
|
-
# ruby examples/
|
|
14
|
+
# ruby examples/01_basic_usage.rb
|
|
15
|
+
# ./examples/01_basic_usage.rb
|
|
14
16
|
# (Downloads the ~23 MB ONNX model on first run; cached afterwards.)
|
|
15
17
|
|
|
16
|
-
|
|
17
|
-
|
|
18
|
+
$LOAD_PATH.unshift File.expand_path('../lib', __dir__)
|
|
19
|
+
|
|
20
|
+
require 'robot_lab'
|
|
21
|
+
require 'robot_lab/document_store'
|
|
18
22
|
|
|
19
23
|
puts "=" * 60
|
|
20
|
-
puts "
|
|
24
|
+
puts "Embedding-Based Document Store Demo"
|
|
21
25
|
puts "=" * 60
|
|
22
26
|
puts
|
|
23
27
|
puts "Note: First run downloads the fastembed model (~23 MB, cached)."
|
|
@@ -26,7 +30,7 @@ puts
|
|
|
26
30
|
# ---------------------------------------------------------------------------
|
|
27
31
|
# Load documents from the companion directory
|
|
28
32
|
# ---------------------------------------------------------------------------
|
|
29
|
-
DOC_DIR = File.join(__dir__,
|
|
33
|
+
DOC_DIR = File.join(__dir__, '01_basic_usage')
|
|
30
34
|
|
|
31
35
|
DOCUMENTS = Dir[File.join(DOC_DIR, "*.md")].sort.each_with_object({}) do |path, h|
|
|
32
36
|
key = File.basename(path, ".md").to_sym
|