rakam-systems-vectorstore 0.1.1rc7__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- rakam_systems_vectorstore/MANIFEST.in +26 -0
- rakam_systems_vectorstore/README.md +1071 -0
- rakam_systems_vectorstore/__init__.py +93 -0
- rakam_systems_vectorstore/components/__init__.py +0 -0
- rakam_systems_vectorstore/components/chunker/__init__.py +19 -0
- rakam_systems_vectorstore/components/chunker/advanced_chunker.py +1019 -0
- rakam_systems_vectorstore/components/chunker/text_chunker.py +154 -0
- rakam_systems_vectorstore/components/embedding_model/__init__.py +0 -0
- rakam_systems_vectorstore/components/embedding_model/configurable_embeddings.py +546 -0
- rakam_systems_vectorstore/components/embedding_model/openai_embeddings.py +259 -0
- rakam_systems_vectorstore/components/loader/__init__.py +31 -0
- rakam_systems_vectorstore/components/loader/adaptive_loader.py +512 -0
- rakam_systems_vectorstore/components/loader/code_loader.py +699 -0
- rakam_systems_vectorstore/components/loader/doc_loader.py +812 -0
- rakam_systems_vectorstore/components/loader/eml_loader.py +556 -0
- rakam_systems_vectorstore/components/loader/html_loader.py +626 -0
- rakam_systems_vectorstore/components/loader/md_loader.py +622 -0
- rakam_systems_vectorstore/components/loader/odt_loader.py +750 -0
- rakam_systems_vectorstore/components/loader/pdf_loader.py +771 -0
- rakam_systems_vectorstore/components/loader/pdf_loader_light.py +723 -0
- rakam_systems_vectorstore/components/loader/tabular_loader.py +597 -0
- rakam_systems_vectorstore/components/vectorstore/__init__.py +0 -0
- rakam_systems_vectorstore/components/vectorstore/apps.py +10 -0
- rakam_systems_vectorstore/components/vectorstore/configurable_pg_vector_store.py +1661 -0
- rakam_systems_vectorstore/components/vectorstore/faiss_vector_store.py +878 -0
- rakam_systems_vectorstore/components/vectorstore/migrations/0001_initial.py +55 -0
- rakam_systems_vectorstore/components/vectorstore/migrations/__init__.py +0 -0
- rakam_systems_vectorstore/components/vectorstore/models.py +10 -0
- rakam_systems_vectorstore/components/vectorstore/pg_models.py +97 -0
- rakam_systems_vectorstore/components/vectorstore/pg_vector_store.py +827 -0
- rakam_systems_vectorstore/config.py +266 -0
- rakam_systems_vectorstore/core.py +8 -0
- rakam_systems_vectorstore/pyproject.toml +113 -0
- rakam_systems_vectorstore/server/README.md +290 -0
- rakam_systems_vectorstore/server/__init__.py +20 -0
- rakam_systems_vectorstore/server/mcp_server_vector.py +325 -0
- rakam_systems_vectorstore/setup.py +103 -0
- rakam_systems_vectorstore-0.1.1rc7.dist-info/METADATA +370 -0
- rakam_systems_vectorstore-0.1.1rc7.dist-info/RECORD +40 -0
- rakam_systems_vectorstore-0.1.1rc7.dist-info/WHEEL +4 -0
@@ -0,0 +1,1071 @@

# AI VectorStore

A modular, production-ready vector store system for semantic search and retrieval-augmented generation (RAG) applications. Part of the Rakam Systems AI framework.

## Overview

The `ai_vectorstore` package provides a comprehensive set of components for building vector-based search and retrieval systems. It supports multiple backend implementations (PostgreSQL with pgvector, FAISS) and includes all the components needed for a complete RAG pipeline.

### What's New

- ✨ **ConfigurablePgVectorStore**: Enhanced vector store with full YAML/JSON configuration support
- ⚙️ **Configuration System**: Centralized configuration management with validation and environment variable support
- 🔄 **Update Operations**: Update existing vectors, embeddings, and metadata without re-indexing
- 🔌 **Pluggable Embeddings**: Support for multiple embedding providers (SentenceTransformers, OpenAI, Cohere)
- 📊 **Enhanced Search**: Multiple similarity metrics (cosine, L2, dot product) with configurable hybrid search
- 🎯 **Better DX**: Improved developer experience with clearer APIs and comprehensive documentation

### Quick Links

- [Installation](#installation)
- [Quick Start](#quick-start)
- [Configurable Vector Store (Recommended)](#configurable-vector-store-recommended)
- [Configuration](#configuration)
- [API Reference](#api-reference)
- [Migration Guide](#migration-guide)
## Features

- **Multiple Vector Store Backends**
  - PostgreSQL with pgvector (persistent, production-ready)
  - FAISS (in-memory, high-performance)
  - Configurable vector store with YAML/JSON configuration support

- **Hybrid Search Capabilities**
  - Vector similarity search (cosine, L2, dot product)
  - Full-text search (PostgreSQL)
  - Combined hybrid search with configurable weighting
  - Adjustable alpha parameter for search balance

- **Advanced Retrieval**
  - Built-in re-ranking for improved relevance
  - Metadata filtering with Django ORM support
  - Collection-based organization
  - LRU caching for query performance optimization
  - Update operations for existing vectors

- **Flexible Embedding Options**
  - Local embedding models (SentenceTransformers)
  - OpenAI API integration
  - Cohere API support
  - Configurable embedding dimensions
  - Pluggable embedding backends

- **Complete RAG Pipeline Components**
  - Document loaders (file, adaptive)
  - Text chunkers (simple, text-based)
  - Embedding models (configurable, OpenAI)
  - Indexers (simple indexer)
  - Retrievers (basic retriever)
  - Re-rankers (model-based)

- **Configuration Management**
  - YAML/JSON configuration files
  - Environment variable overrides
  - Programmatic configuration
  - Configuration validation and defaults
## Architecture

The package follows a modular architecture with clear interfaces:

```
ai_vectorstore/
├── core.py                  # Core data structures (VSFile, Node, NodeMetadata)
├── config.py                # Configuration management system
├── components/
│   ├── vectorstore/         # Vector store implementations
│   │   ├── pg_vector_store.py              # PostgreSQL backend
│   │   ├── configurable_pg_vector_store.py # Enhanced configurable backend
│   │   ├── faiss_vector_store.py           # FAISS backend
│   │   ├── pg_models.py                    # Django ORM models
│   │   └── migrations/                     # Database migrations
│   ├── chunker/             # Text chunking components
│   │   ├── simple_chunker.py  # Basic text chunker
│   │   └── text_chunker.py    # Advanced text chunker
│   ├── embedding_model/     # Embedding generation
│   │   ├── configurable_embeddings.py  # Pluggable embedding system
│   │   └── openai_embeddings.py        # OpenAI API embeddings
│   ├── indexer/             # Document indexing
│   │   └── simple_indexer.py  # Basic indexer
│   ├── loader/              # Document loading
│   │   ├── file_loader.py     # File loading
│   │   └── adaptive_loader.py # Adaptive loading
│   ├── reranker/            # Result re-ranking
│   │   └── model_reranker.py  # Model-based reranker
│   └── retriever/           # Search and retrieval
│       └── basic_retriever.py # Basic retriever
└── server/                  # MCP server implementation
    └── mcp_server_vector.py
```
## Installation

### As Part of Rakam Systems (Recommended)

```bash
cd app/rakam_systems

# Full AI Vectorstore with all features
pip install -e ".[ai-vectorstore]"
```

### Standalone Installation

For granular control over dependencies:

```bash
cd app/rakam_systems/rakam_systems/ai_vectorstore

# Core only (minimal)
pip install -e .

# With PostgreSQL backend
pip install -e ".[postgres]"

# With FAISS backend
pip install -e ".[faiss]"

# With local embeddings (SentenceTransformers)
pip install -e ".[local-embeddings]"

# With OpenAI or Cohere embeddings
pip install -e ".[openai]"  # or ".[cohere]"

# With document loaders
pip install -e ".[loaders]"

# Everything
pip install -e ".[all]"
```

**📖 See [INSTALLATION.md](INSTALLATION.md) for the complete installation guide** | **⚡ [QUICK_INSTALL.md](QUICK_INSTALL.md) for a quick reference**
## Quick Start

### PostgreSQL Vector Store

```python
from ai_vectorstore.components.vectorstore.pg_vector_store import PgVectorStore
from ai_vectorstore.core import Node, NodeMetadata

# Initialize the vector store
vector_store = PgVectorStore(
    name="my_vector_store",
    embedding_model="Snowflake/snowflake-arctic-embed-m",
    use_embedding_api=False  # Use a local model
)

# Create a collection
collection_name = "documents"
vector_store.create_collection(collection_name)

# Add documents
nodes = [
    Node(
        content="Your document content here",
        metadata=NodeMetadata(
            source_file_uuid="file-uuid",
            position=0,
            custom={"title": "Document Title"}
        )
    )
]
vector_store.add_nodes(nodes, collection_name)

# Search
results = vector_store.search(
    query="Your search query",
    collection_name=collection_name,
    top_k=5,
    search_type="hybrid"  # or "vector", "fts"
)
```

### FAISS Vector Store

```python
from ai_vectorstore.components.vectorstore.faiss_vector_store import FaissStore
from ai_vectorstore.core import Node, NodeMetadata

# Initialize the FAISS store
faiss_store = FaissStore(
    name="my_faiss_store",
    base_index_path="./faiss_indexes",
    embedding_model="Snowflake/snowflake-arctic-embed-m"
)

# Create a collection and add nodes
collection_name = "documents"
faiss_store.create_collection(collection_name)

nodes = [
    Node(
        content="Your content here",
        metadata=NodeMetadata(
            source_file_uuid="file-uuid",
            position=0
        )
    )
]
faiss_store.add_nodes(nodes, collection_name)

# Search
results = faiss_store.query_nodes(
    query="search query",
    collection_name=collection_name,
    top_k=5
)
```
## Configurable Vector Store (Recommended)

The `ConfigurablePgVectorStore` is an enhanced, production-ready vector store that supports configuration via YAML/JSON files or dictionaries. This is the recommended approach for production deployments.

### Configuration System

The configuration system provides a unified way to manage all vector store settings:

```python
from ai_vectorstore.config import VectorStoreConfig, EmbeddingConfig, SearchConfig
from ai_vectorstore.components.vectorstore.configurable_pg_vector_store import ConfigurablePgVectorStore

# Option 1: Use defaults
vector_store = ConfigurablePgVectorStore()

# Option 2: Load from a YAML file
vector_store = ConfigurablePgVectorStore(
    name="my_store",
    config="/path/to/config.yaml"
)

# Option 3: Programmatic configuration
config = VectorStoreConfig(
    name="custom_store",
    embedding=EmbeddingConfig(
        model_type="sentence_transformer",
        model_name="Snowflake/snowflake-arctic-embed-m",
        batch_size=32
    ),
    search=SearchConfig(
        similarity_metric="cosine",
        default_top_k=5,
        enable_hybrid_search=True,
        hybrid_alpha=0.7,
        rerank=True
    )
)
vector_store = ConfigurablePgVectorStore(name="my_store", config=config)
```
### YAML Configuration Example

Create a `vectorstore_config.yaml` file:

```yaml
name: production_vectorstore

embedding:
  model_type: sentence_transformer
  model_name: Snowflake/snowflake-arctic-embed-m
  batch_size: 32
  normalize: true

database:
  host: localhost
  port: 5432
  database: vectorstore_db
  user: postgres
  password: postgres

search:
  similarity_metric: cosine
  default_top_k: 5
  enable_hybrid_search: true
  hybrid_alpha: 0.7
  rerank: true
  search_buffer_factor: 2

index:
  chunk_size: 512
  chunk_overlap: 50
  batch_insert_size: 100

enable_caching: true
cache_size: 1000
enable_logging: true
log_level: INFO
```
### Using ConfigurablePgVectorStore

```python
from ai_vectorstore.components.vectorstore.configurable_pg_vector_store import ConfigurablePgVectorStore
from ai_vectorstore.core import Node, NodeMetadata

# Initialize with config
store = ConfigurablePgVectorStore(
    name="my_vectorstore",
    config="config.yaml"
)
store.setup()

# Create a collection
store.get_or_create_collection("documents")

# Add nodes
nodes = [
    Node(
        content="Document content here",
        metadata=NodeMetadata(
            source_file_uuid="file-123",
            position=0,
            custom={"title": "Document Title"}
        )
    )
]
store.add_nodes("documents", nodes)

# Search with configuration defaults
results, nodes = store.search(
    collection_name="documents",
    query="search query"
)

# Override configuration at search time
results, nodes = store.search(
    collection_name="documents",
    query="search query",
    distance_type="l2",
    number=10,
    hybrid_search=False
)

# Update existing vectors
store.update_vector(
    collection_name="documents",
    node_id=1,
    new_content="Updated content",  # Will regenerate the embedding
    new_metadata={"status": "reviewed"}
)

# Delete specific nodes
store.delete_nodes("documents", [1, 2, 3])

# Get collection information
info = store.get_collection_info("documents")
print(f"Collection has {info['node_count']} nodes")
```
### Configuration Classes

**EmbeddingConfig** - Embedding model configuration:
- `model_type`: "sentence_transformer", "openai", or "cohere"
- `model_name`: Model identifier
- `api_key`: API key (auto-loaded from the environment)
- `batch_size`: Batch size for embedding generation
- `normalize`: Whether to normalize embeddings
- `dimensions`: Embedding dimensions (auto-detected)

**DatabaseConfig** - Database connection settings:
- `host`, `port`, `database`, `user`, `password`
- `pool_size`, `max_overflow`: Connection pooling
- Auto-loads from `POSTGRES_*` environment variables

**SearchConfig** - Search behavior configuration:
- `similarity_metric`: "cosine", "l2", or "dot_product"
- `default_top_k`: Default number of results
- `enable_hybrid_search`: Enable hybrid search
- `hybrid_alpha`: Vector/keyword balance (0-1)
- `rerank`: Enable re-ranking
- `search_buffer_factor`: Candidate buffer for re-ranking

**IndexConfig** - Indexing configuration:
- `chunk_size`, `chunk_overlap`: Chunking parameters
- `enable_parallel_processing`: Enable parallel processing
- `parallel_workers`: Number of workers
- `batch_insert_size`: Batch size for inserts
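The three `similarity_metric` options correspond to standard vector comparisons. As a plain-Python illustration (independent of the package, for intuition only):

```python
import math

def dot_product(a, b):
    # Dot product: rewards vectors that agree in direction and are large in magnitude.
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a, b):
    # Euclidean (L2) distance: lower means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Cosine similarity: direction only, in [-1, 1]; equals the dot product
    # when both vectors are unit-length (see the `normalize` option).
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot_product(a, b) / (norm_a * norm_b)

a, b = [1.0, 0.0], [1.0, 1.0]
print(cosine_similarity(a, b))  # ≈ 0.707
```

With `normalize: true` in `EmbeddingConfig`, cosine similarity and dot product rank results identically, which is why normalized embeddings are a common default.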
## Core Data Structures

### VSFile

Represents a source file to be processed:

```python
from ai_vectorstore.core import VSFile

file = VSFile(file_path="/path/to/document.pdf")
# Automatically extracts the filename, MIME type, and UUID
```

### Node

Represents a chunk of content with metadata:

```python
from ai_vectorstore.core import Node, NodeMetadata

node = Node(
    content="Document chunk text",
    metadata=NodeMetadata(
        source_file_uuid="file-uuid",
        position=0,  # page number or chunk position
        custom={"author": "John Doe", "date": "2025-01-01"}
    )
)
```

### NodeMetadata

Stores metadata about each node:

- `node_id`: Unique identifier (auto-assigned)
- `source_file_uuid`: Reference to the source file
- `position`: Position in the source (page number, chunk index, etc.)
- `custom`: Dictionary for arbitrary metadata
## PostgreSQL Backend

### Features

- ✅ Persistent storage with ACID transactions
- ✅ Hybrid search (vector + full-text)
- ✅ Built-in re-ranking with cross-encoder models
- ✅ Metadata filtering and custom queries
- ✅ Collection-based organization
- ✅ Query result caching (LRU)
- ✅ Django ORM integration
- ✅ Update operations (content, embeddings, metadata)
- ✅ Multiple similarity metrics (cosine, L2, dot product)
- ✅ Configuration-driven architecture

### Setup

1. **Start PostgreSQL with pgvector:**

```bash
docker run -d \
  --name postgres-vectorstore \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=vectorstore_db \
  -p 5432:5432 \
  pgvector/pgvector:pg16
```

2. **Configure Django settings:**

```python
import os
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'your_app.settings')

# In your Django settings:
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'vectorstore_db',
        'USER': 'postgres',
        'PASSWORD': 'postgres',
        'HOST': 'localhost',
        'PORT': '5432',
    }
}

INSTALLED_APPS = [
    'ai_vectorstore.components.vectorstore',
    # ... other apps
]
```

3. **Run migrations:**

```bash
python manage.py migrate
```
### Search Types

**Vector Search** - Pure semantic similarity:
```python
results = vector_store.search(
    query="machine learning",
    collection_name="docs",
    search_type="vector",
    top_k=5
)
```

**Full-Text Search** - Traditional keyword matching:
```python
results = vector_store.search(
    query="exact phrase match",
    collection_name="docs",
    search_type="fts",
    top_k=5
)
```

**Hybrid Search** - Combined approach (usually the best results):
```python
results = vector_store.search(
    query="machine learning algorithms",
    collection_name="docs",
    search_type="hybrid",
    alpha=0.5,  # 0 = pure FTS, 1 = pure vector
    top_k=5,
    rerank=True  # Enable cross-encoder re-ranking
)
```
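Conceptually, the `alpha` parameter is a linear blend of the two score lists. A minimal sketch of the weighting (plain Python; the store's actual score fusion and normalization may differ):

```python
def hybrid_score(vector_score, fts_score, alpha):
    """Blend a vector-similarity score with a full-text score.

    alpha=1.0 -> pure vector search, alpha=0.0 -> pure full-text search.
    Both inputs are assumed to already be normalized to [0, 1].
    """
    return alpha * vector_score + (1 - alpha) * fts_score

# A document that matches semantically (0.9) but not lexically (0.2):
print(hybrid_score(0.9, 0.2, alpha=1.0))  # 0.9 (vector only)
print(hybrid_score(0.9, 0.2, alpha=0.0))  # 0.2 (full-text only)
print(hybrid_score(0.9, 0.2, alpha=0.5))  # 0.55 (balanced)
```

This is why a high alpha biases results toward semantic matches while still rewarding exact keyword hits; re-ranking then reorders the blended top candidates.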
### Advanced Features

**Metadata Filtering:**
```python
# Custom filtering with the Django ORM
from ai_vectorstore.components.vectorstore.pg_models import NodeEntry

filtered_results = vector_store.search(
    query="query text",
    collection_name="docs",
    filter_func=lambda qs: qs.filter(
        metadata__custom__author="John Doe"
    )
)
```

**Batch Operations:**
```python
# Batch add nodes
vector_store.add_nodes_batch(
    nodes_list=all_nodes,
    collection_name="docs",
    batch_size=100
)
```

**Collection Management:**
```python
# List collections
collections = vector_store.list_collections()

# Delete a collection
vector_store.delete_collection("old_collection")

# Get collection stats (PgVectorStore)
stats = vector_store.get_collection_stats("docs")

# Get collection info (ConfigurablePgVectorStore)
info = vector_store.get_collection_info("docs")
print(f"Nodes: {info['node_count']}, Dimensions: {info['embedding_dim']}")
```

**Update Operations (ConfigurablePgVectorStore):**
```python
from ai_vectorstore.components.vectorstore.configurable_pg_vector_store import ConfigurablePgVectorStore

store = ConfigurablePgVectorStore(config="config.yaml")

# Update content (the embedding is regenerated automatically)
store.update_vector(
    collection_name="docs",
    node_id=123,
    new_content="Updated document content",
    new_metadata={"status": "reviewed", "version": 2}
)

# Update only the embedding (e.g., with an improved model)
new_embedding = embedding_model.encode("document content")
store.update_vector(
    collection_name="docs",
    node_id=123,
    new_embedding=new_embedding
)

# Update only metadata
store.update_vector(
    collection_name="docs",
    node_id=123,
    new_metadata={"priority": "high"}
)

# Delete specific nodes
store.delete_nodes(collection_name="docs", node_ids=[123, 456, 789])
```
## FAISS Backend

### Features

- ✅ In-memory, extremely fast search
- ✅ No database required
- ✅ Persistent index saving/loading
- ✅ Collection-based organization
- ✅ Perfect for prototyping and development

### Usage

```python
from ai_vectorstore.components.vectorstore.faiss_vector_store import FaissStore

# Initialize
store = FaissStore(
    base_index_path="./indexes",
    embedding_model="Snowflake/snowflake-arctic-embed-m"
)

# Operations are similar to the PostgreSQL backend
store.create_collection("docs")
store.add_nodes(nodes, "docs")
results = store.query_nodes("query", "docs", top_k=5)

# Save/load indexes
store.save_vector_store()  # Auto-saves to base_index_path
store.load_vector_store()  # Loaded on init unless initialising=True
```
## Configuration

### Environment Variables

The configuration system automatically loads values from environment variables:

```bash
# Embedding API keys
export OPENAI_API_KEY="your-openai-api-key"
export COHERE_API_KEY="your-cohere-api-key"

# PostgreSQL connection (auto-loaded by DatabaseConfig)
export POSTGRES_DB="vectorstore_db"
export POSTGRES_USER="postgres"
export POSTGRES_PASSWORD="postgres"
export POSTGRES_HOST="localhost"
export POSTGRES_PORT="5432"

# Django settings
export DJANGO_SETTINGS_MODULE="your_app.settings"
```

### Loading and Saving Configurations

```python
from ai_vectorstore.config import VectorStoreConfig, load_config

# Load from a file (auto-detects the format)
config = load_config("config.yaml")  # or config.json

# Load from a dictionary
config_dict = {
    "name": "my_store",
    "embedding": {"model_name": "custom-model"},
    "search": {"default_top_k": 10}
}
config = load_config(config_dict)

# Validate the configuration
config.validate()

# Save to a file
config.save_yaml("output_config.yaml")
config.save_json("output_config.json")

# Convert to a dictionary
config_dict = config.to_dict()
```
### Embedding Models

**Local Models (SentenceTransformers):**
- `Snowflake/snowflake-arctic-embed-m` (default, 768-dim, recommended)
- `sentence-transformers/all-MiniLM-L6-v2` (384-dim, fast)
- `BAAI/bge-large-en-v1.5` (1024-dim, high quality)
- `sentence-transformers/all-mpnet-base-v2` (768-dim, balanced)

**OpenAI API:**
- `text-embedding-3-small` (1536-dim)
- `text-embedding-3-large` (3072-dim)
- `text-embedding-ada-002` (1536-dim, legacy)

**Cohere API:**
- `embed-english-v3.0`
- `embed-multilingual-v3.0`

**Legacy PgVectorStore (deprecated):**
```python
# Use a local model
vector_store = PgVectorStore(
    embedding_model="Snowflake/snowflake-arctic-embed-m",
    use_embedding_api=False
)

# Use the OpenAI API
vector_store = PgVectorStore(
    use_embedding_api=True,
    api_model="text-embedding-3-small"
)
```

**ConfigurablePgVectorStore (recommended):**
```python
from ai_vectorstore.config import VectorStoreConfig, EmbeddingConfig
from ai_vectorstore.components.vectorstore.configurable_pg_vector_store import ConfigurablePgVectorStore

# Local SentenceTransformer model
config = VectorStoreConfig(
    embedding=EmbeddingConfig(
        model_type="sentence_transformer",
        model_name="Snowflake/snowflake-arctic-embed-m",
        batch_size=32,
        normalize=True
    )
)
vector_store = ConfigurablePgVectorStore(config=config)

# OpenAI API
config = VectorStoreConfig(
    embedding=EmbeddingConfig(
        model_type="openai",
        model_name="text-embedding-3-small",
        api_key="your-api-key"  # Or set the OPENAI_API_KEY env var
    )
)
vector_store = ConfigurablePgVectorStore(config=config)

# Cohere API
config = VectorStoreConfig(
    embedding=EmbeddingConfig(
        model_type="cohere",
        model_name="embed-english-v3.0",
        api_key="your-api-key"  # Or set the COHERE_API_KEY env var
    )
)
vector_store = ConfigurablePgVectorStore(config=config)
```
## Examples

Comprehensive examples are available in the `examples/ai_vectorstore_examples/` directory:

- **`postgres_vectorstore_example.py`** - Full PostgreSQL implementation with hybrid search
- **`basic_faiss_example.py`** - FAISS in-memory vector store
- **`run_postgres_example.sh`** - One-command setup script

See [examples/ai_vectorstore_examples/README.md](../../examples/ai_vectorstore_examples/README.md) for detailed documentation.
## Performance Considerations

### PostgreSQL

- **Indexing**: Uses pgvector's HNSW (Hierarchical Navigable Small World) index for fast approximate nearest-neighbor search
- **Caching**: LRU cache for query results (configurable cache size)
- **Batch Operations**: Use `add_nodes_batch()` for bulk inserts
- **Connection Pooling**: Configure Django database connection pooling for production

### FAISS

- **Index Types**: Uses IndexFlatL2 by default for accuracy; can be customized for larger datasets
- **Memory Usage**: The entire index lives in memory; monitor RAM usage with large datasets
- **Persistence**: Save/load operations can be expensive; use them sparingly
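The query-result cache mentioned above behaves like a least-recently-used cache. A toy stand-in using the standard library (not the store's actual implementation; its capacity corresponds to the `cache_size` setting):

```python
from functools import lru_cache

backend_calls = 0

@lru_cache(maxsize=1000)  # analogous to cache_size: 1000
def cached_search(query: str, top_k: int):
    # Stand-in for an expensive embed-the-query-then-scan round trip.
    global backend_calls
    backend_calls += 1
    return f"results for {query!r} (top_k={top_k})"

cached_search("machine learning", 5)  # cache miss: hits the backend
cached_search("machine learning", 5)  # cache hit: backend not touched
print(backend_calls)  # 1
```

The implication for tuning: repeated identical queries are nearly free, but cached entries must be invalidated when the underlying collection changes.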
## Best Practices
|
|
757
|
+
|
|
758
|
+
1. **Choose the Right Backend:**
|
|
759
|
+
- Use `ConfigurablePgVectorStore` for production deployments (recommended)
|
|
760
|
+
- Use legacy `PgVectorStore` for backward compatibility
|
|
761
|
+
- Use FAISS for prototyping, maximum speed, and memory-based applications
|
|
762
|
+
- Leverage configuration files for reproducible deployments
|
|
763
|
+
|
|
764
|
+
2. **Configuration Management:**
|
|
765
|
+
- Use YAML/JSON configuration files for production
|
|
766
|
+
- Store configurations in version control
|
|
767
|
+
- Use environment variables for secrets (API keys, passwords)
|
|
768
|
+
- Validate configurations before deployment
|
|
769
|
+
- Document configuration changes
|
|
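
A minimal sketch of that pattern — a sanitized config file kept in version control, with secrets resolved from the environment at load time. The file path, keys, and environment variable name below are hypothetical:

```python
import json
import os

# Hypothetical helper: load a JSON config and overlay secrets from the
# environment so credentials never live in version control.
def load_config_with_secrets(path, env_map):
    with open(path) as f:
        config = json.load(f)
    for key, env_var in env_map.items():
        value = os.environ.get(env_var)
        if value is not None:
            config[key] = value
    return config

# Example: write a sanitized config, then resolve the password at load time.
with open("/tmp/vs_config.json", "w") as f:
    json.dump({"db_name": "vectorstore_db", "db_password": None}, f)

os.environ["VS_DB_PASSWORD"] = "s3cret"
config = load_config_with_secrets("/tmp/vs_config.json",
                                  {"db_password": "VS_DB_PASSWORD"})
print(config["db_name"])
```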

3. **Optimize Embedding Models:**
   - Start with `Snowflake/snowflake-arctic-embed-m` for balanced performance
   - Use smaller models (384-dim) for speed-critical applications
   - Use larger models (1024-dim+) for maximum accuracy
   - Consider API-based models (OpenAI, Cohere) for convenience
   - Batch encode documents for better throughput
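
The batch-encoding advice can be sketched as a small helper; `encode` below stands in for an embedding model's batch call (e.g. `SentenceTransformer.encode`), replaced here by a toy function:

```python
# Hedged sketch: calling the model once per batch instead of once per text
# usually improves throughput considerably.

def batched(items, batch_size):
    """Yield successive fixed-size batches from a list."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_all(texts, encode, batch_size=32):
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(encode(batch))  # one call per batch, not per text
    return vectors

# Toy encoder: "embeds" each text as its length.
fake_encode = lambda batch: [[float(len(t))] for t in batch]
print(embed_all(["a", "bb", "ccc"], fake_encode, batch_size=2))
```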

4. **Leverage Hybrid Search:**
   - Use `hybrid_alpha=0.7` as a starting point (favors vector search)
   - Tune alpha to your specific use case:
     - 0.9-1.0: pure semantic search
     - 0.5-0.7: balanced approach
     - 0.0-0.3: keyword-focused search
   - Enable re-ranking for best results (adds latency but improves relevance)
   - Set `search_buffer_factor=2` to retrieve more candidates for re-ranking
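
Alpha typically blends the two signals as a linear combination; the sketch below shows the usual formulation, which may differ in detail from the store's exact fusion logic:

```python
# Assumed fusion rule: a convex combination of the two (pre-normalized)
# score sources. alpha=1.0 is pure vector search, alpha=0.0 pure keyword.

def hybrid_score(vector_score, keyword_score, alpha=0.7):
    """Blend semantic and keyword scores into a single ranking score."""
    return alpha * vector_score + (1 - alpha) * keyword_score

# alpha=0.7 weights the semantic signal more heavily
print(hybrid_score(0.9, 0.2, alpha=0.7))
```

Because the blend is linear, tuning alpha only shifts the relative weight of the two rankings; it cannot surface a document that scores zero on both.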

5. **Metadata Design:**
   - Store searchable metadata in custom fields
   - Use consistent metadata structure across documents
   - Index frequently queried fields
   - Include version and timestamp metadata
   - Use metadata for filtering and access control
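
In practice the guidelines above might look like the following; the field names are illustrative, not a required schema:

```python
# Sketch of a consistent metadata shape plus a simple role filter.
from datetime import datetime, timezone

def make_metadata(source, section, version, roles):
    return {
        "source": source,            # file the chunk came from
        "section": section,          # for result ordering / display
        "version": version,          # track re-indexing
        "indexed_at": datetime.now(timezone.utc).isoformat(),
        "allowed_roles": roles,      # metadata-based access control
    }

def filter_nodes(nodes, role):
    """Keep only nodes the given role may see."""
    return [n for n in nodes if role in n["metadata"]["allowed_roles"]]

nodes = [
    {"content": "public text",
     "metadata": make_metadata("a.md", 1, "v1", ["user", "admin"])},
    {"content": "internal text",
     "metadata": make_metadata("b.md", 2, "v1", ["admin"])},
]
print([n["content"] for n in filter_nodes(nodes, "user")])
```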

6. **Chunking Strategy:**
   - Keep chunks between 200-500 tokens for optimal retrieval
   - Include overlap (50-100 tokens) between chunks for context preservation
   - Store chunk position for result ordering
   - Consider semantic chunking for better coherence
   - Configure `chunk_size` and `chunk_overlap` in IndexConfig
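
These parameters can be sketched as a simple fixed-size splitter, approximating tokens by whitespace-separated words (the real chunkers in `components/chunker/` are considerably more sophisticated):

```python
# Minimal sketch of fixed-size chunking with overlap. Words stand in for
# tokens here; a real implementation would use the model's tokenizer.

def chunk_words(text, chunk_size=300, chunk_overlap=50):
    words = text.split()
    step = chunk_size - chunk_overlap  # stride between chunk starts
    chunks = []
    for position, start in enumerate(range(0, len(words), step)):
        chunks.append({
            "position": position,  # stored for result ordering
            "content": " ".join(words[start:start + chunk_size]),
        })
        if start + chunk_size >= len(words):
            break  # last chunk reached the end of the text
    return chunks

chunks = chunk_words("w " * 700, chunk_size=300, chunk_overlap=50)
print(len(chunks))
```

With 700 words, a 300-word window, and 50 words of overlap, consecutive chunks start 250 words apart, so each chunk shares its last 50 words with the next one.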

7. **Update and Maintenance:**
   - Use `update_vector()` for content corrections without re-indexing
   - Periodically update embeddings with improved models
   - Monitor and clean up outdated nodes
   - Use batch operations for large-scale updates
   - Keep track of node IDs for efficient updates
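
One way to keep track of node IDs for efficient updates is a small registry maintained at insert time; this is an illustrative pattern, not a built-in feature of the store:

```python
# Illustrative pattern: remember which node IDs came from which source
# document, so later update_vector()/delete_nodes() calls can target them.

class NodeRegistry:
    def __init__(self):
        self._by_source = {}

    def record(self, source, node_ids):
        self._by_source.setdefault(source, []).extend(node_ids)

    def ids_for(self, source):
        return list(self._by_source.get(source, []))

registry = NodeRegistry()
registry.record("docs/guide.md", [101, 102, 103])
registry.record("docs/faq.md", [104])

# When guide.md changes, these are the IDs to pass to delete_nodes()
# before re-inserting the freshly chunked content.
print(registry.ids_for("docs/guide.md"))
```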

8. **Performance Optimization:**
   - Enable caching with an appropriate `cache_size` (default: 1000)
   - Use batch inserts with an optimal `batch_insert_size` (default: 100)
   - Monitor database connection pool settings
   - Consider parallel processing for large datasets
   - Use the appropriate similarity metric for your use case:
     - `cosine`: normalized vectors (recommended for most cases)
     - `l2`: Euclidean distance (good for non-normalized vectors)
     - `dot_product`: fast but requires normalized vectors
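
For reference, the three metrics can be written out directly; for unit-length vectors the dot product coincides with cosine similarity, which is why `dot_product` requires normalized inputs:

```python
# Pure-Python sketch of the three similarity/distance metrics.
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
print(round(cosine(a, b), 4))  # scale-invariant similarity

# After normalizing to unit length, dot product equals cosine similarity.
na = [x / math.sqrt(dot(a, a)) for x in a]
nb = [x / math.sqrt(dot(b, b)) for x in b]
print(round(dot(na, nb), 4))
```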

## Integration with RAG Pipeline

```python
from ai_vectorstore.components.loader.file_loader import FileLoader
from ai_vectorstore.components.chunker.simple_chunker import SimpleChunker
from ai_vectorstore.components.embedding_model.openai_embeddings import OpenAIEmbeddings
from ai_vectorstore.components.vectorstore.pg_vector_store import PgVectorStore
from ai_vectorstore.components.retriever.basic_retriever import BasicRetriever

# 1. Load documents
loader = FileLoader()
documents = loader.load("/path/to/documents")

# 2. Chunk documents
chunker = SimpleChunker()
chunks = chunker.run(documents)

# 3. Generate embeddings and store
vector_store = PgVectorStore()
vector_store.create_collection("knowledge_base")
vector_store.add_nodes(chunks, "knowledge_base")

# 4. Retrieve relevant context
retriever = BasicRetriever(vector_store=vector_store)
relevant_docs = retriever.retrieve(
    query="user question",
    collection_name="knowledge_base",
    top_k=5
)

# 5. Use with your LLM
# context = "\n\n".join([doc.content for doc in relevant_docs])
# response = llm.generate(prompt + context)
```

## Troubleshooting

### PostgreSQL Connection Issues
```bash
# Check if PostgreSQL is running
docker ps | grep postgres

# Check logs
docker logs postgres-vectorstore

# Verify connection
psql -h localhost -U postgres -d vectorstore_db
```

### Embedding Model Issues
```bash
# Clear cache and re-download
rm -rf ~/.cache/huggingface/hub

# Verify GPU availability (for GPU models)
python -c "import torch; print(torch.cuda.is_available())"
```

### Migration Issues
```bash
# Reset migrations (development only!)
python manage.py migrate ai_vectorstore zero
python manage.py migrate
```

## API Reference

### ConfigurablePgVectorStore (Recommended)

**Initialization:**
- `__init__(name: str, config: Union[VectorStoreConfig, Dict, str])` - Initialize with configuration
- `setup()` - Set up resources and connections

**Collection Management:**
- `get_or_create_collection(collection_name: str, embedding_dim: int)` - Get or create a collection
- `create_collection_from_nodes(collection_name: str, nodes: List[Node])` - Create from nodes
- `create_collection_from_files(collection_name: str, files: List[VSFile])` - Create from files
- `list_collections()` - List all collections
- `get_collection_info(collection_name: str)` - Get collection information
- `delete_collection(collection_name: str)` - Delete a collection

**Node Operations:**
- `add_nodes(collection_name: str, nodes: List[Node])` - Add nodes to a collection
- `update_vector(collection_name: str, node_id: int, new_content: str, new_embedding: List[float], new_metadata: Dict)` - Update an existing vector
- `delete_nodes(collection_name: str, node_ids: List[int])` - Delete specific nodes

**Search:**
- `search(collection_name: str, query: str, distance_type: str, number: int, meta_data_filters: Dict, hybrid_search: bool)` - Search a collection with hybrid support
- `query(vector: List[float], top_k: int, **kwargs)` - Query by vector (VectorStore interface)

**Lifecycle:**
- `shutdown()` - Clean up and shut down

### PgVectorStore (Legacy)
- `create_collection(name: str)` - Create a new collection
- `add_nodes(nodes: List[Node], collection_name: str)` - Add nodes to a collection
- `search(query: str, collection_name: str, search_type: str, top_k: int)` - Search a collection
- `delete_collection(name: str)` - Delete a collection
- `list_collections()` - List all collections
- `get_collection_stats(name: str)` - Get collection statistics

### FaissStore
- `create_collection(name: str)` - Create a new collection
- `add_nodes(nodes: List[Node], collection_name: str)` - Add nodes to a collection
- `query_nodes(query: str, collection_name: str, top_k: int)` - Query a collection
- `save_vector_store()` - Save indexes to disk
- `load_vector_store()` - Load indexes from disk

### Configuration API

**VectorStoreConfig:**
- `from_dict(config_dict: Dict)` - Create from a dictionary
- `from_yaml(yaml_path: str)` - Load from a YAML file
- `from_json(json_path: str)` - Load from a JSON file
- `to_dict()` - Convert to a dictionary
- `save_yaml(output_path: str)` - Save to a YAML file
- `save_json(output_path: str)` - Save to a JSON file
- `validate()` - Validate the configuration

**load_config:**
- `load_config(config_source: Union[str, Dict], config_type: str)` - Universal config loader
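
For illustration, a dictionary that could be fed to `from_dict` might look like the one below. The nested section and field names are inferred from options mentioned elsewhere in this README (`hybrid_alpha`, `cache_size`, `batch_insert_size`), so verify them against `config.py` before relying on them:

```python
import json

# Assumed schema, reconstructed from options discussed in this README;
# config.py holds the authoritative field names.
config_dict = {
    "embedding": {
        "model_type": "sentence_transformer",
        "model_name": "Snowflake/snowflake-arctic-embed-m",
    },
    "search": {
        "hybrid_alpha": 0.7,
        "search_buffer_factor": 2,
    },
    "performance": {
        "cache_size": 1000,
        "batch_insert_size": 100,
    },
}

# Round-trips cleanly through JSON, so the same dict can back a config file.
assert json.loads(json.dumps(config_dict)) == config_dict
print(config_dict["embedding"]["model_name"])
```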

## Migration Guide

### From PgVectorStore to ConfigurablePgVectorStore

If you're using the legacy `PgVectorStore`, here's how to migrate to the new `ConfigurablePgVectorStore`:

**Before (Legacy):**
```python
from ai_vectorstore.components.vectorstore.pg_vector_store import PgVectorStore

vector_store = PgVectorStore(
    name="my_store",
    embedding_model="Snowflake/snowflake-arctic-embed-m",
    use_embedding_api=False
)

vector_store.create_collection("docs")
vector_store.add_nodes(nodes, "docs")
results = vector_store.search(
    query="test",
    collection_name="docs",
    search_type="hybrid",
    top_k=5
)
```

**After (Configurable):**
```python
from ai_vectorstore.components.vectorstore.configurable_pg_vector_store import ConfigurablePgVectorStore
from ai_vectorstore.config import VectorStoreConfig, EmbeddingConfig

# Option 1: Simple migration with defaults
vector_store = ConfigurablePgVectorStore(name="my_store")
vector_store.setup()

# Option 2: With explicit configuration
config = VectorStoreConfig(
    embedding=EmbeddingConfig(
        model_type="sentence_transformer",
        model_name="Snowflake/snowflake-arctic-embed-m"
    )
)
vector_store = ConfigurablePgVectorStore(name="my_store", config=config)
vector_store.setup()

# Collections are auto-created
vector_store.get_or_create_collection("docs")
vector_store.add_nodes("docs", nodes)

# Search (note: different return format)
results_dict, result_nodes = vector_store.search(
    collection_name="docs",
    query="test",
    hybrid_search=True,
    number=5
)
```

**Key Differences:**
1. You must call `setup()` after initialization
2. Use `get_or_create_collection()` instead of `create_collection()`
3. `search()` returns a tuple of (results_dict, nodes_list)
4. Search parameters: `top_k` → `number`, `search_type` → `hybrid_search` (bool)
5. Configuration is centralized and supports YAML/JSON files
6. New features: `update_vector()`, `delete_nodes()`, `get_collection_info()`

## Contributing

This package is part of the Rakam Systems framework. For contributions and development:

1. Follow the existing interface patterns in `ai_core.interfaces`
2. Add comprehensive docstrings and type hints
3. Include examples in the `examples/` directory
4. Update this README with new features
5. Test with both legacy and configurable vector stores
6. Ensure backward compatibility when possible

## License

See the [LICENSE](LICENSE) file for details.

## Additional Resources

### Rakam Systems Documentation
- [Main Rakam Systems Documentation](../../README.md)
- [AI Core Interfaces](../ai_core/interfaces/)
- [AI Agents Framework](../ai_agents/)
- [Examples Directory](../../examples/ai_vectorstore_examples/)

### External Documentation
- [pgvector Documentation](https://github.com/pgvector/pgvector) - PostgreSQL vector extension
- [FAISS Documentation](https://github.com/facebookresearch/faiss) - Facebook AI Similarity Search
- [SentenceTransformers Documentation](https://www.sbert.net/) - Sentence embedding models
- [OpenAI Embeddings API](https://platform.openai.com/docs/guides/embeddings) - OpenAI embedding models
- [Cohere Embeddings API](https://docs.cohere.com/docs/embeddings) - Cohere embedding models

### Configuration Examples
- See [examples/configs/pg_vectorstore_config.yaml](../../examples/configs/pg_vectorstore_config.yaml) for an example YAML configuration
- See [examples/ai_vectorstore_examples/](../../examples/ai_vectorstore_examples/) for working examples

### Key Files
- `config.py` - Configuration system implementation
- `core.py` - Core data structures (VSFile, Node, NodeMetadata)
- `components/vectorstore/configurable_pg_vector_store.py` - Enhanced configurable vector store
- `components/vectorstore/pg_vector_store.py` - Legacy vector store
- `components/embedding_model/configurable_embeddings.py` - Pluggable embedding system

## Support and Contributing

For issues, questions, or contributions, please refer to the main Rakam Systems repository. When reporting issues:

1. Include your configuration (sanitized of secrets)
2. Provide code snippets demonstrating the issue
3. Include error messages and stack traces
4. Specify versions of key dependencies (Django, pgvector, etc.)

## Changelog

### v1.1.0 (Latest)
- Added `ConfigurablePgVectorStore` with full configuration support
- Introduced centralized configuration system (`config.py`)
- Added `ConfigurableEmbeddings` for pluggable embedding models
- Implemented `update_vector()` for in-place updates
- Added support for multiple similarity metrics
- Enhanced hybrid search with configurable alpha parameter
- Improved documentation with migration guide

### v1.0.0
- Initial release with `PgVectorStore` and `FaissStore`
- Basic hybrid search support
- Django ORM integration
- Collection-based organization