rakam-systems-vectorstore 0.1.1rc7__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (40)
  1. rakam_systems_vectorstore/MANIFEST.in +26 -0
  2. rakam_systems_vectorstore/README.md +1071 -0
  3. rakam_systems_vectorstore/__init__.py +93 -0
  4. rakam_systems_vectorstore/components/__init__.py +0 -0
  5. rakam_systems_vectorstore/components/chunker/__init__.py +19 -0
  6. rakam_systems_vectorstore/components/chunker/advanced_chunker.py +1019 -0
  7. rakam_systems_vectorstore/components/chunker/text_chunker.py +154 -0
  8. rakam_systems_vectorstore/components/embedding_model/__init__.py +0 -0
  9. rakam_systems_vectorstore/components/embedding_model/configurable_embeddings.py +546 -0
  10. rakam_systems_vectorstore/components/embedding_model/openai_embeddings.py +259 -0
  11. rakam_systems_vectorstore/components/loader/__init__.py +31 -0
  12. rakam_systems_vectorstore/components/loader/adaptive_loader.py +512 -0
  13. rakam_systems_vectorstore/components/loader/code_loader.py +699 -0
  14. rakam_systems_vectorstore/components/loader/doc_loader.py +812 -0
  15. rakam_systems_vectorstore/components/loader/eml_loader.py +556 -0
  16. rakam_systems_vectorstore/components/loader/html_loader.py +626 -0
  17. rakam_systems_vectorstore/components/loader/md_loader.py +622 -0
  18. rakam_systems_vectorstore/components/loader/odt_loader.py +750 -0
  19. rakam_systems_vectorstore/components/loader/pdf_loader.py +771 -0
  20. rakam_systems_vectorstore/components/loader/pdf_loader_light.py +723 -0
  21. rakam_systems_vectorstore/components/loader/tabular_loader.py +597 -0
  22. rakam_systems_vectorstore/components/vectorstore/__init__.py +0 -0
  23. rakam_systems_vectorstore/components/vectorstore/apps.py +10 -0
  24. rakam_systems_vectorstore/components/vectorstore/configurable_pg_vector_store.py +1661 -0
  25. rakam_systems_vectorstore/components/vectorstore/faiss_vector_store.py +878 -0
  26. rakam_systems_vectorstore/components/vectorstore/migrations/0001_initial.py +55 -0
  27. rakam_systems_vectorstore/components/vectorstore/migrations/__init__.py +0 -0
  28. rakam_systems_vectorstore/components/vectorstore/models.py +10 -0
  29. rakam_systems_vectorstore/components/vectorstore/pg_models.py +97 -0
  30. rakam_systems_vectorstore/components/vectorstore/pg_vector_store.py +827 -0
  31. rakam_systems_vectorstore/config.py +266 -0
  32. rakam_systems_vectorstore/core.py +8 -0
  33. rakam_systems_vectorstore/pyproject.toml +113 -0
  34. rakam_systems_vectorstore/server/README.md +290 -0
  35. rakam_systems_vectorstore/server/__init__.py +20 -0
  36. rakam_systems_vectorstore/server/mcp_server_vector.py +325 -0
  37. rakam_systems_vectorstore/setup.py +103 -0
  38. rakam_systems_vectorstore-0.1.1rc7.dist-info/METADATA +370 -0
  39. rakam_systems_vectorstore-0.1.1rc7.dist-info/RECORD +40 -0
  40. rakam_systems_vectorstore-0.1.1rc7.dist-info/WHEEL +4 -0
@@ -0,0 +1,1071 @@
# AI VectorStore

A modular, production-ready vector store system for semantic search and retrieval-augmented generation (RAG) applications. Part of the Rakam Systems AI framework.

## Overview

The `ai_vectorstore` package provides a comprehensive set of components for building vector-based search and retrieval systems. It supports multiple backend implementations (PostgreSQL with pgvector, FAISS) and includes all necessary components for a complete RAG pipeline.

### What's New

- ✨ **ConfigurablePgVectorStore**: Enhanced vector store with full YAML/JSON configuration support
- ⚙️ **Configuration System**: Centralized configuration management with validation and environment variable support
- 🔄 **Update Operations**: Update existing vectors, embeddings, and metadata without re-indexing
- 🔌 **Pluggable Embeddings**: Support for multiple embedding providers (SentenceTransformers, OpenAI, Cohere)
- 📊 **Enhanced Search**: Multiple similarity metrics (cosine, L2, dot product) with configurable hybrid search
- 🎯 **Better DX**: Improved developer experience with clearer APIs and comprehensive documentation

### Quick Links

- [Installation](#installation)
- [Quick Start](#quick-start)
- [Configurable Vector Store (Recommended)](#configurable-vector-store-recommended)
- [Configuration](#configuration)
- [API Reference](#api-reference)
- [Migration Guide](#migration-guide)

## Features

- **Multiple Vector Store Backends**
  - PostgreSQL with pgvector (persistent, production-ready)
  - FAISS (in-memory, high-performance)
  - Configurable vector store with YAML/JSON configuration support

- **Hybrid Search Capabilities**
  - Vector similarity search (cosine, L2, dot product)
  - Full-text search (PostgreSQL)
  - Combined hybrid search with configurable weighting
  - Adjustable alpha parameter for search balance

- **Advanced Retrieval**
  - Built-in re-ranking for improved relevance
  - Metadata filtering with Django ORM support
  - Collection-based organization
  - LRU caching for query performance optimization
  - Update operations for existing vectors

- **Flexible Embedding Options**
  - Local embedding models (SentenceTransformers)
  - OpenAI API integration
  - Cohere API support
  - Configurable embedding dimensions
  - Pluggable embedding backends

- **Complete RAG Pipeline Components**
  - Document loaders (file, adaptive)
  - Text chunkers (simple, text-based)
  - Embedding models (configurable, OpenAI)
  - Indexers (simple indexer)
  - Retrievers (basic retriever)
  - Re-rankers (model-based)

- **Configuration Management**
  - YAML/JSON configuration files
  - Environment variable overrides
  - Programmatic configuration
  - Configuration validation and defaults

## Architecture

The package follows a modular architecture with clear interfaces:

```
ai_vectorstore/
├── core.py                                  # Core data structures (VSFile, Node, NodeMetadata)
├── config.py                                # Configuration management system
├── components/
│   ├── vectorstore/                         # Vector store implementations
│   │   ├── pg_vector_store.py               # PostgreSQL backend
│   │   ├── configurable_pg_vector_store.py  # Enhanced configurable backend
│   │   ├── faiss_vector_store.py            # FAISS backend
│   │   ├── pg_models.py                     # Django ORM models
│   │   └── migrations/                      # Database migrations
│   ├── chunker/                             # Text chunking components
│   │   ├── simple_chunker.py                # Basic text chunker
│   │   └── text_chunker.py                  # Advanced text chunker
│   ├── embedding_model/                     # Embedding generation
│   │   ├── configurable_embeddings.py       # Pluggable embedding system
│   │   └── openai_embeddings.py             # OpenAI API embeddings
│   ├── indexer/                             # Document indexing
│   │   └── simple_indexer.py                # Basic indexer
│   ├── loader/                              # Document loading
│   │   ├── file_loader.py                   # File loading
│   │   └── adaptive_loader.py               # Adaptive loading
│   ├── reranker/                            # Result re-ranking
│   │   └── model_reranker.py                # Model-based reranker
│   └── retriever/                           # Search and retrieval
│       └── basic_retriever.py               # Basic retriever
└── server/                                  # MCP server implementation
    └── mcp_server_vector.py
```

## Installation

### As Part of Rakam Systems (Recommended)

```bash
cd app/rakam_systems

# Full AI Vectorstore with all features
pip install -e ".[ai-vectorstore]"
```

### Standalone Installation

For granular control over dependencies:

```bash
cd app/rakam_systems/rakam_systems/ai_vectorstore

# Core only (minimal)
pip install -e .

# With PostgreSQL backend
pip install -e ".[postgres]"

# With FAISS backend
pip install -e ".[faiss]"

# With local embeddings (SentenceTransformers)
pip install -e ".[local-embeddings]"

# With OpenAI or Cohere embeddings
pip install -e ".[openai]"  # or ".[cohere]"

# With document loaders
pip install -e ".[loaders]"

# Everything
pip install -e ".[all]"
```

**📖 See [INSTALLATION.md](INSTALLATION.md) for the complete installation guide** | **⚡ [QUICK_INSTALL.md](QUICK_INSTALL.md) for a quick reference**

## Quick Start

### PostgreSQL Vector Store

```python
from ai_vectorstore.components.vectorstore.pg_vector_store import PgVectorStore
from ai_vectorstore.core import Node, NodeMetadata, VSFile

# Initialize the vector store
vector_store = PgVectorStore(
    name="my_vector_store",
    embedding_model="Snowflake/snowflake-arctic-embed-m",
    use_embedding_api=False  # Use local model
)

# Create a collection
collection_name = "documents"
vector_store.create_collection(collection_name)

# Add documents
nodes = [
    Node(
        content="Your document content here",
        metadata=NodeMetadata(
            source_file_uuid="file-uuid",
            position=0,
            custom={"title": "Document Title"}
        )
    )
]
vector_store.add_nodes(nodes, collection_name)

# Search
results = vector_store.search(
    query="Your search query",
    collection_name=collection_name,
    top_k=5,
    search_type="hybrid"  # or "vector", "fts"
)
```

### FAISS Vector Store

```python
from ai_vectorstore.components.vectorstore.faiss_vector_store import FaissStore
from ai_vectorstore.core import Node, NodeMetadata

# Initialize FAISS store
faiss_store = FaissStore(
    name="my_faiss_store",
    base_index_path="./faiss_indexes",
    embedding_model="Snowflake/snowflake-arctic-embed-m"
)

# Create collection and add nodes
collection_name = "documents"
faiss_store.create_collection(collection_name)

nodes = [
    Node(
        content="Your content here",
        metadata=NodeMetadata(
            source_file_uuid="file-uuid",
            position=0
        )
    )
]
faiss_store.add_nodes(nodes, collection_name)

# Search
results = faiss_store.query_nodes(
    query="search query",
    collection_name=collection_name,
    top_k=5
)
```

## Configurable Vector Store (Recommended)

The `ConfigurablePgVectorStore` is an enhanced, production-ready vector store that supports configuration via YAML/JSON files or dictionaries. This is the recommended approach for production deployments.

### Configuration System

The configuration system provides a unified way to manage all vector store settings:

```python
from ai_vectorstore.config import VectorStoreConfig, EmbeddingConfig, SearchConfig
from ai_vectorstore.components.vectorstore.configurable_pg_vector_store import ConfigurablePgVectorStore

# Option 1: Use defaults
vector_store = ConfigurablePgVectorStore()

# Option 2: Load from YAML file
vector_store = ConfigurablePgVectorStore(
    name="my_store",
    config="/path/to/config.yaml"
)

# Option 3: Programmatic configuration
config = VectorStoreConfig(
    name="custom_store",
    embedding=EmbeddingConfig(
        model_type="sentence_transformer",
        model_name="Snowflake/snowflake-arctic-embed-m",
        batch_size=32
    ),
    search=SearchConfig(
        similarity_metric="cosine",
        default_top_k=5,
        enable_hybrid_search=True,
        hybrid_alpha=0.7,
        rerank=True
    )
)
vector_store = ConfigurablePgVectorStore(name="my_store", config=config)
```

### YAML Configuration Example

Create a `vectorstore_config.yaml` file:

```yaml
name: production_vectorstore

embedding:
  model_type: sentence_transformer
  model_name: Snowflake/snowflake-arctic-embed-m
  batch_size: 32
  normalize: true

database:
  host: localhost
  port: 5432
  database: vectorstore_db
  user: postgres
  password: postgres

search:
  similarity_metric: cosine
  default_top_k: 5
  enable_hybrid_search: true
  hybrid_alpha: 0.7
  rerank: true
  search_buffer_factor: 2

index:
  chunk_size: 512
  chunk_overlap: 50
  batch_insert_size: 100

enable_caching: true
cache_size: 1000
enable_logging: true
log_level: INFO
```

### Using ConfigurablePgVectorStore

```python
from ai_vectorstore.components.vectorstore.configurable_pg_vector_store import ConfigurablePgVectorStore
from ai_vectorstore.core import Node, NodeMetadata

# Initialize with config
store = ConfigurablePgVectorStore(
    name="my_vectorstore",
    config="config.yaml"
)
store.setup()

# Create collection
store.get_or_create_collection("documents")

# Add nodes
nodes = [
    Node(
        content="Document content here",
        metadata=NodeMetadata(
            source_file_uuid="file-123",
            position=0,
            custom={"title": "Document Title"}
        )
    )
]
store.add_nodes("documents", nodes)

# Search with configuration defaults
results, nodes = store.search(
    collection_name="documents",
    query="search query"
)

# Override configuration at search time
results, nodes = store.search(
    collection_name="documents",
    query="search query",
    distance_type="l2",
    number=10,
    hybrid_search=False
)

# Update existing vectors
store.update_vector(
    collection_name="documents",
    node_id=1,
    new_content="Updated content",  # Will regenerate embedding
    new_metadata={"status": "reviewed"}
)

# Delete specific nodes
store.delete_nodes("documents", [1, 2, 3])

# Get collection information
info = store.get_collection_info("documents")
print(f"Collection has {info['node_count']} nodes")
```

### Configuration Classes

**EmbeddingConfig** - Embedding model configuration:
- `model_type`: "sentence_transformer", "openai", "cohere"
- `model_name`: Model identifier
- `api_key`: API key (auto-loaded from environment)
- `batch_size`: Batch size for embeddings
- `normalize`: Normalize embeddings
- `dimensions`: Embedding dimensions (auto-detected)

**DatabaseConfig** - Database connection settings:
- `host`, `port`, `database`, `user`, `password`
- `pool_size`, `max_overflow`: Connection pooling
- Auto-loads from `POSTGRES_*` environment variables

**SearchConfig** - Search behavior configuration:
- `similarity_metric`: "cosine", "l2", "dot_product"
- `default_top_k`: Default number of results
- `enable_hybrid_search`: Enable hybrid search
- `hybrid_alpha`: Vector/keyword balance (0-1)
- `rerank`: Enable re-ranking
- `search_buffer_factor`: Buffer for re-ranking

**IndexConfig** - Indexing configuration:
- `chunk_size`, `chunk_overlap`: Chunking parameters
- `enable_parallel_processing`: Parallel processing
- `parallel_workers`: Number of workers
- `batch_insert_size`: Batch size for inserts

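The effect of `chunk_size` and `chunk_overlap` can be pictured with a minimal sketch. This is not the package's chunker (the real chunkers work on tokens, not characters); `chunk_text` is a hypothetical helper shown only to illustrate how overlap shifts the window:

```python
# Illustrative only: fixed-size chunking with overlap, character-based.
# The real chunkers in this package operate on tokens.

def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list[str]:
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("a" * 1000, chunk_size=400, chunk_overlap=100)
print(len(chunks))  # 4: windows start at 0, 300, 600, 900
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so context that straddles a boundary is still retrievable from at least one chunk.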
## Core Data Structures

### VSFile
Represents a source file to be processed:
```python
from ai_vectorstore.core import VSFile

file = VSFile(file_path="/path/to/document.pdf")
# Automatically extracts filename, MIME type, and UUID
```

### Node
Represents a chunk of content with metadata:
```python
from ai_vectorstore.core import Node, NodeMetadata

node = Node(
    content="Document chunk text",
    metadata=NodeMetadata(
        source_file_uuid="file-uuid",
        position=0,  # page number or chunk position
        custom={"author": "John Doe", "date": "2025-01-01"}
    )
)
```

### NodeMetadata
Stores metadata about each node:
- `node_id`: Unique identifier (auto-assigned)
- `source_file_uuid`: Reference to the source file
- `position`: Position in the source (page number, chunk index, etc.)
- `custom`: Dictionary for arbitrary metadata

## PostgreSQL Backend

### Features
- ✅ Persistent storage with ACID transactions
- ✅ Hybrid search (vector + full-text)
- ✅ Built-in re-ranking with cross-encoder models
- ✅ Metadata filtering and custom queries
- ✅ Collection-based organization
- ✅ Query result caching (LRU)
- ✅ Django ORM integration
- ✅ Update operations (content, embeddings, metadata)
- ✅ Multiple similarity metrics (cosine, L2, dot product)
- ✅ Configuration-driven architecture

### Setup

1. **Start PostgreSQL with pgvector:**
   ```bash
   docker run -d \
     --name postgres-vectorstore \
     -e POSTGRES_PASSWORD=postgres \
     -e POSTGRES_DB=vectorstore_db \
     -p 5432:5432 \
     pgvector/pgvector:pg16
   ```

2. **Configure Django settings:**
   ```python
   import os
   os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'your_app.settings')

   # In your Django settings:
   DATABASES = {
       'default': {
           'ENGINE': 'django.db.backends.postgresql',
           'NAME': 'vectorstore_db',
           'USER': 'postgres',
           'PASSWORD': 'postgres',
           'HOST': 'localhost',
           'PORT': '5432',
       }
   }

   INSTALLED_APPS = [
       'ai_vectorstore.components.vectorstore',
       # ... other apps
   ]
   ```

3. **Run migrations:**
   ```bash
   python manage.py migrate
   ```

### Search Types

**Vector Search** - Pure semantic similarity:
```python
results = vector_store.search(
    query="machine learning",
    collection_name="docs",
    search_type="vector",
    top_k=5
)
```

**Full-Text Search** - Traditional keyword matching:
```python
results = vector_store.search(
    query="exact phrase match",
    collection_name="docs",
    search_type="fts",
    top_k=5
)
```

**Hybrid Search** - Combined approach (best results):
```python
results = vector_store.search(
    query="machine learning algorithms",
    collection_name="docs",
    search_type="hybrid",
    alpha=0.5,  # 0 = pure FTS, 1 = pure vector
    top_k=5,
    rerank=True  # Enable cross-encoder re-ranking
)
```

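The `alpha` weighting can be pictured as a simple convex combination of the two scores. This is a sketch of the standard formula, not the library's internal implementation, and it assumes both scores are already normalized to [0, 1]:

```python
# Illustrative sketch of alpha-weighted hybrid scoring (not the library's code).
# Assumes vector and full-text scores are normalized to [0, 1].

def hybrid_score(vector_score: float, fts_score: float, alpha: float = 0.5) -> float:
    """Blend semantic and keyword scores: alpha=1 is pure vector, alpha=0 is pure FTS."""
    return alpha * vector_score + (1 - alpha) * fts_score

# A document that matches keywords strongly but is semantically weaker:
print(hybrid_score(0.4, 0.9, alpha=0.5))  # ~0.65, balanced blend
print(hybrid_score(0.4, 0.9, alpha=1.0))  # 0.4, pure vector score
```

Raising `alpha` therefore favors documents that are semantically close even when they share few exact keywords with the query.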
### Advanced Features

**Metadata Filtering:**
```python
# Custom filtering with Django ORM
from ai_vectorstore.components.vectorstore.pg_models import NodeEntry

filtered_results = vector_store.search(
    query="query text",
    collection_name="docs",
    filter_func=lambda qs: qs.filter(
        metadata__custom__author="John Doe"
    )
)
```

**Batch Operations:**
```python
# Batch add nodes
vector_store.add_nodes_batch(
    nodes_list=all_nodes,
    collection_name="docs",
    batch_size=100
)
```

**Collection Management:**
```python
# List collections
collections = vector_store.list_collections()

# Delete collection
vector_store.delete_collection("old_collection")

# Get collection stats (PgVectorStore)
stats = vector_store.get_collection_stats("docs")

# Get collection info (ConfigurablePgVectorStore)
info = vector_store.get_collection_info("docs")
print(f"Nodes: {info['node_count']}, Dimensions: {info['embedding_dim']}")
```

**Update Operations (ConfigurablePgVectorStore):**
```python
from ai_vectorstore.components.vectorstore.configurable_pg_vector_store import ConfigurablePgVectorStore

store = ConfigurablePgVectorStore(config="config.yaml")

# Update content (will regenerate embedding automatically)
store.update_vector(
    collection_name="docs",
    node_id=123,
    new_content="Updated document content",
    new_metadata={"status": "reviewed", "version": 2}
)

# Update only embedding (e.g., with improved model)
new_embedding = embedding_model.encode("document content")
store.update_vector(
    collection_name="docs",
    node_id=123,
    new_embedding=new_embedding
)

# Update only metadata
store.update_vector(
    collection_name="docs",
    node_id=123,
    new_metadata={"priority": "high"}
)

# Delete specific nodes
store.delete_nodes(collection_name="docs", node_ids=[123, 456, 789])
```

## FAISS Backend

### Features
- ✅ In-memory, extremely fast search
- ✅ No database required
- ✅ Persistent index saving/loading
- ✅ Collection-based organization
- ✅ Perfect for prototyping and development

### Usage

```python
from ai_vectorstore.components.vectorstore.faiss_vector_store import FaissStore

# Initialize
store = FaissStore(
    base_index_path="./indexes",
    embedding_model="Snowflake/snowflake-arctic-embed-m"
)

# Operations are similar to the PostgreSQL backend
store.create_collection("docs")
store.add_nodes(nodes, "docs")
results = store.query_nodes("query", "docs", top_k=5)

# Save/load indexes
store.save_vector_store()  # Auto-saves to base_index_path
store.load_vector_store()  # Loaded automatically on init unless initialising=True
```

## Configuration

### Environment Variables

The configuration system automatically loads values from environment variables:

```bash
# Embedding API keys
export OPENAI_API_KEY="your-openai-api-key"
export COHERE_API_KEY="your-cohere-api-key"

# PostgreSQL connection (auto-loaded by DatabaseConfig)
export POSTGRES_DB="vectorstore_db"
export POSTGRES_USER="postgres"
export POSTGRES_PASSWORD="postgres"
export POSTGRES_HOST="localhost"
export POSTGRES_PORT="5432"

# Django settings
export DJANGO_SETTINGS_MODULE="your_app.settings"
```

### Loading and Saving Configurations

```python
from ai_vectorstore.config import VectorStoreConfig, load_config

# Load from file (auto-detects format)
config = load_config("config.yaml")  # or config.json

# Load from dictionary
config_dict = {
    "name": "my_store",
    "embedding": {"model_name": "custom-model"},
    "search": {"default_top_k": 10}
}
config = load_config(config_dict)

# Validate configuration
config.validate()

# Save to file
config.save_yaml("output_config.yaml")
config.save_json("output_config.json")

# Convert to dictionary
config_dict = config.to_dict()
```

### Embedding Models

**Local Models (SentenceTransformers):**
- `Snowflake/snowflake-arctic-embed-m` (default, 768-dim, recommended)
- `sentence-transformers/all-MiniLM-L6-v2` (384-dim, fast)
- `BAAI/bge-large-en-v1.5` (1024-dim, high quality)
- `sentence-transformers/all-mpnet-base-v2` (768-dim, balanced)

**OpenAI API:**
- `text-embedding-3-small` (1536-dim)
- `text-embedding-3-large` (3072-dim)
- `text-embedding-ada-002` (1536-dim, legacy)

**Cohere API:**
- `embed-english-v3.0`
- `embed-multilingual-v3.0`

**Legacy PgVectorStore (deprecated):**
```python
# Use a local model
vector_store = PgVectorStore(
    embedding_model="Snowflake/snowflake-arctic-embed-m",
    use_embedding_api=False
)

# Use the OpenAI API
vector_store = PgVectorStore(
    use_embedding_api=True,
    api_model="text-embedding-3-small"
)
```

**ConfigurablePgVectorStore (recommended):**
```python
from ai_vectorstore.config import VectorStoreConfig, EmbeddingConfig
from ai_vectorstore.components.vectorstore.configurable_pg_vector_store import ConfigurablePgVectorStore

# Local SentenceTransformer model
config = VectorStoreConfig(
    embedding=EmbeddingConfig(
        model_type="sentence_transformer",
        model_name="Snowflake/snowflake-arctic-embed-m",
        batch_size=32,
        normalize=True
    )
)
vector_store = ConfigurablePgVectorStore(config=config)

# OpenAI API
config = VectorStoreConfig(
    embedding=EmbeddingConfig(
        model_type="openai",
        model_name="text-embedding-3-small",
        api_key="your-api-key"  # Or set the OPENAI_API_KEY env var
    )
)
vector_store = ConfigurablePgVectorStore(config=config)

# Cohere API
config = VectorStoreConfig(
    embedding=EmbeddingConfig(
        model_type="cohere",
        model_name="embed-english-v3.0",
        api_key="your-api-key"  # Or set the COHERE_API_KEY env var
    )
)
vector_store = ConfigurablePgVectorStore(config=config)
```

## Examples

Comprehensive examples are available in the `examples/ai_vectorstore_examples/` directory:

- **`postgres_vectorstore_example.py`** - Full PostgreSQL implementation with hybrid search
- **`basic_faiss_example.py`** - FAISS in-memory vector store
- **`run_postgres_example.sh`** - One-command setup script

See [examples/ai_vectorstore_examples/README.md](../../examples/ai_vectorstore_examples/README.md) for detailed documentation.

## Performance Considerations

### PostgreSQL
- **Indexing**: Uses pgvector's HNSW (Hierarchical Navigable Small World) index for fast approximate nearest neighbor search
- **Caching**: LRU cache for query results (configurable cache size)
- **Batch Operations**: Use `add_nodes_batch()` for bulk inserts
- **Connection Pooling**: Configure Django database connection pooling for production

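The LRU caching behavior can be pictured with the standard library's `functools.lru_cache`. This is a sketch of the idea, not the store's internal cache; `cached_search` and its body are hypothetical stand-ins for the expensive database call:

```python
# Sketch of LRU query caching using the stdlib; the real store configures
# this behavior via enable_caching / cache_size.
from functools import lru_cache

@lru_cache(maxsize=1000)  # mirrors cache_size: 1000
def cached_search(query: str, collection_name: str, top_k: int = 5):
    # In real code this would be the expensive call, e.g. vector_store.search(...)
    return f"results for {query!r} in {collection_name}"

cached_search("machine learning", "docs")
cached_search("machine learning", "docs")  # identical arguments: served from cache
print(cached_search.cache_info().hits)  # 1
```

Note that an LRU cache keyed on arguments only works for deterministic queries; any write to the collection should invalidate the cache (e.g. via `cache_clear()`).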
### FAISS
- **Index Types**: Uses IndexFlatL2 by default for exact search; can be customized for larger datasets
- **Memory Usage**: The entire index lives in memory; monitor RAM usage with large datasets
- **Persistence**: Save/load operations can be expensive; use them sparingly

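Conceptually, a flat L2 index is just exact brute-force nearest-neighbor search. The sketch below shows the semantics with plain Python lists (FAISS does the same over contiguous float32 arrays with SIMD); `flat_l2_search` is an illustrative name, not a package API:

```python
# What IndexFlatL2 computes, conceptually: exact nearest neighbors by L2 distance.
import math

def flat_l2_search(index: list[list[float]], query: list[float], top_k: int = 2):
    dists = [
        (i, math.dist(vec, query))  # Euclidean (L2) distance
        for i, vec in enumerate(index)
    ]
    return sorted(dists, key=lambda pair: pair[1])[:top_k]

index = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
print(flat_l2_search(index, [0.9, 0.1], top_k=2))  # vector 1 is nearest
```

Because every vector is compared against the query, results are exact but the cost grows linearly with the index size, which is why approximate indexes exist for larger datasets.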
## Best Practices

1. **Choose the Right Backend:**
   - Use `ConfigurablePgVectorStore` for production deployments (recommended)
   - Use the legacy `PgVectorStore` for backward compatibility
   - Use FAISS for prototyping, maximum speed, and memory-based applications
   - Leverage configuration files for reproducible deployments

2. **Configuration Management:**
   - Use YAML/JSON configuration files for production
   - Store configurations in version control
   - Use environment variables for secrets (API keys, passwords)
   - Validate configurations before deployment
   - Document configuration changes

3. **Optimize Embedding Models:**
   - Start with `Snowflake/snowflake-arctic-embed-m` for balanced performance
   - Use smaller models (384-dim) for speed-critical applications
   - Use larger models (1024-dim+) for maximum accuracy
   - Consider API-based models (OpenAI, Cohere) for convenience
   - Batch-encode documents for better throughput

4. **Leverage Hybrid Search:**
   - Use `hybrid_alpha=0.7` as a starting point (favors vector search)
   - Tune alpha based on your specific use case:
     - 0.9-1.0: Pure semantic search
     - 0.5-0.7: Balanced approach
     - 0.0-0.3: Keyword-focused search
   - Enable re-ranking for best results (adds latency but improves relevance)
   - Set `search_buffer_factor=2` to retrieve more candidates for re-ranking

5. **Metadata Design:**
   - Store searchable metadata in custom fields
   - Use a consistent metadata structure across documents
   - Index frequently queried fields
   - Include version and timestamp metadata
   - Use metadata for filtering and access control

6. **Chunking Strategy:**
   - Keep chunks between 200-500 tokens for optimal retrieval
   - Include overlap (50-100 tokens) between chunks for context preservation
   - Store chunk position for result ordering
   - Consider semantic chunking for better coherence
   - Configure `chunk_size` and `chunk_overlap` in IndexConfig

7. **Update and Maintenance:**
   - Use `update_vector()` for content corrections without re-indexing
   - Periodically update embeddings with improved models
   - Monitor and clean up outdated nodes
   - Use batch operations for large-scale updates
   - Keep track of node IDs for efficient updates

8. **Performance Optimization:**
   - Enable caching with an appropriate `cache_size` (default: 1000)
   - Use batch inserts with an optimal `batch_insert_size` (default: 100)
   - Monitor database connection pool settings
   - Consider parallel processing for large datasets
   - Use the appropriate similarity metric for your use case:
     - `cosine`: Normalized vectors (recommended for most cases)
     - `l2`: Euclidean distance (good for non-normalized vectors)
     - `dot_product`: Fast, but requires normalized vectors

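The relationship between the three similarity metrics is easiest to see on a tiny example. For unit-norm vectors, cosine similarity and dot product coincide, and L2 distance is a monotonic function of cosine, which is why `cosine` is the safe default:

```python
# The three metrics side by side on plain Python lists (illustrative helpers,
# not package APIs).
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.hypot(*a) * math.hypot(*b))

def l2(a, b):
    return math.dist(a, b)

a, b = [1.0, 0.0], [math.sqrt(0.5), math.sqrt(0.5)]  # both unit-norm
print(cosine(a, b))  # ~0.7071
print(dot(a, b))     # same value, since both vectors are normalized
print(l2(a, b))      # sqrt(2 - 2*cos) for unit vectors, ~0.7654
```

With non-normalized vectors the three metrics can rank results differently, so pick one and make sure your embeddings match its assumptions (e.g. set `normalize: true` when using `cosine` or `dot_product`).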
## Integration with RAG Pipeline

```python
from ai_vectorstore.components.loader.file_loader import FileLoader
from ai_vectorstore.components.chunker.simple_chunker import SimpleChunker
from ai_vectorstore.components.embedding_model.openai_embeddings import OpenAIEmbeddings
from ai_vectorstore.components.vectorstore.pg_vector_store import PgVectorStore
from ai_vectorstore.components.retriever.basic_retriever import BasicRetriever

# 1. Load documents
loader = FileLoader()
documents = loader.load("/path/to/documents")

# 2. Chunk documents
chunker = SimpleChunker()
chunks = chunker.run(documents)

# 3. Generate embeddings and store
vector_store = PgVectorStore()
vector_store.create_collection("knowledge_base")
vector_store.add_nodes(chunks, "knowledge_base")

# 4. Retrieve relevant context
retriever = BasicRetriever(vector_store=vector_store)
relevant_docs = retriever.retrieve(
    query="user question",
    collection_name="knowledge_base",
    top_k=5
)

# 5. Use with your LLM
# context = "\n\n".join([doc.content for doc in relevant_docs])
# response = llm.generate(prompt + context)
```

## Troubleshooting

### PostgreSQL Connection Issues
```bash
# Check if PostgreSQL is running
docker ps | grep postgres

# Check logs
docker logs postgres-vectorstore

# Verify the connection
psql -h localhost -U postgres -d vectorstore_db
```

### Embedding Model Issues
```bash
# Clear the cache and re-download
rm -rf ~/.cache/huggingface/hub

# Verify GPU availability (for GPU models)
python -c "import torch; print(torch.cuda.is_available())"
```

### Migration Issues
```bash
# Reset migrations (development only!)
python manage.py migrate ai_vectorstore zero
python manage.py migrate
```

883
## API Reference

### ConfigurablePgVectorStore (Recommended)

**Initialization:**
- `__init__(name: str, config: Union[VectorStoreConfig, Dict, str])` - Initialize with configuration
- `setup()` - Set up resources and connections

**Collection Management:**
- `get_or_create_collection(collection_name: str, embedding_dim: int)` - Get or create a collection
- `create_collection_from_nodes(collection_name: str, nodes: List[Node])` - Create a collection from nodes
- `create_collection_from_files(collection_name: str, files: List[VSFile])` - Create a collection from files
- `list_collections()` - List all collections
- `get_collection_info(collection_name: str)` - Get collection information
- `delete_collection(collection_name: str)` - Delete a collection

**Node Operations:**
- `add_nodes(collection_name: str, nodes: List[Node])` - Add nodes to a collection
- `update_vector(collection_name: str, node_id: int, new_content: str, new_embedding: List[float], new_metadata: Dict)` - Update an existing vector
- `delete_nodes(collection_name: str, node_ids: List[int])` - Delete specific nodes

**Search:**
- `search(collection_name: str, query: str, distance_type: str, number: int, meta_data_filters: Dict, hybrid_search: bool)` - Search a collection, with hybrid support
- `query(vector: List[float], top_k: int, **kwargs)` - Query by raw vector (VectorStore interface)
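Conceptually, `meta_data_filters` restricts candidates by exact metadata match before similarity ranking. A pure-Python sketch of that semantics (the real store applies the filter in SQL, and its matching rules may be richer):

```python
def apply_metadata_filters(rows, filters):
    """Keep rows whose metadata matches every filter key exactly."""
    return [
        row for row in rows
        if all(row["metadata"].get(key) == value for key, value in filters.items())
    ]

rows = [
    {"content": "intro",     "metadata": {"source": "guide.pdf", "page": 1}},
    {"content": "api notes", "metadata": {"source": "ref.md",    "page": 3}},
]

# Only rows whose metadata has source == "ref.md" survive filtering
print(apply_metadata_filters(rows, {"source": "ref.md"}))
```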

**Lifecycle:**
- `shutdown()` - Clean up and shut down

### PgVectorStore (Legacy)
- `create_collection(name: str)` - Create a new collection
- `add_nodes(nodes: List[Node], collection_name: str)` - Add nodes to a collection
- `search(query: str, collection_name: str, search_type: str, top_k: int)` - Search a collection
- `delete_collection(name: str)` - Delete a collection
- `list_collections()` - List all collections
- `get_collection_stats(name: str)` - Get collection statistics

### FaissStore
- `create_collection(name: str)` - Create a new collection
- `add_nodes(nodes: List[Node], collection_name: str)` - Add nodes to a collection
- `query_nodes(query: str, collection_name: str, top_k: int)` - Query a collection
- `save_vector_store()` - Save indexes to disk
- `load_vector_store()` - Load indexes from disk

### Configuration API

**VectorStoreConfig:**
- `from_dict(config_dict: Dict)` - Create from a dictionary
- `from_yaml(yaml_path: str)` - Load from a YAML file
- `from_json(json_path: str)` - Load from a JSON file
- `to_dict()` - Convert to a dictionary
- `save_yaml(output_path: str)` - Save to a YAML file
- `save_json(output_path: str)` - Save to a JSON file
- `validate()` - Validate the configuration

**load_config:**
- `load_config(config_source: Union[str, Dict], config_type: str)` - Universal config loader
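For reference, a YAML file consumed by `VectorStoreConfig.from_yaml()` could look like the following (a minimal sketch using only the fields shown elsewhere in this README; see `examples/configs/pg_vectorstore_config.yaml` for the authoritative example):

```yaml
embedding:
  model_type: sentence_transformer
  model_name: Snowflake/snowflake-arctic-embed-m
```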

## Migration Guide

### From PgVectorStore to ConfigurablePgVectorStore

If you're using the legacy `PgVectorStore`, here's how to migrate to the new `ConfigurablePgVectorStore`:

**Before (Legacy):**
```python
from ai_vectorstore.components.vectorstore.pg_vector_store import PgVectorStore

vector_store = PgVectorStore(
    name="my_store",
    embedding_model="Snowflake/snowflake-arctic-embed-m",
    use_embedding_api=False
)

vector_store.create_collection("docs")
vector_store.add_nodes(nodes, "docs")
results = vector_store.search(
    query="test",
    collection_name="docs",
    search_type="hybrid",
    top_k=5
)
```

**After (Configurable):**
```python
from ai_vectorstore.components.vectorstore.configurable_pg_vector_store import ConfigurablePgVectorStore
from ai_vectorstore.config import VectorStoreConfig, EmbeddingConfig

# Option 1: Simple migration with defaults
vector_store = ConfigurablePgVectorStore(name="my_store")
vector_store.setup()

# Option 2: With explicit configuration
config = VectorStoreConfig(
    embedding=EmbeddingConfig(
        model_type="sentence_transformer",
        model_name="Snowflake/snowflake-arctic-embed-m"
    )
)
vector_store = ConfigurablePgVectorStore(name="my_store", config=config)
vector_store.setup()

# Collections are auto-created
vector_store.get_or_create_collection("docs")
vector_store.add_nodes("docs", nodes)

# Search (note: different return format)
results_dict, result_nodes = vector_store.search(
    collection_name="docs",
    query="test",
    hybrid_search=True,
    number=5
)
```

**Key Differences:**
1. You must call `setup()` after initialization
2. Use `get_or_create_collection()` instead of `create_collection()`
3. `search()` returns a tuple of `(results_dict, nodes_list)`
4. Search parameters: `top_k` → `number`, `search_type` → `hybrid_search` (bool)
5. Configuration is centralized and supports YAML/JSON files
6. New features: `update_vector()`, `delete_nodes()`, `get_collection_info()`
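While migrating call sites, differences 3 and 4 can be bridged with a small shim that translates legacy keyword arguments into the configurable API's (hypothetical helper, not part of the package):

```python
def to_configurable_search_kwargs(*, query, collection_name,
                                  search_type="vector", top_k=5):
    """Map legacy PgVectorStore.search() kwargs onto ConfigurablePgVectorStore.search()."""
    return {
        "collection_name": collection_name,
        "query": query,
        "hybrid_search": search_type == "hybrid",  # string flag becomes a bool
        "number": top_k,                           # top_k is renamed to number
    }

kwargs = to_configurable_search_kwargs(
    query="test", collection_name="docs", search_type="hybrid", top_k=5
)
print(kwargs)
# Then: results_dict, result_nodes = vector_store.search(**kwargs)
```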

## Contributing

This package is part of the Rakam Systems framework. For contributions and development:

1. Follow the existing interface patterns in `ai_core.interfaces`
2. Add comprehensive docstrings and type hints
3. Include examples in the `examples/` directory
4. Update this README with new features
5. Test with both the legacy and configurable vector stores
6. Ensure backward compatibility when possible

## License

See the [LICENSE](LICENSE) file for details.

## Additional Resources

### Rakam Systems Documentation
- [Main Rakam Systems Documentation](../../README.md)
- [AI Core Interfaces](../ai_core/interfaces/)
- [AI Agents Framework](../ai_agents/)
- [Examples Directory](../../examples/ai_vectorstore_examples/)

### External Documentation
- [pgvector Documentation](https://github.com/pgvector/pgvector) - PostgreSQL vector extension
- [FAISS Documentation](https://github.com/facebookresearch/faiss) - Facebook AI Similarity Search
- [SentenceTransformers Documentation](https://www.sbert.net/) - Sentence embedding models
- [OpenAI Embeddings API](https://platform.openai.com/docs/guides/embeddings) - OpenAI embedding models
- [Cohere Embeddings API](https://docs.cohere.com/docs/embeddings) - Cohere embedding models

### Configuration Examples
- See [examples/configs/pg_vectorstore_config.yaml](../../examples/configs/pg_vectorstore_config.yaml) for an example YAML configuration
- See [examples/ai_vectorstore_examples/](../../examples/ai_vectorstore_examples/) for working examples

### Key Files
- `config.py` - Configuration system implementation
- `core.py` - Core data structures (VSFile, Node, NodeMetadata)
- `components/vectorstore/configurable_pg_vector_store.py` - Enhanced configurable vector store
- `components/vectorstore/pg_vector_store.py` - Legacy vector store
- `components/embedding_model/configurable_embeddings.py` - Pluggable embedding system

## Support and Contributing

For issues, questions, or contributions, please refer to the main Rakam Systems repository. When reporting issues:

1. Include your configuration (sanitized of secrets)
2. Provide code snippets demonstrating the issue
3. Include error messages and stack traces
4. Specify the versions of key dependencies (Django, pgvector, etc.)

## Changelog

### v1.1.0 (Latest)
- Added `ConfigurablePgVectorStore` with full configuration support
- Introduced a centralized configuration system (`config.py`)
- Added `ConfigurableEmbeddings` for pluggable embedding models
- Implemented `update_vector()` for in-place updates
- Added support for multiple similarity metrics
- Enhanced hybrid search with a configurable alpha parameter
- Improved documentation with a migration guide

### v1.0.0
- Initial release with `PgVectorStore` and `FaissStore`
- Basic hybrid search support
- Django ORM integration
- Collection-based organization