PyPI - simplevecdb - Versions diffs - 2.3.0__tar.gz → 2.5.0__tar.gz - Mend

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (119)

{simplevecdb-2.3.0/docs → simplevecdb-2.5.0}/CHANGELOG.md RENAMED

@@ -5,106 +5,202 @@ All notable changes to SimpleVecDB will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
-## [2.2.0] - 2026-01-17
+## [2.5.0] - 2026-04-07
 ### Added
-- **Vector Clustering & Auto-Tagging** - Discover natural groupings in embeddings
-  - `VectorCollection.cluster()` - Cluster documents by semantic similarity
-    - **K-means**: Classic centroid-based clustering for balanced clusters
-    - **MiniBatch K-means**: Scalable variant for large datasets (default)
-    - **HDBSCAN**: Density-based clustering that auto-discovers cluster count
-  - `VectorCollection.auto_tag()` - Generate descriptive tags for clusters
-    - TF-IDF method (default): Extract keywords with highest TF-IDF scores
-    - Frequency method: Extract most common words per cluster
-    - Custom callback: Implement custom tagging logic (e.g., LLM-based)
-  - `VectorCollection.assign_cluster_metadata()` - Persist cluster IDs to document metadata
-  - `VectorCollection.get_cluster_members()` - Retrieve all documents in a cluster
-- **Cluster Quality Metrics** - Evaluate clustering results
-  - `ClusterResult.inertia` - Sum of squared distances to centroids (K-means only, lower is better)
-  - `ClusterResult.silhouette_score` - Cluster separation metric (-1 to 1, higher is better)
-  - `ClusterResult.metrics()` - Get all metrics as dictionary
-- **Cluster Persistence** - Save and reuse cluster configurations
-  - `VectorCollection.save_cluster()` - Save cluster centroids and metadata to database
-  - `VectorCollection.load_cluster()` - Load saved cluster configuration
-  - `VectorCollection.list_clusters()` - List all saved cluster configurations
-  - `VectorCollection.delete_cluster()` - Delete a saved cluster configuration
-  - `VectorCollection.assign_to_cluster()` - Assign new documents to saved clusters without re-clustering
-- **Async Clustering Support** - Full async/await parity for all clustering operations
-  - `AsyncVectorCollection.cluster()`, `auto_tag()`, `assign_cluster_metadata()`, `get_cluster_members()`
-  - `AsyncVectorCollection.save_cluster()`, `load_cluster()`, `list_clusters()`, `delete_cluster()`, `assign_to_cluster()`
-- **New Dependencies** - Now included in standard installation
-  - `scikit-learn>=1.3.0` - K-means, MiniBatch K-means, silhouette score
-  - `hdbscan>=0.8.33` - Density-based clustering
-  - `sqlcipher3-binary>=0.5.0` - Encryption support (previously optional)
-  - `cryptography>=41.0` - Encryption utilities (previously optional)
-- **Documentation**
-  - New comprehensive clustering guide: `docs/guides/clustering.md`
-    - Algorithm comparison and selection guide
-    - Quality metrics interpretation
-    - Cluster persistence workflows
-    - Use cases: product categorization, topic discovery, customer segmentation, duplicate detection
-    - Best practices and troubleshooting
-  - New types reference: `docs/api/types.md`
-    - Complete `ClusterResult` API documentation
-    - `Document`, `DistanceStrategy`, `Quantization`, `ClusterAlgorithm` reference
-  - Updated README.md and docs/index.md with clustering sections
-  - Enhanced `docs/api/core.md` with clustering examples
+- **`delete_collection(name)`** — drop a collection's SQLite tables, FTS index, and usearch file in one call. Available on both `VectorDB` and `AsyncVectorDB`.
+- **`store_embeddings` parameter** on `collection()` — opt into storing embedding BLOBs in SQLite (default `False`). Saves ~2x storage; MMR transparently fetches vectors from the usearch index when BLOBs are absent.
+- **`async_retry_on_lock` decorator** — async variant of `retry_on_lock` using `asyncio.sleep` instead of `time.sleep`, avoiding executor thread blocking.
+- **`file_lock` context manager** — advisory cross-process file locking (`fcntl`/`msvcrt`) for usearch index files. Prevents corruption from concurrent processes.
+- **`__repr__`** on `VectorDB`, `VectorCollection`, `AsyncVectorDB`, `AsyncVectorCollection` for debuggable string representations.
+- **FLOAT16 quantization** fully implemented in `serialize()`/`deserialize()` — was previously defined in the enum but raised `ValueError` at runtime.
+- **Pagination** on `get_documents(limit=, offset=)` and catalog methods (`find_ids_by_filter`, `find_ids_by_texts`) — previously returned unbounded result sets.
+- **Embeddings server enhancements:**
+  - Graceful shutdown with SIGTERM/SIGINT draining (10s timeout)
+  - CORS middleware with configurable origins for browser-based clients
+  - Model warm-up on startup (skip with `--no-warmup`)
+  - Input validation: rejects empty strings (422) and texts exceeding 100k chars (413)
+  - Proper `argparse` CLI with `--host`, `--port`, `--no-warmup`, `--help`
+  - Startup banner logging config summary (host, port, model, auth, rate limits)
+  - Nested token array normalization (`list[list[int]]` input format)
+  - Async executor offload for `embed_texts` (non-blocking event loop)
+  - OpenAPI version synced from package metadata
+  - Module `__init__.py` exports (`embed_texts`, `get_embedder`, `load_model`, `app`, `run_server`)
+### Fixed
+- **`delete_by_ids` ordering** — SQLite deletion now happens first (transactional, can rollback), then usearch. Previously usearch removed first, leaving orphaned catalog entries on SQLite failure.
+- **`_matches_filter` string semantics** — now uses exact equality, consistent with SQL `build_filter_clause`. Was using substring match (`value in str(meta_value)`).
+- **`list_collections`** — scans `sqlite_master` for persisted collection tables instead of returning only session-cached names. Works across reopened databases.
+- **WAL mode for encrypted databases** — `PRAGMA journal_mode=WAL` and `PRAGMA synchronous=NORMAL` now set for SQLCipher connections (was only set for unencrypted).
+- **`collection()` cache key** — includes `distance_strategy` and `quantization` in cache key (sync version). Previously cached by name only, silently ignoring differing params on cache hit.
+- **`_ensure_fts_table`** — retries up to 3 times on transient "database is locked" errors instead of permanently disabling FTS on first failure.
+- **Connection health check** — `SELECT 1` probe after connection creation; raises `RuntimeError` immediately on corrupt databases.
+### Improved
+- **Usearch batch operations** — `add()`, `remove()`, and `get()` now use batch usearch APIs instead of per-key loops. Significant speedup for large operations.
+- **Filtered search iterative deepening** — replaces fixed `k*3` overfetch with adaptive doubling (up to `k*30`). Highly selective filters now reliably return `k` results.
+- **Memory-map heuristic** — uses file size threshold (50MB) instead of inaccurate `file_size // 100` vector count estimate for mmap vs load decision.
+- **Apple chip detection** — uses `platform.processor()` instead of spawning a `sysctl` subprocess.
+### Removed
+- **Duplicate `_dim` property** — removed in favor of the public `dim` property.
+### Breaking Changes
+- String metadata filters now use exact equality (was substring match).
+- `store_embeddings` defaults to `False` — `rebuild_index()` requires `store_embeddings=True` or re-adding documents.
+## [2.4.0] - 2026-03-22
+### Added
+- **Public catalog API on VectorCollection + AsyncVectorCollection:**
+  - `get_documents(filter_dict=)` — replaces private `_catalog` access
+  - `get_embeddings_by_ids(ids)` — fetch stored embeddings
+  - `update_metadata(updates)` — batch metadata merge
+  - `count()`, `save()`, `dim` property — async wrappers
+  - `add_texts(parent_ids=, threads=)` — full param support on async
+  - `rebuild_index`, `get_children/parent/descendants/ancestors`, `set_parent` — async hierarchy API
+- **Executor injection on AsyncVectorDB** — accept optional `executor` keyword argument so consumers can share a single-threaded executor for ONNX/usearch thread safety; `close()` only shuts down executor when `_owns_executor` is True
+- **Safety constants** in `constants.py`: `SEARCH_COLLECTION_TIMEOUT`, `EXECUTOR_SHUTDOWN_TIMEOUT`, `MAX_HIERARCHY_DEPTH`
+### Fixed
+- **VectorDB.close()** now calls `conn.close()` — was leaking file descriptors when `save()` succeeded but connection was never closed
+- **VectorDB.close()** wraps `save()` in `try/finally` so `conn.close()` always runs even if index serialization fails
+- **add_documents ID recovery** uses `last_insert_rowid()` arithmetic instead of `ORDER BY id DESC LIMIT N`, which raced under concurrent inserts
+- **String metadata filter** uses exact equality (`=`) instead of `LIKE` substring match — `{"type": "doc"}` no longer matches `"markdown_doc"`
+- **update_metadata_batch** wrapped in single transaction (`with self.conn`) to prevent partial commits on crash
+- **rebuild_index** uses `if x is not None` instead of `x or default` so passing `connectivity=0` no longer silently uses the default
+- **search_collections** parallel futures now have a 30s timeout — one hung collection can no longer block the entire cross-collection search
+- **AsyncVectorDB.close()** uses `shutdown(wait=False, cancel_futures=True)` instead of blocking `shutdown(wait=True)` which could hang forever on stuck tasks
+- **Recursive CTE safety cap** — `get_descendants`/`get_ancestors` apply `MAX_HIERARCHY_DEPTH=100` when `max_depth=None` to prevent infinite recursion from parent_id cycles
+- **RateLimiter cleanup** capped to 500 evictions per call to bound lock hold time under high bucket counts
+- **HuggingFace download** now uses `etag_timeout=30` with local-cache fallback on network failure
+- **embed_texts** rejects batches over 10,000 texts to prevent unbounded CPU time
+- **retry_on_lock** adds `total_timeout=10s` budget — gives up early if cumulative sleep would exceed the budget
 ### Changed
-- **pyproject.toml**: Updated `scikit-learn` minimum version from `1.0` to `1.3.0` for improved clustering stability
+- **`__version__`** now read from package metadata via `importlib.metadata` (single source of truth in `pyproject.toml`)
+- **Upsert in usearch_index** separates conflict detection from removal for clearer flow
-### Testing
+## [2.3.0] - 2026-03-08
-- Added 26 clustering tests in `tests/unit/test_clustering.py`:
-  - 16 core clustering tests (algorithms, auto-tagging, metadata persistence, edge cases)
-  - 4 cluster metrics tests (inertia, silhouette, metrics method)
-  - 6 cluster persistence tests (save/load/list/delete/assign)
-- Added 3 async clustering tests in `tests/unit/test_async.py`
-- Total test count: 305 (up from 292)
+### Breaking Changes
-### Installation
+- **Integration dependencies are now optional.** LangChain and LlamaIndex packages are no longer installed by default. Install with `pip install simplevecdb[integrations]` to use them. Existing users upgrading from v2.2.x will see a clear ImportError with migration instructions.
-Clustering and encryption are now included by default:
+### Added
-```bash
-pip install simplevecdb
-```
+- **`[integrations]` optional extra** — Install LangChain and LlamaIndex dependencies only when needed, reducing default install footprint
+- **Runtime import guards** in integration modules with v2.3.0 migration messaging
+- **Lazy `__getattr__` loading** in `integrations/__init__.py` — integration classes are only imported when accessed
+- **Input validation guards** on search methods:
+  - `similarity_search`, `similarity_search_batch`, `keyword_search`, `hybrid_search` now reject `k <= 0`
+  - `add_texts` validates length consistency of `metadatas`, `embeddings`, `ids`, and `parent_ids` against `texts`
+- **NaN/Inf validation** for float values in metadata filters (`utils.validate_filter`)
+- **Empty list rejection** for list filter values
+- **Double-close protection** on `VectorDB` with `_closed` flag
+- **Context manager protocol** (`__enter__`/`__exit__`) on `VectorDB`
+- **Table name validation** in `check_migration` (defense-in-depth against SQL injection)
+- **Graceful per-future error handling** in `search_collections`
+- **Adaptive batch search threshold** — queries below `USEARCH_BATCH_THRESHOLD` (10) use sequential search to avoid batch overhead
-No extra installation steps required!
+### Changed
-### Example
+- **Python dev target changed to 3.12** (`.python-version`), `requires-python` remains `>= "3.10"`
+- **Version bumped to 2.3.0**
+- **Performance: MMR search vectorized** — pre-normalize embeddings once, use `sel_matrix @ emb` matrix-vector multiply instead of Python inner loop, O(1) `list.pop` replaces O(n) `list.remove`, hoist `1 - lambda_mult` loop invariant
+- **Performance: merged SQL round-trips in MMR** — new `get_documents_and_embeddings_by_ids` fetches text, metadata, and embeddings in a single query (previously two separate SELECTs)
+- **Performance: `get_parent` collapsed** from 2 sequential SELECTs to 1 self-JOIN
+- **Performance: `add_documents` ID recovery** — skip redundant `SELECT ORDER BY DESC` when explicit IDs are provided; removed unnecessary `list(texts)` copy
+- **Performance: FLOAT serialization** — `np.asarray().tobytes()` replaces `struct.pack` with per-element Python loop (single C memcpy)
+- **Performance: `np.array` → `np.asarray`** on every search and insert path to avoid unnecessary copies
+- **Performance: SQL placeholder strings** — `",".join(["?"] * len(ids))` replaces generator expression across all 9 call sites
+- **Performance: batched numpy conversion** in `add_texts` — single `np.asarray` call instead of per-item conversion
+- **Performance: compact JSON separators** in catalog serialization
+- **Performance: deduplicated `.tolist()` calls** in search engine
+- **Performance: `np.unique(ravel())`** for batch key collection in `similarity_search_batch`
+- **Performance: usearch upsert** — skip contains-check loop on empty index, cache `int(key)` once per iteration
+- **Performance: cluster table DDL** — `_cluster_table_ready` flag skips `CREATE TABLE IF NOT EXISTS` on repeated calls; cached `_cluster_table_name`
+- **`_normalize_key`** now delegates to `_derive_key` instead of duplicating PBKDF2 logic
+- **HNSW defaults** in `usearch_index.py` now sourced from `constants.py` (removed local duplicates)
+- **Collection name regex** uses `constants.COLLECTION_NAME_PATTERN` instead of hardcoded pattern
+- **`VectorDB` defaults** for `distance_strategy` and `quantization` sourced from `constants.DEFAULT_DISTANCE_STRATEGY` / `constants.DEFAULT_QUANTIZATION`
+- **`_batched` utility** moved from `core.py` to `utils.py` for reuse; now used in `catalog.py` batch updates
+- **`auto_tag`** uses `defaultdict(list)` instead of manual if-not-in pattern
+- **`import random`** hoisted to module level in `utils.py` (was inside retry loop)
+- **Streaming placeholder bug fixed** — `_process_streaming_batch` now correctly detects `None` placeholders (previously used empty list `[]`, preventing auto-embedding replacement)
+- **README updated** to document `pip install simplevecdb[integrations]` installation
+### Removed
+- LangChain and LlamaIndex packages from core `[project.dependencies]` (moved to `[project.optional-dependencies] integrations`)
+- Duplicated HNSW default constants from `usearch_index.py` (now single source in `constants.py`)
+- Unused `struct` import from `quantization.py`
+- Unused `itertools` import from `core.py`
+## [2.2.1] - 2026-01-27
-```python
-from simplevecdb import VectorDB
+### Changed
-db = VectorDB("products.db")
-collection = db.collection("items")
+- Moved integration dependencies (langchain-core, langchain-openai, llama-index) from dev to main dependencies for easier installation
+- Added bandit to dev dependencies for security linting in pre-commit
+- Cleaned up duplicate dev dependency definitions
-# Cluster documents
-result = collection.cluster(n_clusters=5, algorithm="minibatch_kmeans")
+## [2.2.0] - 2026-01-26
-# Generate tags and persist
-tags = collection.auto_tag(result, method="tfidf", n_keywords=3)
-collection.assign_cluster_metadata(result, tags)
+### Added
-# Save for fast assignment of new documents
-collection.save_cluster("categories", result, metadata={"tags": tags})
+- Version 2.2.0 release
-# Later: assign new documents without re-clustering
-new_ids = collection.add_texts(new_texts, embeddings=new_embeddings)
-collection.assign_to_cluster("categories", new_ids)
+## [2.1.0] - 2026-01-01
-# Evaluate quality
-print(f"Silhouette Score: {result.silhouette_score:.2f}")  # 0.62
-print(f"Inertia: {result.inertia:.2f}")  # 1523.45
-```
+### Added
+- **SQLCipher Encryption Support** - Full at-rest encryption for sensitive data:
+  - `VectorDB(path, encryption_key="...")` enables AES-256 page-level database encryption
+  - Uses SQLCipher for transparent SQLite encryption (PRAGMA key)
+  - Usearch index files encrypted with AES-256-GCM (`.usearch.enc`)
+  - Zero performance overhead during search (decrypt on load, encrypt on save only)
+  - Key derivation: PBKDF2-SHA256 with 480,000 iterations for passphrases
+  - Install with `pip install simplevecdb[encryption]`
+- **New encryption module** (`simplevecdb.encryption`):
+  - `create_encrypted_connection()` - SQLCipher connection factory
+  - `is_database_encrypted()` - Check if a database file is encrypted
+  - `encrypt_index_file()` / `decrypt_index_file()` - Index file encryption
+  - `EncryptionError` / `EncryptionUnavailableError` - New exception types
+- **Streaming Insert API** - Memory-efficient large-scale ingestion:
+  - `collection.add_texts_streaming(iterable)` - Process from any iterator/generator
+  - Configurable `batch_size` parameter (default: config.EMBEDDING_BATCH_SIZE)
+  - Yields `StreamingProgress` after each batch for monitoring
+  - Optional `on_progress` callback for custom logging/UI updates
+  - New types: `StreamingProgress`, `ProgressCallback`
+- **Hierarchical Document Relationships** - Parent/child document structure:
+  - `parent_ids` parameter in `add_texts()` to link documents
+  - `get_children(doc_id)` - Get direct child documents
+  - `get_parent(doc_id)` - Get parent document
+  - `get_descendants(doc_id, max_depth)` - Recursive children traversal
+  - `get_ancestors(doc_id, max_depth)` - Path to root
+  - `set_parent(doc_id, parent_id)` - Update relationships
+  - Uses SQLite recursive CTE for efficient traversal
+  - Auto-migrates existing databases (adds `parent_id` column)
+### Changed
+- `check_migration()` now gracefully handles encrypted databases (returns `needs_migration=False`)
+### Dependencies
+- New optional dependency group `[encryption]`: `sqlcipher3-binary>=0.5.0`, `cryptography>=41.0`
 ## [2.0.0] - 2025-12-23
@@ -473,6 +569,12 @@ Benchmarks on i9-13900K & RTX 4090 with 10k vectors (384-dim):
 - **Documentation**: https://coderdayton.github.io/simplevecdb/
 - **License**: MIT
+[2.4.0]: https://github.com/coderdayton/simplevecdb/releases/tag/v2.4.0
+[2.3.0]: https://github.com/coderdayton/simplevecdb/releases/tag/v2.3.0
+[2.2.1]: https://github.com/coderdayton/simplevecdb/releases/tag/v2.2.1
+[2.2.0]: https://github.com/coderdayton/simplevecdb/releases/tag/v2.2.0
+[2.1.0]: https://github.com/coderdayton/simplevecdb/releases/tag/v2.1.0
+[2.0.0]: https://github.com/coderdayton/simplevecdb/releases/tag/v2.0.0
 [1.3.0]: https://github.com/coderdayton/simplevecdb/releases/tag/v1.3.0
 [1.2.0]: https://github.com/coderdayton/simplevecdb/releases/tag/v1.2.0
 [1.1.1]: https://github.com/coderdayton/simplevecdb/releases/tag/v1.1.1
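
The 2.5.0 entry in this changelog switches `_matches_filter` from substring matching to exact equality for string values, and lists that as a breaking change. The sketch below illustrates the difference on a flat metadata dict. `matches_substring` and `matches_exact` are hypothetical helper names written for this illustration; they are not functions from simplevecdb, and the package's real filter logic may handle more cases.

```python
def matches_substring(metadata: dict, filter_dict: dict) -> bool:
    """Old semantics as described in the changelog: `value in str(meta_value)`."""
    for key, value in filter_dict.items():
        if key not in metadata:
            return False
        meta_value = metadata[key]
        if isinstance(value, str):
            # Substring test: "doc" is found inside "markdown_doc".
            if value not in str(meta_value):
                return False
        elif meta_value != value:
            return False
    return True


def matches_exact(metadata: dict, filter_dict: dict) -> bool:
    """New semantics as described in the changelog: plain equality."""
    return all(
        key in metadata and metadata[key] == value
        for key, value in filter_dict.items()
    )


doc = {"type": "markdown_doc"}
print(matches_substring(doc, {"type": "doc"}))  # True
print(matches_exact(doc, {"type": "doc"}))      # False
```

Callers that relied on the old substring behaviour would need to pass the full metadata value after upgrading.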

{simplevecdb-2.3.0 → simplevecdb-2.5.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: simplevecdb
-Version: 2.3.0
+Version: 2.5.0
 Summary: Dead-simple local vector database powered by usearch HNSW.
 Author-email: Dayton Dunbar <coderdayton14@gmail.com>
 License: MIT
@@ -43,7 +43,7 @@ SimpleVecDB brings **Chroma-like simplicity** to a single **SQLite file**. Built
 - **Zero Infrastructure** — Just a `.db` file. No Docker, no Redis, no cloud bills.
 - **Blazing Fast** — 10-100x faster search via usearch HNSW. Adaptive: brute-force for <10k vectors (perfect recall), HNSW for larger collections.
 - **Truly Portable** — Runs anywhere SQLite runs: Linux, macOS, Windows, even WASM.
-- **Async Ready** — Full async/await support for web servers and concurrent workloads.
+- **Async Ready** — Full async/await support with optional executor injection for thread-safe ONNX/usearch sharing.
 - **Batteries Included** — Optional FastAPI embeddings server + LangChain/LlamaIndex integrations via `[integrations]` extra.
 - **Production Ready** — Hybrid search (BM25 + vector), metadata filtering, multi-collection support, and automatic hardware acceleration.
@@ -169,10 +169,13 @@ hybrid = collection.hybrid_search("powerhouse cell", k=2)
 **Optional: Run embeddings server (OpenAI-compatible)**
 ```bash
-simplevecdb-server --port 8000
+simplevecdb-server --port 8000                # Default model, auto warm-up
+simplevecdb-server --host 0.0.0.0 --port 9000 # Bind to all interfaces
+simplevecdb-server --no-warmup                # Skip model preload on startup
+simplevecdb-server --help                     # Show all options
 ```
-See [Setup Guide](ENV_SETUP.md) for configuration: model registry, rate limits, API keys, CUDA optimization.
+See [Setup Guide](ENV_SETUP.md) for configuration: model registry, rate limits, API keys, CORS, CUDA optimization.
 ### Option 3: With LangChain or LlamaIndex
@@ -321,6 +324,36 @@ parent = collection.get_parent(child_ids[0])
 descendants = collection.get_descendants(parent_ids[0])
 ```
+### Document Management (v2.4+)
+Query and update documents without touching private internals:
+```python
+# Get all documents (with optional metadata filter)
+docs = collection.get_documents(filter_dict={"category": "tech"})
+for doc_id, text, metadata in docs:
+    print(f"[{doc_id}] {text[:50]}...")
+# Paginated access (v2.5+)
+page1 = collection.get_documents(limit=100)
+page2 = collection.get_documents(limit=100, offset=100)
+# Fetch stored embeddings
+embeddings = collection.get_embeddings_by_ids([1, 2, 3])
+# Batch update metadata (shallow merge)
+collection.update_metadata([
+    (1, {"reviewed": True}),
+    (2, {"reviewed": True, "score": 0.95}),
+])
+# Quick stats
+print(f"Collection has {collection.count()} documents, dim={collection.dim}")
+# Delete an entire collection (v2.5+)
+db.delete_collection("old_data")
+```
 ### Vector Clustering (v2.2+)
 Discover natural groupings in your embeddings:
@@ -359,6 +392,12 @@ Supports K-means, MiniBatch K-means, and HDBSCAN. See [Clustering Guide](https:/
 | **Document Hierarchies**  | ✅     | Parent/child relationships for chunked docs                  |
 | **Vector Clustering**     | ✅     | K-means, MiniBatch K-means, HDBSCAN with auto-tagging (v2.2+) |
 | **Cluster Persistence**   | ✅     | Save/load cluster centroids for fast assignment (v2.2+)      |
+| **Public Catalog API**    | ✅     | `get_documents`, `get_embeddings_by_ids`, `update_metadata` (v2.4+) |
+| **Executor Injection**    | ✅     | Share thread pool across async instances for ONNX safety (v2.4+) |
+| **Collection Management** | ✅     | `delete_collection()`, paginated `get_documents(limit=, offset=)` (v2.5+) |
+| **Cross-Process Safety**  | ✅     | Advisory file locking on usearch index files (v2.5+) |
+| **FLOAT16 Quantization**  | ✅     | Half-precision storage with 2x compression (v2.5+) |
+| **Embeddings Server**     | ✅     | CORS, graceful shutdown, input validation, model warm-up (v2.5+) |
 ## Performance Benchmarks
@@ -429,6 +468,11 @@ pip install torch --index-url https://download.pytorch.org/whl/cu118
 - [x] Hierarchical document relationships (parent/child)
 - [x] Cross-collection search
 - [x] Vector clustering and auto-tagging (v2.2)
+- [x] Public catalog API for document management (v2.4)
+- [x] Async executor injection for thread-safe sharing (v2.4)
+- [x] Collection management: `delete_collection()`, pagination (v2.5)
+- [x] Cross-process file locking and connection health checks (v2.5)
+- [x] Embeddings server hardening: CORS, graceful shutdown, input validation (v2.5)
 - [ ] Incremental clustering (online learning)
 - [ ] Cluster visualization exports
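
The paginated `get_documents(limit=, offset=)` calls added to the README text in this hunk can be wrapped in a generator that walks every page. This is a hedged sketch: `iter_pages` and the stub `fake_get_documents` are hypothetical and stand in for a real collection. Only the `limit` and `offset` keyword names come from the diff, and the tuple shape of each row is an assumption based on the loop in the README snippet.

```python
from typing import Callable, Iterator, Sequence


def iter_pages(fetch: Callable[..., Sequence], page_size: int = 100) -> Iterator[Sequence]:
    """Yield successive pages until a short or empty page signals the end."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    offset = 0
    while True:
        page = fetch(limit=page_size, offset=offset)
        if not page:
            return
        yield page
        if len(page) < page_size:
            return  # a short page means there is nothing left to fetch
        offset += page_size


# Stub standing in for collection.get_documents (assumption: it returns a
# list of (doc_id, text, metadata) tuples).
_ROWS = [(i, f"text {i}", {}) for i in range(1, 251)]


def fake_get_documents(limit: int, offset: int = 0):
    return _ROWS[offset : offset + limit]


pages = list(iter_pages(fake_get_documents, page_size=100))
print([len(p) for p in pages])  # [100, 100, 50]
```

With a real collection, passing `collection.get_documents` as `fetch` should work in the same way, provided it accepts those two keyword arguments as the diff indicates.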

{simplevecdb-2.3.0 → simplevecdb-2.5.0}/README.md RENAMED

@@ -14,7 +14,7 @@ SimpleVecDB brings **Chroma-like simplicity** to a single **SQLite file**. Built
 - **Zero Infrastructure** — Just a `.db` file. No Docker, no Redis, no cloud bills.
 - **Blazing Fast** — 10-100x faster search via usearch HNSW. Adaptive: brute-force for <10k vectors (perfect recall), HNSW for larger collections.
 - **Truly Portable** — Runs anywhere SQLite runs: Linux, macOS, Windows, even WASM.
-- **Async Ready** — Full async/await support for web servers and concurrent workloads.
+- **Async Ready** — Full async/await support with optional executor injection for thread-safe ONNX/usearch sharing.
 - **Batteries Included** — Optional FastAPI embeddings server + LangChain/LlamaIndex integrations via `[integrations]` extra.
 - **Production Ready** — Hybrid search (BM25 + vector), metadata filtering, multi-collection support, and automatic hardware acceleration.
@@ -140,10 +140,13 @@ hybrid = collection.hybrid_search("powerhouse cell", k=2)
 **Optional: Run embeddings server (OpenAI-compatible)**
 ```bash
-simplevecdb-server --port 8000
+simplevecdb-server --port 8000                # Default model, auto warm-up
+simplevecdb-server --host 0.0.0.0 --port 9000 # Bind to all interfaces
+simplevecdb-server --no-warmup                # Skip model preload on startup
+simplevecdb-server --help                     # Show all options
 ```
-See [Setup Guide](ENV_SETUP.md) for configuration: model registry, rate limits, API keys, CUDA optimization.
+See [Setup Guide](ENV_SETUP.md) for configuration: model registry, rate limits, API keys, CORS, CUDA optimization.
 ### Option 3: With LangChain or LlamaIndex
@@ -292,6 +295,36 @@ parent = collection.get_parent(child_ids[0])
 descendants = collection.get_descendants(parent_ids[0])
 ```
+### Document Management (v2.4+)
+Query and update documents without touching private internals:
+```python
+# Get all documents (with optional metadata filter)
+docs = collection.get_documents(filter_dict={"category": "tech"})
+for doc_id, text, metadata in docs:
+    print(f"[{doc_id}] {text[:50]}...")
+# Paginated access (v2.5+)
+page1 = collection.get_documents(limit=100)
+page2 = collection.get_documents(limit=100, offset=100)
+# Fetch stored embeddings
+embeddings = collection.get_embeddings_by_ids([1, 2, 3])
+# Batch update metadata (shallow merge)
+collection.update_metadata([
+    (1, {"reviewed": True}),
+    (2, {"reviewed": True, "score": 0.95}),
+])
+# Quick stats
+print(f"Collection has {collection.count()} documents, dim={collection.dim}")
+# Delete an entire collection (v2.5+)
+db.delete_collection("old_data")
+```
 ### Vector Clustering (v2.2+)
 Discover natural groupings in your embeddings:
@@ -330,6 +363,12 @@ Supports K-means, MiniBatch K-means, and HDBSCAN. See [Clustering Guide](https:/
 | **Document Hierarchies**  | ✅     | Parent/child relationships for chunked docs                  |
 | **Vector Clustering**     | ✅     | K-means, MiniBatch K-means, HDBSCAN with auto-tagging (v2.2+) |
 | **Cluster Persistence**   | ✅     | Save/load cluster centroids for fast assignment (v2.2+)      |
+| **Public Catalog API**    | ✅     | `get_documents`, `get_embeddings_by_ids`, `update_metadata` (v2.4+) |
+| **Executor Injection**    | ✅     | Share thread pool across async instances for ONNX safety (v2.4+) |
+| **Collection Management** | ✅     | `delete_collection()`, paginated `get_documents(limit=, offset=)` (v2.5+) |
+| **Cross-Process Safety**  | ✅     | Advisory file locking on usearch index files (v2.5+) |
+| **FLOAT16 Quantization**  | ✅     | Half-precision storage with 2x compression (v2.5+) |
+| **Embeddings Server**     | ✅     | CORS, graceful shutdown, input validation, model warm-up (v2.5+) |
 ## Performance Benchmarks
@@ -400,6 +439,11 @@ pip install torch --index-url https://download.pytorch.org/whl/cu118
 - [x] Hierarchical document relationships (parent/child)
 - [x] Cross-collection search
 - [x] Vector clustering and auto-tagging (v2.2)
+- [x] Public catalog API for document management (v2.4)
+- [x] Async executor injection for thread-safe sharing (v2.4)
+- [x] Collection management: `delete_collection()`, pagination (v2.5)
+- [x] Cross-process file locking and connection health checks (v2.5)
+- [x] Embeddings server hardening: CORS, graceful shutdown, input validation (v2.5)
 - [ ] Incremental clustering (online learning)
 - [ ] Cluster visualization exports
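
The feature table in this hunk lists FLOAT16 quantization as half-precision storage with 2x compression. The size arithmetic can be checked with the standard library's `struct` module, whose `e` format code is IEEE 754 binary16. This is an illustration only: `to_float16_bytes` and `from_float16_bytes` are hypothetical names, and the package's own `serialize()`/`deserialize()` may be implemented differently (for example with NumPy).

```python
import struct


def to_float16_bytes(vector):
    """Pack floats as little-endian half precision (2 bytes per value)."""
    return struct.pack(f"<{len(vector)}e", *vector)


def from_float16_bytes(blob):
    """Unpack a little-endian half-precision blob back into Python floats."""
    count = len(blob) // 2
    return list(struct.unpack(f"<{count}e", blob))


vector = [0.5, -1.25, 3.0, 0.1]
f32 = struct.pack(f"<{len(vector)}f", *vector)  # 4 bytes per value
f16 = to_float16_bytes(vector)                  # 2 bytes per value
print(len(f32), len(f16))  # 16 8

restored = from_float16_bytes(f16)
# 0.5, -1.25 and 3.0 are exactly representable in half precision;
# 0.1 is not, so it comes back slightly rounded.
print(restored[:3])  # [0.5, -1.25, 3.0]
```

The halved size comes at the cost of precision, which is why values such as 0.1 do not round-trip exactly.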

{simplevecdb-2.3.0 → simplevecdb-2.5.0/docs}/CHANGELOG.md RENAMED

@@ -5,6 +5,41 @@ All notable changes to SimpleVecDB will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [2.4.0] - 2026-03-22
+### Added
+- **Public catalog API on VectorCollection + AsyncVectorCollection:**
+  - `get_documents(filter_dict=)` — replaces private `_catalog` access
+  - `get_embeddings_by_ids(ids)` — fetch stored embeddings
+  - `update_metadata(updates)` — batch metadata merge
+  - `count()`, `save()`, `dim` property — async wrappers
+  - `add_texts(parent_ids=, threads=)` — full param support on async
+  - `rebuild_index`, `get_children/parent/descendants/ancestors`, `set_parent` — async hierarchy API
+- **Executor injection on AsyncVectorDB** — accept optional `executor` keyword argument so consumers can share a single-threaded executor for ONNX/usearch thread safety; `close()` only shuts down executor when `_owns_executor` is True
+- **Safety constants** in `constants.py`: `SEARCH_COLLECTION_TIMEOUT`, `EXECUTOR_SHUTDOWN_TIMEOUT`, `MAX_HIERARCHY_DEPTH`
+### Fixed
+- **VectorDB.close()** now calls `conn.close()` — was leaking file descriptors when `save()` succeeded but connection was never closed
+- **VectorDB.close()** wraps `save()` in `try/finally` so `conn.close()` always runs even if index serialization fails
+- **add_documents ID recovery** uses `last_insert_rowid()` arithmetic instead of `ORDER BY id DESC LIMIT N`, which raced under concurrent inserts
+- **String metadata filter** uses exact equality (`=`) instead of `LIKE` substring match — `{"type": "doc"}` no longer matches `"markdown_doc"`
+- **update_metadata_batch** wrapped in single transaction (`with self.conn`) to prevent partial commits on crash
+- **rebuild_index** uses `if x is not None` instead of `x or default` so passing `connectivity=0` no longer silently uses the default
+- **search_collections** parallel futures now have a 30s timeout — one hung collection can no longer block the entire cross-collection search
+- **AsyncVectorDB.close()** uses `shutdown(wait=False, cancel_futures=True)` instead of blocking `shutdown(wait=True)` which could hang forever on stuck tasks
+- **Recursive CTE safety cap** — `get_descendants`/`get_ancestors` apply `MAX_HIERARCHY_DEPTH=100` when `max_depth=None` to prevent infinite recursion from parent_id cycles
+- **RateLimiter cleanup** capped to 500 evictions per call to bound lock hold time under high bucket counts
+- **HuggingFace download** now uses `etag_timeout=30` with local-cache fallback on network failure
+- **embed_texts** rejects batches over 10,000 texts to prevent unbounded CPU time
+- **retry_on_lock** adds `total_timeout=10s` budget — gives up early if cumulative sleep would exceed the budget
+### Changed
+- **`__version__`** now read from package metadata via `importlib.metadata` (single source of truth in `pyproject.toml`)
+- **Upsert in usearch_index** separates conflict detection from removal for clearer flow
 ## [2.3.0] - 2026-03-08
 ### Breaking Changes
@@ -485,6 +520,7 @@ Benchmarks on i9-13900K & RTX 4090 with 10k vectors (384-dim):
 - **Documentation**: https://coderdayton.github.io/simplevecdb/
 - **License**: MIT
+[2.4.0]: https://github.com/coderdayton/simplevecdb/releases/tag/v2.4.0
 [2.3.0]: https://github.com/coderdayton/simplevecdb/releases/tag/v2.3.0
 [2.2.1]: https://github.com/coderdayton/simplevecdb/releases/tag/v2.2.1
 [2.2.0]: https://github.com/coderdayton/simplevecdb/releases/tag/v2.2.0
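
The recursive-CTE safety cap described in the 2.4.0 entry can be reproduced with the standard `sqlite3` module. The sketch below is an illustration under stated assumptions, not the package's query: the `docs` table, its columns, and this `get_descendants` function are made up for the example. Only the `MAX_HIERARCHY_DEPTH = 100` value and the "apply it when `max_depth=None`" rule come from the changelog.

```python
import sqlite3

MAX_HIERARCHY_DEPTH = 100  # value quoted in the changelog


def get_descendants(conn, doc_id, max_depth=None):
    """Return descendant ids, capping recursion depth even when a cycle exists."""
    depth_cap = MAX_HIERARCHY_DEPTH if max_depth is None else max_depth
    rows = conn.execute(
        """
        WITH RECURSIVE tree(id, depth) AS (
            SELECT id, 1 FROM docs WHERE parent_id = ?
            UNION ALL
            SELECT d.id, t.depth + 1
            FROM docs AS d JOIN tree AS t ON d.parent_id = t.id
            WHERE t.depth < ?
        )
        SELECT DISTINCT id FROM tree ORDER BY id
        """,
        (doc_id, depth_cap),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, parent_id INTEGER)")
# Chain 1 -> 2 -> 3, plus a corrupt back-edge that makes 1 a child of 3.
conn.executemany("INSERT INTO docs VALUES (?, ?)", [(1, 3), (2, 1), (3, 2)])

# Without the depth cap this query would recurse forever on the cycle.
# With it, the walk stops at depth 100; 1 appears because the cycle makes
# it its own descendant.
print(get_descendants(conn, 1))               # [1, 2, 3]
print(get_descendants(conn, 1, max_depth=1))  # [2]
```

The cap bounds the amount of work rather than detecting the cycle, so a cyclic hierarchy still returns the starting document among its own descendants.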

{simplevecdb-2.3.0 → simplevecdb-2.5.0}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "simplevecdb"
-version = "2.3.0"
+version = "2.5.0"
 description = "Dead-simple local vector database powered by usearch HNSW."
 authors = [{ name = "Dayton Dunbar", email = "coderdayton14@gmail.com" }]
 license = { text = "MIT" }
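
The changelog notes that `__version__` is now read from package metadata, which leaves the `version` field in this file as the single source of truth. A sketch of that pattern using the standard `importlib.metadata` follows. The `read_version` helper and its `PackageNotFoundError` fallback are additions for illustration; they do not appear in the diff, which calls `version("simplevecdb")` directly.

```python
from importlib.metadata import PackageNotFoundError, version


def read_version(dist_name: str, default: str = "0.0.0+unknown") -> str:
    """Return the installed distribution's version, or a default if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # Happens when running from a source checkout that was never installed.
        return default


print(read_version("simplevecdb"))
```

The fallback is a design choice for source checkouts; without it, importing the package from an uninstalled tree would raise at import time.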

{simplevecdb-2.3.0 → simplevecdb-2.5.0}/src/simplevecdb/__init__.py RENAMED

@@ -16,10 +16,18 @@ try:
 except ImportError:
     pass
 from .logging import get_logger, configure_logging, log_operation
-from .utils import DatabaseLockedError, retry_on_lock, validate_filter
+from .utils import (
+    DatabaseLockedError,
+    async_retry_on_lock,
+    file_lock,
+    retry_on_lock,
+    validate_filter,
+)
 from .encryption import EncryptionError, EncryptionUnavailableError
-__version__ = "2.3.0"
+from importlib.metadata import version as _pkg_version
+__version__ = _pkg_version("simplevecdb")
 __all__ = [
     # Core classes
     "VectorDB",
@@ -47,6 +55,8 @@ __all__ = [
     "MigrationRequiredError",
     "EncryptionError",
     "EncryptionUnavailableError",
+    "async_retry_on_lock",
+    "file_lock",
     "retry_on_lock",
     "validate_filter",
 ]
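
`async_retry_on_lock`, newly exported here, is described in the changelog as the async variant of `retry_on_lock` that sleeps with `asyncio.sleep` rather than `time.sleep`. A minimal sketch of that idea follows. The decorator name `async_retry_sketch`, its parameters, and the exponential backoff schedule are assumptions; the real decorator's signature is not shown in this diff. The `total_timeout` budget mirrors the behaviour the 2.4.0 entry describes for `retry_on_lock`.

```python
import asyncio
import functools
import sqlite3


def async_retry_sketch(max_retries=5, base_delay=0.01, total_timeout=10.0):
    """Retry an async callable on 'database is locked' without blocking a thread."""

    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            slept = 0.0
            for attempt in range(max_retries + 1):
                try:
                    return await func(*args, **kwargs)
                except sqlite3.OperationalError as exc:
                    if "database is locked" not in str(exc):
                        raise  # unrelated error: do not retry
                    delay = base_delay * (2 ** attempt)
                    # Give up when retries are exhausted or the sleep budget
                    # would be exceeded.
                    if attempt == max_retries or slept + delay > total_timeout:
                        raise
                    await asyncio.sleep(delay)  # yields to the event loop
                    slept += delay

        return wrapper

    return decorator


calls = {"n": 0}


@async_retry_sketch(max_retries=3, base_delay=0.001)
async def flaky_write():
    calls["n"] += 1
    if calls["n"] < 3:
        raise sqlite3.OperationalError("database is locked")
    return "ok"


print(asyncio.run(flaky_write()))  # ok
```

Because the wait uses `asyncio.sleep`, other coroutines keep running while the retry backs off, which is the benefit the changelog attributes to the async variant.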
