knowledge-rag 3.7.0__tar.gz → 3.8.0__tar.gz

@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: knowledge-rag
3
- Version: 3.7.0
3
+ Version: 3.8.0
4
4
  Summary: Local RAG System for Claude Code — Hybrid search + Cross-encoder Reranking + 12 MCP Tools + 20 Format Parsers. Zero external servers.
5
5
  Project-URL: Homepage, https://github.com/lyonzin/knowledge-rag
6
6
  Project-URL: Repository, https://github.com/lyonzin/knowledge-rag
@@ -42,7 +42,7 @@ Description-Content-Type: text/markdown
42
42
 
43
43
  [![PyPI](https://img.shields.io/pypi/v/knowledge-rag)](https://pypi.org/project/knowledge-rag/)
44
44
  [![NPM](https://img.shields.io/npm/v/knowledge-rag)](https://www.npmjs.com/package/knowledge-rag)
45
- [![Downloads](https://static.pepy.tech/badge/knowledge-rag/month)](https://pepy.tech/project/knowledge-rag)
45
+ [![PyPI Downloads](https://static.pepy.tech/personalized-badge/knowledge-rag?period=total&units=INTERNATIONAL_SYSTEM&left_color=BLACK&right_color=GREEN&left_text=downloads)](https://pepy.tech/projects/knowledge-rag)
46
46
  ![Python](https://img.shields.io/badge/python-3.11%2B-green.svg)
47
47
  ![License](https://img.shields.io/badge/license-MIT-yellow.svg)
48
48
  ![Platform](https://img.shields.io/badge/platform-Windows%20%7C%20Linux%20%7C%20macOS-lightgrey.svg)
@@ -71,11 +71,21 @@ pip install knowledge-rag → restart Claude Code → search_knowledge("your que
71
71
 
72
72
  ---
73
73
 
74
- ## What's New in v3.6.0
74
+ ## What's New in v3.8.0
75
75
 
76
- ### Multi-Language Code Parsing
76
+ ### Lazy-Loaded Embeddings — Cheaper Idle Processes
77
77
 
78
- Language-aware extraction for **C**, **C++**, **JavaScript**, **TypeScript**, and **XML**: functions, classes, structs, interfaces, imports, and namespaces are captured as searchable metadata. Total supported formats: **20**.
78
+ The FastEmbed ONNX model (~200MB resident) now loads on the **first query**, not at startup, so idle `knowledge-rag` processes are cheap. Why this matters: MCP stdio is one-process-per-client by protocol, so multiple Claude Code windows, Claude Desktop plus an IDE running simultaneously, or review/approval flows that open extra connections each spawn their own process. Before v3.8.0, every one of them paid the full embedding-model cost up front. Now only processes that actually serve queries load the model. Public API is unchanged.
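
The lazy-loading idea described above can be illustrated with a minimal sketch. This is not the package's code: `LazyResource` and its `factory` argument are hypothetical names used only for illustration. The sketch defers an expensive constructor to first use and guards it with a lock, so concurrent first callers trigger a single load.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyResource(Generic[T]):
    """Hypothetical helper: build an expensive object on first use, exactly once."""

    def __init__(self, factory: Callable[[], T]) -> None:
        self._factory = factory          # cheap to store, expensive to call
        self._value: Optional[T] = None  # nothing is loaded at construction time
        self._loaded = False
        self._lock = threading.Lock()

    def get(self) -> T:
        if self._loaded:                 # fast path once the value exists
            return self._value           # type: ignore[return-value]
        with self._lock:
            if not self._loaded:         # re-check under the lock
                self._value = self._factory()
                self._loaded = True
        return self._value               # type: ignore[return-value]
```

A process that never calls `get()` never pays the cost of `factory()`, which is the property the release notes describe for idle server processes.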
79
+
80
+ ### Opt-In Single-Instance Guard
81
+
82
+ For users who measured their setup and want a hard cap of one server per `data_dir`:
83
+
84
+ ```bash
85
+ export KNOWLEDGE_RAG_SINGLE_INSTANCE=1
86
+ ```
87
+
88
+ A second instance exits immediately with code 75. **OFF by default** so multi-client MCP usage continues to work unchanged. Stale-PID recovery and SIGINT/SIGTERM cleanup are included. Full guide in [docs/single-instance.md](docs/single-instance.md). Sample MCP config in [examples/mcp-config-single-instance.json](examples/mcp-config-single-instance.json).
79
89
 
80
90
  ### 5 Ways to Install
81
91
 
@@ -91,6 +101,7 @@ All methods produce the same MCP server. See [Installation](#installation) for f
91
101
 
92
102
  ### Recent Highlights
93
103
 
104
+ - **v3.8.0** — Lazy-load embeddings, opt-in single-instance guard, version sync across PyPI/NPM/Docker
94
105
  - **v3.6.0** — Multi-language code parsing (C/C++/JS/TS/XML), NPM wrapper, Docker image, automated release pipeline
95
106
  - **v3.5.2** — CUDA DLL auto-discovery from pip packages, graceful GPU→CPU fallback, explicit CPU provider (no CUDA noise when `gpu: false`), BASE_DIR resolution fix for editable installs
96
107
  - **v3.5.1** — Remove Python `<3.13` upper bound — 3.13 and 3.14 now supported
@@ -1088,12 +1099,33 @@ The cross-encoder reranker model is lazy-loaded on the first query. This adds a
1088
1099
 
1089
1100
  ### Memory usage
1090
1101
 
1091
- With ~200 documents, expect ~300-500MB RAM. The embedding model (~50MB) and reranker (~25MB) are loaded into memory. For very large knowledge bases (1000+ documents), consider enabling GPU acceleration and using exclude patterns to limit index scope.
1102
+ With ~200 documents, expect ~300-500MB RAM. The embedding model (~200MB ONNX runtime resident, lazy-loaded on first query since v3.8.0) and reranker (~25MB, lazy-loaded) are loaded into memory only when actually used. For very large knowledge bases (1000+ documents), consider enabling GPU acceleration and using exclude patterns to limit index scope.
1103
+
1104
+ ### Multiple MCP clients spawn duplicate servers
1105
+
1106
+ MCP stdio is one process per client by protocol — multiple Claude Code windows, Claude Desktop + IDE, etc. each spawn their own `knowledge-rag` process. Since v3.8.0 idle processes are cheap (no embedding model loaded until first query). If you've measured and want a hard cap of one server per data directory, opt in:
1107
+
1108
+ ```bash
1109
+ export KNOWLEDGE_RAG_SINGLE_INSTANCE=1
1110
+ ```
1111
+
1112
+ A second instance exits immediately with code 75. Default is OFF (multi-client friendly). Full guide: [docs/single-instance.md](docs/single-instance.md). Sample MCP config: [examples/mcp-config-single-instance.json](examples/mcp-config-single-instance.json).
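
A launcher or supervising script may want to tell the "already running" exit apart from a real failure. The following is a minimal sketch, assuming only what the text above states (exit code 75 when the guard refuses to start). The helper names `classify_exit` and `run_and_classify` are hypothetical and not part of the package.

```python
import subprocess
import sys

EX_TEMPFAIL = 75  # exit code the guard uses, per the text above


def classify_exit(returncode: int) -> str:
    """Map a server exit code to a coarse label (hypothetical helper)."""
    if returncode == 0:
        return "clean-exit"
    if returncode == EX_TEMPFAIL:
        return "already-running"
    return "error"


def run_and_classify(command: list[str]) -> str:
    """Run a command to completion and classify how it ended."""
    completed = subprocess.run(command, check=False)
    return classify_exit(completed.returncode)


if __name__ == "__main__":
    # Stand-in child process that exits with 75; substitute the real server
    # command from your own MCP client configuration.
    print(run_and_classify([sys.executable, "-c", "raise SystemExit(75)"]))
```

Treating 75 as "another instance owns this data directory" rather than as a crash avoids pointless restart loops in a supervisor.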
1092
1113
 
1093
1114
  ---
1094
1115
 
1095
1116
  ## Changelog
1096
1117
 
1118
+ ### v3.8.0 (2026-05-10)
1119
+
1120
+ - **NEW**: Lazy-load FastEmbed embedding model (~200MB ONNX runtime). Loads on first query instead of startup — idle `knowledge-rag` processes are now cheap, which matters when MCP stdio clients spawn parallel server processes (multiple Claude Code windows, Claude Desktop + IDE, etc.). Public API unchanged. (#32)
1121
+ - **NEW**: Opt-in single-instance guard via `KNOWLEDGE_RAG_SINGLE_INSTANCE=1` env var. **OFF by default** — multi-client MCP usage continues to work unchanged. When enabled, a second server process for the same `data_dir` exits with code 75 (`EX_TEMPFAIL`). Includes stale-PID recovery and SIGINT/SIGTERM handlers. See [docs/single-instance.md](docs/single-instance.md). (#33, original concept by @Hohlas in #31)
1122
+ - **NEW**: `examples/mcp-config-single-instance.json` — sample MCP client config for the opt-in guard.
1123
+ - **DOCS**: New `docs/single-instance.md` — when to use, when NOT to use, troubleshooting, full activation reference.
1124
+ - **DOCS**: README troubleshooting section for "Multiple MCP clients spawn duplicate servers" + memory-usage note for lazy embeddings.
1125
+ - **CHORE**: Sync version across `pyproject.toml`, `mcp_server/__init__.py`, and `npm/package.json` (was drifting since v3.5.x).
1126
+ - **CHORE**: pytest `tmp_path_retention_count=1` to avoid Windows atexit cleanup race in CI.
1127
+ - **ROADMAP**: Tracked v4.0 shared-service architecture (one daemon, many thin MCP clients) as the long-term fix for multi-process resource duplication. (#34)
1128
+
1097
1129
  ### Unreleased
1098
1130
 
1099
1131
  - **FIX**: Startup preflight probes ChromaDB in a child process and moves crashing persistent indexes to `data/backups/auto-repair-*` before MCP initialization.
@@ -4,7 +4,7 @@
4
4
 
5
5
  [![PyPI](https://img.shields.io/pypi/v/knowledge-rag)](https://pypi.org/project/knowledge-rag/)
6
6
  [![NPM](https://img.shields.io/npm/v/knowledge-rag)](https://www.npmjs.com/package/knowledge-rag)
7
- [![Downloads](https://static.pepy.tech/badge/knowledge-rag/month)](https://pepy.tech/project/knowledge-rag)
7
+ [![PyPI Downloads](https://static.pepy.tech/personalized-badge/knowledge-rag?period=total&units=INTERNATIONAL_SYSTEM&left_color=BLACK&right_color=GREEN&left_text=downloads)](https://pepy.tech/projects/knowledge-rag)
8
8
  ![Python](https://img.shields.io/badge/python-3.11%2B-green.svg)
9
9
  ![License](https://img.shields.io/badge/license-MIT-yellow.svg)
10
10
  ![Platform](https://img.shields.io/badge/platform-Windows%20%7C%20Linux%20%7C%20macOS-lightgrey.svg)
@@ -33,11 +33,21 @@ pip install knowledge-rag → restart Claude Code → search_knowledge("your que
33
33
 
34
34
  ---
35
35
 
36
- ## What's New in v3.6.0
36
+ ## What's New in v3.8.0
37
37
 
38
- ### Multi-Language Code Parsing
38
+ ### Lazy-Loaded Embeddings — Cheaper Idle Processes
39
39
 
40
- Language-aware extraction for **C**, **C++**, **JavaScript**, **TypeScript**, and **XML**: functions, classes, structs, interfaces, imports, and namespaces are captured as searchable metadata. Total supported formats: **20**.
40
+ The FastEmbed ONNX model (~200MB resident) now loads on the **first query**, not at startup, so idle `knowledge-rag` processes are cheap. Why this matters: MCP stdio is one-process-per-client by protocol, so multiple Claude Code windows, Claude Desktop plus an IDE running simultaneously, or review/approval flows that open extra connections each spawn their own process. Before v3.8.0, every one of them paid the full embedding-model cost up front. Now only processes that actually serve queries load the model. Public API is unchanged.
41
+
42
+ ### Opt-In Single-Instance Guard
43
+
44
+ For users who measured their setup and want a hard cap of one server per `data_dir`:
45
+
46
+ ```bash
47
+ export KNOWLEDGE_RAG_SINGLE_INSTANCE=1
48
+ ```
49
+
50
+ A second instance exits immediately with code 75. **OFF by default** so multi-client MCP usage continues to work unchanged. Stale-PID recovery and SIGINT/SIGTERM cleanup are included. Full guide in [docs/single-instance.md](docs/single-instance.md). Sample MCP config in [examples/mcp-config-single-instance.json](examples/mcp-config-single-instance.json).
41
51
 
42
52
  ### 5 Ways to Install
43
53
 
@@ -53,6 +63,7 @@ All methods produce the same MCP server. See [Installation](#installation) for f
53
63
 
54
64
  ### Recent Highlights
55
65
 
66
+ - **v3.8.0** — Lazy-load embeddings, opt-in single-instance guard, version sync across PyPI/NPM/Docker
56
67
  - **v3.6.0** — Multi-language code parsing (C/C++/JS/TS/XML), NPM wrapper, Docker image, automated release pipeline
57
68
  - **v3.5.2** — CUDA DLL auto-discovery from pip packages, graceful GPU→CPU fallback, explicit CPU provider (no CUDA noise when `gpu: false`), BASE_DIR resolution fix for editable installs
58
69
  - **v3.5.1** — Remove Python `<3.13` upper bound — 3.13 and 3.14 now supported
@@ -1050,12 +1061,33 @@ The cross-encoder reranker model is lazy-loaded on the first query. This adds a
1050
1061
 
1051
1062
  ### Memory usage
1052
1063
 
1053
- With ~200 documents, expect ~300-500MB RAM. The embedding model (~50MB) and reranker (~25MB) are loaded into memory. For very large knowledge bases (1000+ documents), consider enabling GPU acceleration and using exclude patterns to limit index scope.
1064
+ With ~200 documents, expect ~300-500MB RAM. The embedding model (~200MB ONNX runtime resident, lazy-loaded on first query since v3.8.0) and reranker (~25MB, lazy-loaded) are loaded into memory only when actually used. For very large knowledge bases (1000+ documents), consider enabling GPU acceleration and using exclude patterns to limit index scope.
1065
+
1066
+ ### Multiple MCP clients spawn duplicate servers
1067
+
1068
+ MCP stdio is one process per client by protocol — multiple Claude Code windows, Claude Desktop + IDE, etc. each spawn their own `knowledge-rag` process. Since v3.8.0 idle processes are cheap (no embedding model loaded until first query). If you've measured and want a hard cap of one server per data directory, opt in:
1069
+
1070
+ ```bash
1071
+ export KNOWLEDGE_RAG_SINGLE_INSTANCE=1
1072
+ ```
1073
+
1074
+ A second instance exits immediately with code 75. Default is OFF (multi-client friendly). Full guide: [docs/single-instance.md](docs/single-instance.md). Sample MCP config: [examples/mcp-config-single-instance.json](examples/mcp-config-single-instance.json).
1054
1075
 
1055
1076
  ---
1056
1077
 
1057
1078
  ## Changelog
1058
1079
 
1080
+ ### v3.8.0 (2026-05-10)
1081
+
1082
+ - **NEW**: Lazy-load FastEmbed embedding model (~200MB ONNX runtime). Loads on first query instead of startup — idle `knowledge-rag` processes are now cheap, which matters when MCP stdio clients spawn parallel server processes (multiple Claude Code windows, Claude Desktop + IDE, etc.). Public API unchanged. (#32)
1083
+ - **NEW**: Opt-in single-instance guard via `KNOWLEDGE_RAG_SINGLE_INSTANCE=1` env var. **OFF by default** — multi-client MCP usage continues to work unchanged. When enabled, a second server process for the same `data_dir` exits with code 75 (`EX_TEMPFAIL`). Includes stale-PID recovery and SIGINT/SIGTERM handlers. See [docs/single-instance.md](docs/single-instance.md). (#33, original concept by @Hohlas in #31)
1084
+ - **NEW**: `examples/mcp-config-single-instance.json` — sample MCP client config for the opt-in guard.
1085
+ - **DOCS**: New `docs/single-instance.md` — when to use, when NOT to use, troubleshooting, full activation reference.
1086
+ - **DOCS**: README troubleshooting section for "Multiple MCP clients spawn duplicate servers" + memory-usage note for lazy embeddings.
1087
+ - **CHORE**: Sync version across `pyproject.toml`, `mcp_server/__init__.py`, and `npm/package.json` (was drifting since v3.5.x).
1088
+ - **CHORE**: pytest `tmp_path_retention_count=1` to avoid Windows atexit cleanup race in CI.
1089
+ - **ROADMAP**: Tracked v4.0 shared-service architecture (one daemon, many thin MCP clients) as the long-term fix for multi-process resource duplication. (#34)
1090
+
1059
1091
  ### Unreleased
1060
1092
 
1061
1093
  - **FIX**: Startup preflight probes ChromaDB in a child process and moves crashing persistent indexes to `data/backups/auto-repair-*` before MCP initialization.
@@ -8,7 +8,7 @@ import sys # noqa: I001
8
8
  _original_stdout = sys.stdout
9
9
  sys.stdout = sys.stderr
10
10
 
11
- __version__ = "3.5.2"
11
+ __version__ = "3.8.0"
12
12
  __author__ = "Ailton Rocha (Lyon.)"
13
13
 
14
14
  from .config import Config # noqa: E402
@@ -0,0 +1,188 @@
+ """Optional single-instance guard for the MCP server process.
+
+ Background
+ ----------
+ MCP stdio servers are 1-process-per-client by protocol design. Multiple
+ Claude Code windows, Claude Desktop + IDE running simultaneously, or clients
+ that open extra internal connections during approval/review flows will all
+ spawn additional `knowledge-rag` processes. Each process holds its own
+ embedding model, ChromaDB client, BM25 state, and file watcher.
+
+ Lazy-loading the embedding model (v3.8.0) reduces idle cost dramatically,
+ but some users still want a hard cap of one process per data directory.
+ This module provides that cap as an OPT-IN, never as a default.
+
+ Activation
+ ----------
+ Set the environment variable in your MCP client config:
+
+     KNOWLEDGE_RAG_SINGLE_INSTANCE=1  # also accepts: true, yes, on (case-insensitive)
+
+ When unset (default), `single_instance_lock()` is a no-op and the server
+ behaves exactly as it did before this module existed.
+
+ When enabled, the server creates `<data_dir>/knowledge-rag.lock` containing
+ its PID. A second process starting against the same `data_dir` will detect
+ the live PID and exit with code 75 (EX_TEMPFAIL). Stale locks (PID gone)
+ are cleaned up automatically.
+
+ Cleanup is wired in three places so the lock does not outlive the process:
+   1. Normal exit: contextmanager `finally` block removes the lock.
+   2. SIGINT / SIGTERM: handlers remove the lock and re-raise the default action.
+   3. Crash / SIGKILL: stale-PID detection on the next startup removes it.
+
+ Authors
+ -------
+ - Concept and original guard: Sergey Khokhlov (@Hohlas) in PR #31
+ - Reworked as opt-in + signal handlers + tests: Lyon. (knowledge-rag maintainer)
+ """
+
+ from __future__ import annotations
+
+ import os
+ import signal
+ from contextlib import contextmanager
+ from pathlib import Path
+ from typing import Iterator, Optional
+
+ from .config import config
+
+ LOCK_FILENAME = "knowledge-rag.lock"
+ ALREADY_RUNNING_EXIT_CODE = 75  # EX_TEMPFAIL from sysexits.h
+ ENV_VAR = "KNOWLEDGE_RAG_SINGLE_INSTANCE"
+ _TRUTHY = {"1", "true", "yes", "on"}
+
+
+ class AlreadyRunningError(RuntimeError):
+     """Raised when another knowledge-rag server instance already holds the lock."""
+
+
+ def single_instance_enabled() -> bool:
+     """Return True if the user opted into the single-instance guard.
+
+     Reads `KNOWLEDGE_RAG_SINGLE_INSTANCE`. Accepts ``1``, ``true``, ``yes``, ``on``
+     (case-insensitive, surrounding whitespace ignored). Anything else — including
+     unset, empty, ``0``, ``false`` — leaves the guard disabled.
+     """
+     raw = os.environ.get(ENV_VAR, "").strip().lower()
+     return raw in _TRUTHY
+
+
+ def _pid_is_running(pid: int) -> bool:
+     """Return True if a process with PID appears to be alive."""
+     if pid <= 0:
+         return False
+     try:
+         os.kill(pid, 0)
+     except ProcessLookupError:
+         return False
+     except PermissionError:
+         # Process exists but is owned by another user / has tighter ACLs
+         return True
+     except OSError:
+         return False
+     return True
+
+
+ def _read_lock_pid(lock_path: Path) -> Optional[int]:
+     try:
+         raw = lock_path.read_text(encoding="utf-8").strip().splitlines()[0]
+         return int(raw)
+     except (IndexError, OSError, ValueError):
+         return None
+
+
+ def _lock_path() -> Path:
+     return config.data_dir / LOCK_FILENAME
+
+
+ def _remove_if_ours(lock_path: Path) -> None:
+     """Remove the lock file ONLY if it still references our PID."""
+     if _read_lock_pid(lock_path) == os.getpid():
+         try:
+             lock_path.unlink()
+         except FileNotFoundError:
+             pass
+         except OSError:
+             # Best-effort; stale-PID check on next startup will recover
+             pass
+
+
+ @contextmanager
+ def single_instance_lock() -> Iterator[Optional[Path]]:
+     """Acquire the single-instance lock if opt-in flag is set.
+
+     No-op when ``KNOWLEDGE_RAG_SINGLE_INSTANCE`` is unset / falsy — yields ``None``
+     and the caller proceeds normally with no side effects on disk.
+
+     When enabled:
+     - Creates ``<data_dir>/knowledge-rag.lock`` containing this process's PID.
+     - Raises :class:`AlreadyRunningError` if another live PID already holds it.
+     - Recovers stale locks (PID no longer running).
+     - Registers SIGINT/SIGTERM handlers that remove the lock and re-raise.
+     - Removes the lock on normal exit via ``finally``.
+     """
+     if not single_instance_enabled():
+         yield None
+         return
+
+     config.data_dir.mkdir(parents=True, exist_ok=True)
+     lock_path = _lock_path()
+
+     while True:
+         try:
+             fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
+         except FileExistsError:
+             pid = _read_lock_pid(lock_path)
+             if pid is not None and _pid_is_running(pid):
+                 raise AlreadyRunningError(
+                     f"knowledge-rag MCP server is already running (pid {pid}). "
+                     f"Refusing to start a second instance because "
+                     f"{ENV_VAR} is enabled."
+                 )
+             try:
+                 lock_path.unlink()
+             except FileNotFoundError:
+                 pass
+             except OSError as exc:
+                 raise AlreadyRunningError(f"Failed to clear stale lock {lock_path}: {exc}") from exc
+             continue
+
+         with os.fdopen(fd, "w", encoding="utf-8") as f:
+             f.write(f"{os.getpid()}\n")
+         break
+
+     # Wire signal handlers so SIGINT/SIGTERM cleanup the lock before exit
+     previous_handlers: dict[int, object] = {}
+
+     def _signal_cleanup(signum: int, frame) -> None:
+         _remove_if_ours(lock_path)
+         # Restore original handler and re-raise so default action runs
+         prev = previous_handlers.get(signum, signal.SIG_DFL)
+         try:
+             signal.signal(signum, prev)  # type: ignore[arg-type]
+         except (ValueError, OSError):
+             pass
+         # Re-send the signal to ourselves so the original disposition fires
+         os.kill(os.getpid(), signum)
+
+     for sig_name in ("SIGINT", "SIGTERM"):
+         sig = getattr(signal, sig_name, None)
+         if sig is None:
+             continue
+         try:
+             previous_handlers[sig] = signal.getsignal(sig)
+             signal.signal(sig, _signal_cleanup)
+         except (ValueError, OSError):
+             # signal.signal raises if not on the main thread; tests may hit this
+             pass
+
+     try:
+         yield lock_path
+     finally:
+         _remove_if_ours(lock_path)
+         for sig, prev in previous_handlers.items():
+             try:
+                 signal.signal(sig, prev)  # type: ignore[arg-type]
+             except (ValueError, OSError):
+                 pass
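
The lock-file technique used by the module above (atomic creation with `O_CREAT | O_EXCL`, then stale-PID recovery) can be tried in isolation. What follows is a simplified sketch, not the package's code: `LockHeld`, `pid_alive`, and `acquire` are hypothetical names, and signal handling is omitted. The `os.kill(pid, 0)` probe should be treated as POSIX-only, because Python documents that on Windows `os.kill` with any signal other than `CTRL_C_EVENT` or `CTRL_BREAK_EVENT` terminates the target process.

```python
import os
from pathlib import Path


class LockHeld(RuntimeError):
    """Hypothetical error type for this sketch."""


def pid_alive(pid: int) -> bool:
    """POSIX-only liveness probe: signal 0 checks existence without delivering a signal."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def acquire(lock_path: Path) -> None:
    """Create the lock file atomically, clearing it first if its owner is gone."""
    while True:
        try:
            # O_EXCL makes creation fail if the file already exists
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        except FileExistsError:
            try:
                owner = int(lock_path.read_text(encoding="utf-8").strip())
            except (OSError, ValueError):
                owner = -1  # unreadable or garbage content counts as stale
            if pid_alive(owner):
                raise LockHeld(f"lock held by pid {owner}")
            lock_path.unlink(missing_ok=True)
            continue  # retry creation after removing the stale lock
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(f"{os.getpid()}\n")
        return
```

Retrying in a loop after removing a stale file matters: another process may recreate the lock between the `unlink` and the next `os.open`, and the exclusive create is what decides the winner.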
@@ -136,6 +136,17 @@ class FastEmbedEmbeddings:
      Uses ONNX Runtime in-process for embedding generation.
      No external server required (replaces Ollama).
      Model: BAAI/bge-small-en-v1.5 (384-dim, MTEB score 62.x)
+
+     Lazy-loading (since v3.8.0):
+         The ONNX model (~200MB resident) is NOT loaded in __init__.
+         It loads on the first call to __call__/embed_query/embed_documents.
+         This makes idle MCP server processes cheap, which matters when
+         multiple stdio clients spawn parallel knowledge-rag processes
+         (e.g. multiple Claude Code windows). The CrossEncoderReranker
+         already follows this same pattern.
+
+     Thread-safe: load is guarded by a lock so concurrent first-callers
+     don't double-initialize the model.
      """
 
      @staticmethod
@@ -174,20 +185,34 @@ class FastEmbedEmbeddings:
      def __init__(self, model: str = None):
          self.model_name = model or config.embedding_model
          self._dim = config.embedding_dim
-         kwargs = {"model_name": self.model_name, "cache_dir": str(config.models_cache_dir)}
-         if config.gpu_acceleration:
-             self._setup_cuda_dll_paths()
-             kwargs["providers"] = ["CUDAExecutionProvider", "CPUExecutionProvider"]
-             print(f"[INFO] Loading embedding model: {self.model_name} ({self._dim}D) [GPU accelerated]...")
-             try:
-                 self._model = TextEmbedding(**kwargs)
-                 print("[INFO] Embedding model loaded successfully [GPU]")
-             except (ValueError, RuntimeError) as e:
-                 print(f"[WARN] GPU init failed ({e}), falling back to CPU...")
-                 kwargs["providers"] = ["CPUExecutionProvider"]
-                 self._model = TextEmbedding(**kwargs)
-                 print("[INFO] Embedding model loaded successfully [CPU fallback]")
-         else:
+         # Build kwargs once; defer the heavy TextEmbedding(**kwargs) call to first use.
+         self._init_kwargs = {"model_name": self.model_name, "cache_dir": str(config.models_cache_dir)}
+         self._gpu = bool(config.gpu_acceleration)
+         self._model: Optional[TextEmbedding] = None
+         self._load_lock = threading.Lock()
+
+     def _load_model(self) -> None:
+         """Load the ONNX model on demand. Idempotent and thread-safe."""
+         if self._model is not None:
+             return
+         with self._load_lock:
+             if self._model is not None:  # double-checked under the lock
+                 return
+             kwargs = dict(self._init_kwargs)
+             if self._gpu:
+                 self._setup_cuda_dll_paths()
+                 kwargs["providers"] = ["CUDAExecutionProvider", "CPUExecutionProvider"]
+                 print(f"[INFO] Loading embedding model: {self.model_name} ({self._dim}D) [GPU accelerated]...")
+                 try:
+                     self._model = TextEmbedding(**kwargs)
+                     print("[INFO] Embedding model loaded successfully [GPU]")
+                     return
+                 except (ValueError, RuntimeError) as e:
+                     print(f"[WARN] GPU init failed ({e}), falling back to CPU...")
+                     kwargs["providers"] = ["CPUExecutionProvider"]
+                     self._model = TextEmbedding(**kwargs)
+                     print("[INFO] Embedding model loaded successfully [CPU fallback]")
+                     return
              kwargs["providers"] = ["CPUExecutionProvider"]
              print(f"[INFO] Loading embedding model: {self.model_name} ({self._dim}D)...")
              self._model = TextEmbedding(**kwargs)
@@ -203,6 +228,7 @@ class FastEmbedEmbeddings:
          if not input:
              return []
 
+         self._load_model()
          try:
              embeddings = list(self._model.embed(input))
              return [emb.tolist() for emb in embeddings]
@@ -1934,48 +1960,58 @@ def main():
          _handle_init()
          return
 
+     from .instance_lock import (
+         ALREADY_RUNNING_EXIT_CODE,
+         AlreadyRunningError,
+         single_instance_lock,
+     )
      from .preflight import run_preflight
 
-     run_preflight()
+     try:
+         with single_instance_lock():
+             run_preflight()
 
-     orchestrator = get_orchestrator()
+             orchestrator = get_orchestrator()
 
-     # Migration: check dimension mismatch AFTER full init (avoids segfault during __init__)
-     orchestrator._needs_rebuild = orchestrator._check_dimension_mismatch()
-     if orchestrator._needs_rebuild:
-         print("[MIGRATION] Running nuclear rebuild for embedding model change...")
-         try:
-             stats = orchestrator.nuclear_rebuild()
-             print(
-                 f"[MIGRATION] Rebuild complete: {stats['indexed']} docs, "
-                 f"{stats['chunks_added']} chunks in {stats.get('elapsed_seconds', '?')}s"
-             )
-         except Exception as e:
-             print(f"[ERROR] Migration failed: {e}")
-             print("[FALLBACK] Attempting regular index instead...")
-             stats = orchestrator.index_all(force=True)
-     elif orchestrator.collection.count() == 0:
-         print("[INFO] No documents indexed. Running initial indexing...")
-         stats = orchestrator.index_all()
-         print(f"[INFO] Indexed {stats['indexed']} documents with {stats['chunks_added']} chunks")
+             # Migration: check dimension mismatch AFTER full init (avoids segfault during __init__)
+             orchestrator._needs_rebuild = orchestrator._check_dimension_mismatch()
+             if orchestrator._needs_rebuild:
+                 print("[MIGRATION] Running nuclear rebuild for embedding model change...")
+                 try:
+                     stats = orchestrator.nuclear_rebuild()
+                     print(
+                         f"[MIGRATION] Rebuild complete: {stats['indexed']} docs, "
+                         f"{stats['chunks_added']} chunks in {stats.get('elapsed_seconds', '?')}s"
+                     )
+                 except Exception as e:
+                     print(f"[ERROR] Migration failed: {e}")
+                     print("[FALLBACK] Attempting regular index instead...")
+                     stats = orchestrator.index_all(force=True)
+             elif orchestrator.collection.count() == 0:
+                 print("[INFO] No documents indexed. Running initial indexing...")
+                 stats = orchestrator.index_all()
+                 print(f"[INFO] Indexed {stats['indexed']} documents with {stats['chunks_added']} chunks")
+
+             # Start file watcher for auto-reindex on document changes
+             try:
+                 watcher = DocumentWatcher(get_orchestrator, debounce_seconds=5.0)
+                 observer = Observer()
+                 observer.schedule(watcher, str(config.documents_dir), recursive=True)
+                 observer.daemon = True
+                 observer.start()
+                 print(f"[WATCHER] Monitoring {config.documents_dir} for changes")
+             except Exception as e:
+                 print(f"[WARN] Failed to start file watcher: {e}")
+                 print("[WARN] Auto-reindexing disabled. Use reindex_documents tool manually.")
 
-     # Start file watcher for auto-reindex on document changes
-     try:
-         watcher = DocumentWatcher(get_orchestrator, debounce_seconds=5.0)
-         observer = Observer()
-         observer.schedule(watcher, str(config.documents_dir), recursive=True)
-         observer.daemon = True
-         observer.start()
-         print(f"[WATCHER] Monitoring {config.documents_dir} for changes")
-     except Exception as e:
-         print(f"[WARN] Failed to start file watcher: {e}")
-         print("[WARN] Auto-reindexing disabled. Use reindex_documents tool manually.")
-
-     # Restore real stdout for MCP JSON-RPC, keep print() going to stderr
-     from . import _original_stdout
-
-     sys.stdout = _original_stdout
-     mcp.run()
+             # Restore real stdout for MCP JSON-RPC, keep print() going to stderr
+             from . import _original_stdout
+
+             sys.stdout = _original_stdout
+             mcp.run()
+     except AlreadyRunningError as e:
+         print(f"[ERROR] {e}", file=sys.stderr)
+         raise SystemExit(ALREADY_RUNNING_EXIT_CODE) from e
 
 
  if __name__ == "__main__":
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
4
4
 
5
5
  [project]
6
6
  name = "knowledge-rag"
7
- version = "3.7.0"
7
+ version = "3.8.0"
8
8
  description = "Local RAG System for Claude Code — Hybrid search + Cross-encoder Reranking + 12 MCP Tools + 20 Format Parsers. Zero external servers."
9
9
  readme = "README.md"
10
10
  license = {text = "MIT"}
@@ -95,6 +95,11 @@ exclude = [
95
95
  [tool.pytest.ini_options]
96
96
  testpaths = ["tests"]
97
97
  pythonpath = ["."]
98
+ # Limit retained tmp_path directories to avoid pytest's atexit cleanup race
99
+ # on Windows (cleanup_numbered_dir + pathlib.glob "garbage-*" can fail when
100
+ # many tmp dirs accumulate). Tests run isolated; we don't need history.
101
+ tmp_path_retention_count = 1
102
+ tmp_path_retention_policy = "failed"
98
103
 
99
104
  [tool.ruff]
100
105
  target-version = "py311"
File without changes
File without changes