remote-embedding 0.2.1__tar.gz → 0.3.0__tar.gz
- {remote_embedding-0.2.1/src/remote_embedding.egg-info → remote_embedding-0.3.0}/PKG-INFO +14 -2
- {remote_embedding-0.2.1 → remote_embedding-0.3.0}/README.md +13 -1
- {remote_embedding-0.2.1 → remote_embedding-0.3.0}/pyproject.toml +1 -1
- remote_embedding-0.3.0/src/remote_embedding/__init__.py +12 -0
- {remote_embedding-0.2.1 → remote_embedding-0.3.0}/src/remote_embedding/app.py +128 -18
- {remote_embedding-0.2.1 → remote_embedding-0.3.0/src/remote_embedding.egg-info}/PKG-INFO +14 -2
- remote_embedding-0.2.1/src/remote_embedding/__init__.py +0 -6
- {remote_embedding-0.2.1 → remote_embedding-0.3.0}/LICENSE +0 -0
- {remote_embedding-0.2.1 → remote_embedding-0.3.0}/setup.cfg +0 -0
- {remote_embedding-0.2.1 → remote_embedding-0.3.0}/src/remote_embedding/__main__.py +0 -0
- {remote_embedding-0.2.1 → remote_embedding-0.3.0}/src/remote_embedding/remote.py +0 -0
- {remote_embedding-0.2.1 → remote_embedding-0.3.0}/src/remote_embedding.egg-info/SOURCES.txt +0 -0
- {remote_embedding-0.2.1 → remote_embedding-0.3.0}/src/remote_embedding.egg-info/dependency_links.txt +0 -0
- {remote_embedding-0.2.1 → remote_embedding-0.3.0}/src/remote_embedding.egg-info/entry_points.txt +0 -0
- {remote_embedding-0.2.1 → remote_embedding-0.3.0}/src/remote_embedding.egg-info/requires.txt +0 -0
- {remote_embedding-0.2.1 → remote_embedding-0.3.0}/src/remote_embedding.egg-info/top_level.txt +0 -0
|
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: remote-embedding
-Version: 0.
+Version: 0.3.0
 Summary: A shared FastAPI embedding service and LangChain-compatible remote client for reusing one embedding model across multiple applications and lowering VRAM usage on limited GPUs.
 Author: Meshkat Shariat Bagheri
 License-Expression: MIT
@@ -60,6 +60,9 @@ PowerShell:
 $env:EMBEDDING_MODEL_NAME="BAAI/bge-base-en-v1.5"
 $env:EMBEDDING_DIR="C:\\path\\to\\model-cache"
 $env:DEVICE="cpu"
+$env:MAX_LOADED_MODELS="1"
+$env:MAX_INPUTS_PER_REQUEST="128"
+$env:EMBEDDING_BATCH_SIZE="32"
 $env:MODEL_KWARGS='{"local_files_only": true, "trust_remote_code": true}'
 $env:ENCODE_KWARGS='{"normalize_embeddings": true}'
 ```
@@ -70,6 +73,9 @@ Bash:
 export EMBEDDING_MODEL_NAME=BAAI/bge-base-en-v1.5
 export EMBEDDING_DIR=/path/to/model-cache
 export DEVICE=cpu
+export MAX_LOADED_MODELS=1
+export MAX_INPUTS_PER_REQUEST=128
+export EMBEDDING_BATCH_SIZE=32
 export MODEL_KWARGS='{"local_files_only": true, "trust_remote_code": true}'
 export ENCODE_KWARGS='{"normalize_embeddings": true}'
 ```
@@ -83,6 +89,9 @@ remote-embedding-server \
 --model-name BAAI/bge-base-en-v1.5 \
 --embedding-dir /path/to/model-cache \
 --device cuda \
+--max-loaded-models 1 \
+--max-inputs-per-request 128 \
+--embedding-batch-size 32 \
 --model-kwargs '{"local_files_only": true, "trust_remote_code": true}' \
 --encode-kwargs '{"normalize_embeddings": true}'
 ```
@@ -115,6 +124,9 @@ Server configuration:
 - `EMBEDDING_MODEL_NAME`: default model to preload and use when a request does not pass `model_name`
 - `EMBEDDING_DIR`: optional local cache/model directory for Hugging Face downloads or local files
 - `DEVICE`: device passed to `HuggingFaceEmbeddings`, such as `cpu` or `cuda`
+- `MAX_LOADED_MODELS`: maximum number of embedding model instances kept in memory, default `1`
+- `MAX_INPUTS_PER_REQUEST`: maximum number of strings accepted in one `/embed` request, default `128`
+- `EMBEDDING_BATCH_SIZE`: default encoder `batch_size`, default `32`
 - `MODEL_KWARGS`: JSON object merged into `HuggingFaceEmbeddings(..., model_kwargs=...)`
 - `ENCODE_KWARGS`: JSON object passed to `HuggingFaceEmbeddings(..., encode_kwargs=...)`
 
@@ -130,7 +142,7 @@ Client configuration through `RemoteEmbeddings(...)`:
 
 If `EMBEDDING_MODEL_NAME` is configured on the server, the server can preload one shared embedding model instance and let multiple applications reuse it. That is what saves VRAM versus loading the same model separately in each application process.
 
-`model_kwargs` and `encode_kwargs` become part of the server-side model cache key.
+`model_kwargs` and `encode_kwargs` become part of the server-side model cache key. Different combinations can create different embedding instances. The server evicts older instances once `MAX_LOADED_MODELS` is exceeded, and defaults to keeping one model loaded to protect GPU memory.
 
 ## Use The Client
 
@@ -31,6 +31,9 @@ PowerShell:
 $env:EMBEDDING_MODEL_NAME="BAAI/bge-base-en-v1.5"
 $env:EMBEDDING_DIR="C:\\path\\to\\model-cache"
 $env:DEVICE="cpu"
+$env:MAX_LOADED_MODELS="1"
+$env:MAX_INPUTS_PER_REQUEST="128"
+$env:EMBEDDING_BATCH_SIZE="32"
 $env:MODEL_KWARGS='{"local_files_only": true, "trust_remote_code": true}'
 $env:ENCODE_KWARGS='{"normalize_embeddings": true}'
 ```
@@ -41,6 +44,9 @@ Bash:
 export EMBEDDING_MODEL_NAME=BAAI/bge-base-en-v1.5
 export EMBEDDING_DIR=/path/to/model-cache
 export DEVICE=cpu
+export MAX_LOADED_MODELS=1
+export MAX_INPUTS_PER_REQUEST=128
+export EMBEDDING_BATCH_SIZE=32
 export MODEL_KWARGS='{"local_files_only": true, "trust_remote_code": true}'
 export ENCODE_KWARGS='{"normalize_embeddings": true}'
 ```
@@ -54,6 +60,9 @@ remote-embedding-server \
 --model-name BAAI/bge-base-en-v1.5 \
 --embedding-dir /path/to/model-cache \
 --device cuda \
+--max-loaded-models 1 \
+--max-inputs-per-request 128 \
+--embedding-batch-size 32 \
 --model-kwargs '{"local_files_only": true, "trust_remote_code": true}' \
 --encode-kwargs '{"normalize_embeddings": true}'
 ```
@@ -86,6 +95,9 @@ Server configuration:
 - `EMBEDDING_MODEL_NAME`: default model to preload and use when a request does not pass `model_name`
 - `EMBEDDING_DIR`: optional local cache/model directory for Hugging Face downloads or local files
 - `DEVICE`: device passed to `HuggingFaceEmbeddings`, such as `cpu` or `cuda`
+- `MAX_LOADED_MODELS`: maximum number of embedding model instances kept in memory, default `1`
+- `MAX_INPUTS_PER_REQUEST`: maximum number of strings accepted in one `/embed` request, default `128`
+- `EMBEDDING_BATCH_SIZE`: default encoder `batch_size`, default `32`
 - `MODEL_KWARGS`: JSON object merged into `HuggingFaceEmbeddings(..., model_kwargs=...)`
 - `ENCODE_KWARGS`: JSON object passed to `HuggingFaceEmbeddings(..., encode_kwargs=...)`
 
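Since `MAX_INPUTS_PER_REQUEST` caps the number of strings in one `/embed` request, a caller holding a larger corpus would have to split it before sending. The following is a minimal sketch of that splitting step only. The helper name `chunk_texts` is hypothetical and not part of the package, and whether `RemoteEmbeddings` already splits requests on its own is not shown in this diff.

```python
from typing import Iterator, Sequence


def chunk_texts(texts: Sequence[str], max_inputs: int = 128) -> Iterator[list[str]]:
    """Yield consecutive slices of `texts`, each holding at most `max_inputs` items.

    Hypothetical helper, not part of remote-embedding; 128 mirrors the
    documented default of MAX_INPUTS_PER_REQUEST.
    """
    if max_inputs < 1:
        raise ValueError("max_inputs must be greater than 0.")
    for start in range(0, len(texts), max_inputs):
        yield list(texts[start:start + max_inputs])


# 300 inputs with a limit of 128 give batches of 128, 128 and 44 strings.
batches = list(chunk_texts([f"doc {i}" for i in range(300)]))
batch_sizes = [len(batch) for batch in batches]
```

Each batch could then be sent as its own request and the returned vectors concatenated in order. The limit passed to the helper has to match the value configured on the server.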
@@ -101,7 +113,7 @@ Client configuration through `RemoteEmbeddings(...)`:
 
 If `EMBEDDING_MODEL_NAME` is configured on the server, the server can preload one shared embedding model instance and let multiple applications reuse it. That is what saves VRAM versus loading the same model separately in each application process.
 
-`model_kwargs` and `encode_kwargs` become part of the server-side model cache key.
+`model_kwargs` and `encode_kwargs` become part of the server-side model cache key. Different combinations can create different embedding instances. The server evicts older instances once `MAX_LOADED_MODELS` is exceeded, and defaults to keeping one model loaded to protect GPU memory.
 
 ## Use The Client
 
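The eviction behaviour described in the README paragraph above is a least-recently-used policy over an ordered mapping. Below is a minimal, model-free sketch of that policy. The class name `LruModelCache` and the `loader` callback are assumptions for illustration, and plain objects stand in for embedding models.

```python
from collections import OrderedDict
from typing import Callable


class LruModelCache:
    """Keep at most `max_loaded` values, evicting the least recently used one.

    Illustrative stand-in: the class name and the `loader` callback are
    assumptions, and plain objects take the place of real embedding models.
    """

    def __init__(self, max_loaded: int = 1) -> None:
        if max_loaded < 1:
            raise ValueError("max_loaded must be greater than 0.")
        self.max_loaded = max_loaded
        self.items: OrderedDict[str, object] = OrderedDict()

    def get(self, key: str, loader: Callable[[], object]) -> object:
        if key in self.items:
            # Cache hit: mark the entry as most recently used.
            self.items.move_to_end(key)
            return self.items[key]
        value = loader()
        self.items[key] = value
        # Drop entries from the least recently used end until within the limit.
        while len(self.items) > self.max_loaded:
            self.items.popitem(last=False)
        return value


cache = LruModelCache(max_loaded=2)
cache.get("model-a", object)
cache.get("model-b", object)
cache.get("model-a", object)   # refreshes model-a
cache.get("model-c", object)   # evicts model-b, the least recently used
print(list(cache.items))       # ['model-a', 'model-c']
```

With the default limit of one, every request that resolves to a new cache key replaces the previously loaded instance, so clients that alternate between different `model_kwargs` or `encode_kwargs` would trigger a reload on each switch.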
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "remote-embedding"
-version = "0.
+version = "0.3.0"
 description = "A shared FastAPI embedding service and LangChain-compatible remote client for reusing one embedding model across multiple applications and lowering VRAM usage on limited GPUs."
 readme = "README.md"
 requires-python = ">=3.10"
@@ -0,0 +1,12 @@
+"""Public package exports for remote-embedding."""
+
+from importlib.metadata import PackageNotFoundError, version
+
+from .remote import RemoteEmbeddings
+
+__all__ = ["RemoteEmbeddings"]
+
+try:
+    __version__ = version("remote-embedding")
+except PackageNotFoundError:
+    __version__ = "0.0.0"
@@ -2,10 +2,13 @@
 
 import asyncio
 import argparse
+import gc
 import json
 import logging
 import os
+from collections import OrderedDict
 from contextlib import asynccontextmanager
+from importlib.metadata import PackageNotFoundError, version
 from typing import Any, Literal, Optional, Union
 
 import uvicorn
@@ -17,12 +20,23 @@ from pydantic import BaseModel, Field
 load_dotenv()
 logger = logging.getLogger("remote_embedding.server")
 
+try:
+    PACKAGE_VERSION = version("remote-embedding")
+except PackageNotFoundError:
+    PACKAGE_VERSION = "0.0.0"
+
 
 def _env_int(name: str, default: int) -> int:
     value = os.getenv(name)
     return int(value) if value else default
 
 
+def _positive_int(value: int, *, name: str) -> int:
+    if value < 1:
+        raise ValueError(f"{name} must be greater than 0.")
+    return value
+
+
 def _parse_json_mapping(value: Optional[str], *, source: str) -> dict[str, Any]:
     if not value:
         return {}
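How `_env_int` and `_positive_int` combine can be seen in isolation. The sketch below copies the two helpers as they appear in this hunk and adds a hypothetical wrapper, `read_limit`, plus a made-up variable name, `DEMO_LIMIT`. Both are assumptions for illustration only.

```python
import os


def _env_int(name: str, default: int) -> int:
    value = os.getenv(name)
    return int(value) if value else default


def _positive_int(value: int, *, name: str) -> int:
    if value < 1:
        raise ValueError(f"{name} must be greater than 0.")
    return value


def read_limit(name: str, default: int) -> int:
    """Hypothetical wrapper combining both helpers for one variable."""
    return _positive_int(_env_int(name, default), name=name)


os.environ.pop("DEMO_LIMIT", None)
unset_value = read_limit("DEMO_LIMIT", 128)   # variable unset: default is used

os.environ["DEMO_LIMIT"] = ""
empty_value = read_limit("DEMO_LIMIT", 128)   # empty string: default is used

os.environ["DEMO_LIMIT"] = "0"
try:
    read_limit("DEMO_LIMIT", 128)
    zero_error = None
except ValueError as exc:
    zero_error = str(exc)                     # zero is rejected

os.environ.pop("DEMO_LIMIT", None)            # leave the environment clean
```

A non-numeric value such as `abc` would also raise `ValueError`, in that case from `int()` rather than from the positivity check.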
@@ -51,6 +65,15 @@ PORT = _env_int("PORT", 5055)
 EMBEDDING_MODEL_NAME = os.getenv("EMBEDDING_MODEL_NAME")
 EMBEDDING_DIR = os.getenv("EMBEDDING_DIR")
 DEVICE = os.getenv("DEVICE")
+MAX_LOADED_MODELS = _positive_int(_env_int("MAX_LOADED_MODELS", 1), name="MAX_LOADED_MODELS")
+MAX_INPUTS_PER_REQUEST = _positive_int(
+    _env_int("MAX_INPUTS_PER_REQUEST", 128),
+    name="MAX_INPUTS_PER_REQUEST",
+)
+EMBEDDING_BATCH_SIZE = _positive_int(
+    _env_int("EMBEDDING_BATCH_SIZE", 32),
+    name="EMBEDDING_BATCH_SIZE",
+)
 MODEL_KWARGS = _parse_json_mapping(os.getenv("MODEL_KWARGS"), source="MODEL_KWARGS")
 ENCODE_KWARGS = _parse_json_mapping(os.getenv("ENCODE_KWARGS"), source="ENCODE_KWARGS")
 
@@ -76,11 +99,15 @@ class HealthResponse(BaseModel):
     status: str
     model: str
     device: Optional[str]
+    loaded_models: int
+    max_loaded_models: int
+    max_inputs_per_request: int
+    embedding_batch_size: int
 
 
 class EmbeddingService:
     def __init__(self) -> None:
-        self.embed_models:
+        self.embed_models: OrderedDict[str, HuggingFaceEmbeddings] = OrderedDict()
         self.lock = asyncio.Lock()
 
     def _resolve_model_name(self, model_name: Optional[str] = None) -> str:
@@ -109,6 +136,37 @@ class EmbeddingService:
             separators=(",", ":"),
         )
 
+    def _clear_cuda_cache(self) -> None:
+        try:
+            import torch
+        except ImportError:
+            return
+
+        if torch.cuda.is_available():
+            torch.cuda.empty_cache()
+
+    def _release_model(self, embed_model: HuggingFaceEmbeddings) -> None:
+        client = getattr(embed_model, "client", None)
+        if client is not None and hasattr(client, "to"):
+            try:
+                client.to("cpu")
+            except Exception:
+                logger.debug("Failed to move evicted embedding model to CPU.", exc_info=True)
+
+        del embed_model
+        gc.collect()
+        self._clear_cuda_cache()
+
+    def _evict_extra_models(self) -> None:
+        while len(self.embed_models) > MAX_LOADED_MODELS:
+            _, evicted_model = self.embed_models.popitem(last=False)
+            logger.info(
+                "Evicting embedding model from cache. Loaded models now: %s/%s.",
+                len(self.embed_models),
+                MAX_LOADED_MODELS,
+            )
+            self._release_model(evicted_model)
+
     def load(
         self,
         model_name: Optional[str] = None,
@@ -127,7 +185,11 @@ class EmbeddingService:
             MODEL_KWARGS,
             model_kwargs,
         )
-        resolved_encode_kwargs = _merge_mappings(
+        resolved_encode_kwargs = _merge_mappings(
+            {"batch_size": EMBEDDING_BATCH_SIZE},
+            ENCODE_KWARGS,
+            encode_kwargs,
+        )
         cache_key = self._cache_key(
             resolved_model_name,
             resolved_embedding_dir,
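The three-way merge in this hunk suggests a precedence order: the `EMBEDDING_BATCH_SIZE` default first, then the server-wide `ENCODE_KWARGS`, then the per-request `encode_kwargs`. The body of `_merge_mappings` is not part of this diff, so the sketch below assumes a left-to-right merge in which later mappings win and `None` is skipped. The function `merge_mappings` here is a hypothetical stand-in, not the package's implementation.

```python
from typing import Any, Mapping, Optional


def merge_mappings(*mappings: Optional[Mapping[str, Any]]) -> dict[str, Any]:
    """Merge left to right; later mappings override earlier ones, None is skipped.

    Assumed behaviour for illustration; the real `_merge_mappings` body is
    not part of this diff.
    """
    merged: dict[str, Any] = {}
    for mapping in mappings:
        if mapping:
            merged.update(mapping)
    return merged


server_default = {"batch_size": 32}                      # from EMBEDDING_BATCH_SIZE
server_encode_kwargs = {"normalize_embeddings": True}    # from ENCODE_KWARGS
request_encode_kwargs = {"batch_size": 8}                # sent by one client

resolved = merge_mappings(server_default, server_encode_kwargs, request_encode_kwargs)
print(resolved)  # {'batch_size': 8, 'normalize_embeddings': True}
```

Under that assumption, a request that overrides `batch_size` resolves to a different mapping than the server default. Because the resolved encode kwargs feed into the cache key, such a request would be treated as a distinct model instance.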
@@ -135,6 +197,7 @@ class EmbeddingService:
             resolved_encode_kwargs,
         )
         if cache_key in self.embed_models:
+            self.embed_models.move_to_end(cache_key)
             return self.embed_models[cache_key]
 
         logger.info(
@@ -149,6 +212,12 @@ class EmbeddingService:
             cache_folder=resolved_embedding_dir,
         )
         self.embed_models[cache_key] = embed_model
+        logger.info(
+            "Loaded embedding models: %s/%s.",
+            len(self.embed_models),
+            MAX_LOADED_MODELS,
+        )
+        self._evict_extra_models()
         return embed_model
 
     async def embed_documents(
@@ -159,15 +228,14 @@ class EmbeddingService:
         model_kwargs: Optional[dict[str, Any]] = None,
         encode_kwargs: Optional[dict[str, Any]] = None,
     ) -> list[list[float]]:
-
-            model_name,
-            embedding_dir=embedding_dir,
-            model_kwargs=model_kwargs,
-            encode_kwargs=encode_kwargs,
-        )
-
-        # Serialize GPU access to avoid VRAM spikes from concurrent requests.
+        # Serialize model loading and GPU access to avoid duplicate loads and VRAM spikes.
         async with self.lock:
+            embed_model = self.load(
+                model_name,
+                embedding_dir=embedding_dir,
+                model_kwargs=model_kwargs,
+                encode_kwargs=encode_kwargs,
+            )
             return await asyncio.to_thread(embed_model.embed_documents, texts)
 
     async def embed_query(
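The pattern in this hunk, resolving the model and running the blocking encoder call while holding a single `asyncio.Lock`, can be sketched without any real model. The class `SerializedService` and its fake workload are assumptions for illustration. The counter only demonstrates that at most one request is inside the locked section at a time.

```python
import asyncio
import time


class SerializedService:
    """Toy stand-in: names and the fake workload are illustrative assumptions."""

    def __init__(self) -> None:
        self.lock = asyncio.Lock()
        self.active = 0        # requests currently inside the locked section
        self.max_active = 0    # highest value `active` ever reached

    def _load(self) -> str:
        # Stand-in for a cache lookup or model load.
        return "fake-model"

    def _encode(self, model: str, text: str) -> str:
        # Stand-in for a blocking encoder call; runs in a worker thread.
        time.sleep(0.01)
        return f"{model}:{text}"

    async def embed(self, text: str) -> str:
        async with self.lock:
            self.active += 1
            self.max_active = max(self.max_active, self.active)
            try:
                model = self._load()
                return await asyncio.to_thread(self._encode, model, text)
            finally:
                self.active -= 1


async def main() -> tuple[list[str], int]:
    svc = SerializedService()
    results = await asyncio.gather(*(svc.embed(t) for t in ["a", "b", "c"]))
    return list(results), svc.max_active


results, max_active = asyncio.run(main())
print(results, max_active)
```

Keeping the load step inside the lock matters once eviction exists: a load that runs while another request is still encoding could otherwise evict the model that request is using. The trade-off is that requests are processed strictly one at a time.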
@@ -178,14 +246,13 @@ class EmbeddingService:
         model_kwargs: Optional[dict[str, Any]] = None,
         encode_kwargs: Optional[dict[str, Any]] = None,
     ) -> list[float]:
-        embed_model = self.load(
-            model_name,
-            embedding_dir=embedding_dir,
-            model_kwargs=model_kwargs,
-            encode_kwargs=encode_kwargs,
-        )
-
         async with self.lock:
+            embed_model = self.load(
+                model_name,
+                embedding_dir=embedding_dir,
+                model_kwargs=model_kwargs,
+                encode_kwargs=encode_kwargs,
+            )
             return await asyncio.to_thread(embed_model.embed_query, text)
 
 
@@ -199,7 +266,7 @@ async def lifespan(_: FastAPI):
     yield
 
 
-app = FastAPI(title="Shared Embedding Service", version=
+app = FastAPI(title="Shared Embedding Service", version=PACKAGE_VERSION, lifespan=lifespan)
 
 
 @app.get("/health", response_model=HealthResponse)
@@ -221,6 +288,10 @@ async def health() -> HealthResponse:
         status="ok",
         model=loaded_model_name,
         device=DEVICE,
+        loaded_models=len(svc.embed_models),
+        max_loaded_models=MAX_LOADED_MODELS,
+        max_inputs_per_request=MAX_INPUTS_PER_REQUEST,
+        embedding_batch_size=EMBEDDING_BATCH_SIZE,
     )
 
 
@@ -231,6 +302,12 @@ async def embed(req: EmbeddingRequest) -> EmbeddingResponse:
     if not texts or any(not isinstance(text, str) or not text.strip() for text in texts):
         raise HTTPException(status_code=400, detail="Input must contain non-empty strings")
 
+    if len(texts) > MAX_INPUTS_PER_REQUEST:
+        raise HTTPException(
+            status_code=413,
+            detail=f"Too many inputs. Maximum is {MAX_INPUTS_PER_REQUEST} strings per request.",
+        )
+
     resolved_model_name = (req.model_name or EMBEDDING_MODEL_NAME or "").strip()
     if not resolved_model_name:
         raise HTTPException(
@@ -283,6 +360,9 @@ def configure_runtime(
     embedding_model_name: Optional[str],
     embedding_dir: Optional[str],
     device: Optional[str],
+    max_loaded_models: int,
+    max_inputs_per_request: int,
+    embedding_batch_size: int,
     model_kwargs: dict[str, Any],
     encode_kwargs: dict[str, Any],
 ) -> None:
@@ -291,6 +371,9 @@ def configure_runtime(
     global EMBEDDING_MODEL_NAME
     global EMBEDDING_DIR
     global DEVICE
+    global MAX_LOADED_MODELS
+    global MAX_INPUTS_PER_REQUEST
+    global EMBEDDING_BATCH_SIZE
     global MODEL_KWARGS
     global ENCODE_KWARGS
 
@@ -299,6 +382,12 @@ def configure_runtime(
     EMBEDDING_MODEL_NAME = embedding_model_name
     EMBEDDING_DIR = embedding_dir
     DEVICE = device
+    MAX_LOADED_MODELS = _positive_int(max_loaded_models, name="max_loaded_models")
+    MAX_INPUTS_PER_REQUEST = _positive_int(
+        max_inputs_per_request,
+        name="max_inputs_per_request",
+    )
+    EMBEDDING_BATCH_SIZE = _positive_int(embedding_batch_size, name="embedding_batch_size")
     MODEL_KWARGS = model_kwargs
     ENCODE_KWARGS = encode_kwargs
 
@@ -328,6 +417,24 @@ def parse_args(argv: Optional[list[str]] = None) -> argparse.Namespace:
         default=DEVICE,
         help="Torch device passed to HuggingFaceEmbeddings, for example cpu or cuda.",
     )
+    parser.add_argument(
+        "--max-loaded-models",
+        type=int,
+        default=MAX_LOADED_MODELS,
+        help="Maximum number of embedding model instances to keep loaded.",
+    )
+    parser.add_argument(
+        "--max-inputs-per-request",
+        type=int,
+        default=MAX_INPUTS_PER_REQUEST,
+        help="Maximum number of strings accepted in one /embed request.",
+    )
+    parser.add_argument(
+        "--embedding-batch-size",
+        type=int,
+        default=EMBEDDING_BATCH_SIZE,
+        help="Default batch_size passed to the embedding model encoder.",
+    )
     parser.add_argument(
         "--model-kwargs",
         default=json.dumps(MODEL_KWARGS) if MODEL_KWARGS else None,
@@ -355,6 +462,9 @@ def main(argv: Optional[list[str]] = None) -> None:
         embedding_model_name=args.model_name,
         embedding_dir=args.embedding_dir,
         device=args.device,
+        max_loaded_models=args.max_loaded_models,
+        max_inputs_per_request=args.max_inputs_per_request,
+        embedding_batch_size=args.embedding_batch_size,
         model_kwargs=model_kwargs,
         encode_kwargs=encode_kwargs,
     )
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: remote-embedding
-Version: 0.
+Version: 0.3.0
 Summary: A shared FastAPI embedding service and LangChain-compatible remote client for reusing one embedding model across multiple applications and lowering VRAM usage on limited GPUs.
 Author: Meshkat Shariat Bagheri
 License-Expression: MIT
@@ -60,6 +60,9 @@ PowerShell:
 $env:EMBEDDING_MODEL_NAME="BAAI/bge-base-en-v1.5"
 $env:EMBEDDING_DIR="C:\\path\\to\\model-cache"
 $env:DEVICE="cpu"
+$env:MAX_LOADED_MODELS="1"
+$env:MAX_INPUTS_PER_REQUEST="128"
+$env:EMBEDDING_BATCH_SIZE="32"
 $env:MODEL_KWARGS='{"local_files_only": true, "trust_remote_code": true}'
 $env:ENCODE_KWARGS='{"normalize_embeddings": true}'
 ```
@@ -70,6 +73,9 @@ Bash:
 export EMBEDDING_MODEL_NAME=BAAI/bge-base-en-v1.5
 export EMBEDDING_DIR=/path/to/model-cache
 export DEVICE=cpu
+export MAX_LOADED_MODELS=1
+export MAX_INPUTS_PER_REQUEST=128
+export EMBEDDING_BATCH_SIZE=32
 export MODEL_KWARGS='{"local_files_only": true, "trust_remote_code": true}'
 export ENCODE_KWARGS='{"normalize_embeddings": true}'
 ```
@@ -83,6 +89,9 @@ remote-embedding-server \
 --model-name BAAI/bge-base-en-v1.5 \
 --embedding-dir /path/to/model-cache \
 --device cuda \
+--max-loaded-models 1 \
+--max-inputs-per-request 128 \
+--embedding-batch-size 32 \
 --model-kwargs '{"local_files_only": true, "trust_remote_code": true}' \
 --encode-kwargs '{"normalize_embeddings": true}'
 ```
@@ -115,6 +124,9 @@ Server configuration:
 - `EMBEDDING_MODEL_NAME`: default model to preload and use when a request does not pass `model_name`
 - `EMBEDDING_DIR`: optional local cache/model directory for Hugging Face downloads or local files
 - `DEVICE`: device passed to `HuggingFaceEmbeddings`, such as `cpu` or `cuda`
+- `MAX_LOADED_MODELS`: maximum number of embedding model instances kept in memory, default `1`
+- `MAX_INPUTS_PER_REQUEST`: maximum number of strings accepted in one `/embed` request, default `128`
+- `EMBEDDING_BATCH_SIZE`: default encoder `batch_size`, default `32`
 - `MODEL_KWARGS`: JSON object merged into `HuggingFaceEmbeddings(..., model_kwargs=...)`
 - `ENCODE_KWARGS`: JSON object passed to `HuggingFaceEmbeddings(..., encode_kwargs=...)`
 
@@ -130,7 +142,7 @@ Client configuration through `RemoteEmbeddings(...)`:
 
 If `EMBEDDING_MODEL_NAME` is configured on the server, the server can preload one shared embedding model instance and let multiple applications reuse it. That is what saves VRAM versus loading the same model separately in each application process.
 
-`model_kwargs` and `encode_kwargs` become part of the server-side model cache key.
+`model_kwargs` and `encode_kwargs` become part of the server-side model cache key. Different combinations can create different embedding instances. The server evicts older instances once `MAX_LOADED_MODELS` is exceeded, and defaults to keeping one model loaded to protect GPU memory.
 
 ## Use The Client
 
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
{remote_embedding-0.2.1 → remote_embedding-0.3.0}/src/remote_embedding.egg-info/dependency_links.txt
RENAMED
|
File without changes
|
{remote_embedding-0.2.1 → remote_embedding-0.3.0}/src/remote_embedding.egg-info/entry_points.txt
RENAMED
|
File without changes
|
{remote_embedding-0.2.1 → remote_embedding-0.3.0}/src/remote_embedding.egg-info/requires.txt
RENAMED
|
File without changes
|
{remote_embedding-0.2.1 → remote_embedding-0.3.0}/src/remote_embedding.egg-info/top_level.txt
RENAMED
|
File without changes
|