remote-embedding 0.2.1__tar.gz → 0.3.0__tar.gz

@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: remote-embedding
- Version: 0.2.1
+ Version: 0.3.0
  Summary: A shared FastAPI embedding service and LangChain-compatible remote client for reusing one embedding model across multiple applications and lowering VRAM usage on limited GPUs.
  Author: Meshkat Shariat Bagheri
  License-Expression: MIT
@@ -60,6 +60,9 @@ PowerShell:
  $env:EMBEDDING_MODEL_NAME="BAAI/bge-base-en-v1.5"
  $env:EMBEDDING_DIR="C:\\path\\to\\model-cache"
  $env:DEVICE="cpu"
+ $env:MAX_LOADED_MODELS="1"
+ $env:MAX_INPUTS_PER_REQUEST="128"
+ $env:EMBEDDING_BATCH_SIZE="32"
  $env:MODEL_KWARGS='{"local_files_only": true, "trust_remote_code": true}'
  $env:ENCODE_KWARGS='{"normalize_embeddings": true}'
  ```
@@ -70,6 +73,9 @@ Bash:
  export EMBEDDING_MODEL_NAME=BAAI/bge-base-en-v1.5
  export EMBEDDING_DIR=/path/to/model-cache
  export DEVICE=cpu
+ export MAX_LOADED_MODELS=1
+ export MAX_INPUTS_PER_REQUEST=128
+ export EMBEDDING_BATCH_SIZE=32
  export MODEL_KWARGS='{"local_files_only": true, "trust_remote_code": true}'
  export ENCODE_KWARGS='{"normalize_embeddings": true}'
  ```
@@ -83,6 +89,9 @@ remote-embedding-server \
  --model-name BAAI/bge-base-en-v1.5 \
  --embedding-dir /path/to/model-cache \
  --device cuda \
+ --max-loaded-models 1 \
+ --max-inputs-per-request 128 \
+ --embedding-batch-size 32 \
  --model-kwargs '{"local_files_only": true, "trust_remote_code": true}' \
  --encode-kwargs '{"normalize_embeddings": true}'
  ```
@@ -115,6 +124,9 @@ Server configuration:
  - `EMBEDDING_MODEL_NAME`: default model to preload and use when a request does not pass `model_name`
  - `EMBEDDING_DIR`: optional local cache/model directory for Hugging Face downloads or local files
  - `DEVICE`: device passed to `HuggingFaceEmbeddings`, such as `cpu` or `cuda`
+ - `MAX_LOADED_MODELS`: maximum number of embedding model instances kept in memory, default `1`
+ - `MAX_INPUTS_PER_REQUEST`: maximum number of strings accepted in one `/embed` request, default `128`
+ - `EMBEDDING_BATCH_SIZE`: default encoder `batch_size`, default `32`
  - `MODEL_KWARGS`: JSON object merged into `HuggingFaceEmbeddings(..., model_kwargs=...)`
  - `ENCODE_KWARGS`: JSON object passed to `HuggingFaceEmbeddings(..., encode_kwargs=...)`
 
@@ -130,7 +142,7 @@ Client configuration through `RemoteEmbeddings(...)`:
 
  If `EMBEDDING_MODEL_NAME` is configured on the server, the server can preload one shared embedding model instance and let multiple applications reuse it. That is what saves VRAM versus loading the same model separately in each application process.

- `model_kwargs` and `encode_kwargs` become part of the server-side model cache key. That means different combinations can create different loaded embedding instances, which is flexible but can reduce the VRAM-sharing benefit if overused.
+ `model_kwargs` and `encode_kwargs` become part of the server-side model cache key. Different combinations can create different embedding instances. The server evicts older instances once `MAX_LOADED_MODELS` is exceeded, and defaults to keeping one model loaded to protect GPU memory.

  ## Use The Client
 
@@ -31,6 +31,9 @@ PowerShell:
  $env:EMBEDDING_MODEL_NAME="BAAI/bge-base-en-v1.5"
  $env:EMBEDDING_DIR="C:\\path\\to\\model-cache"
  $env:DEVICE="cpu"
+ $env:MAX_LOADED_MODELS="1"
+ $env:MAX_INPUTS_PER_REQUEST="128"
+ $env:EMBEDDING_BATCH_SIZE="32"
  $env:MODEL_KWARGS='{"local_files_only": true, "trust_remote_code": true}'
  $env:ENCODE_KWARGS='{"normalize_embeddings": true}'
  ```
@@ -41,6 +44,9 @@ Bash:
  export EMBEDDING_MODEL_NAME=BAAI/bge-base-en-v1.5
  export EMBEDDING_DIR=/path/to/model-cache
  export DEVICE=cpu
+ export MAX_LOADED_MODELS=1
+ export MAX_INPUTS_PER_REQUEST=128
+ export EMBEDDING_BATCH_SIZE=32
  export MODEL_KWARGS='{"local_files_only": true, "trust_remote_code": true}'
  export ENCODE_KWARGS='{"normalize_embeddings": true}'
  ```
@@ -54,6 +60,9 @@ remote-embedding-server \
  --model-name BAAI/bge-base-en-v1.5 \
  --embedding-dir /path/to/model-cache \
  --device cuda \
+ --max-loaded-models 1 \
+ --max-inputs-per-request 128 \
+ --embedding-batch-size 32 \
  --model-kwargs '{"local_files_only": true, "trust_remote_code": true}' \
  --encode-kwargs '{"normalize_embeddings": true}'
  ```
@@ -86,6 +95,9 @@ Server configuration:
  - `EMBEDDING_MODEL_NAME`: default model to preload and use when a request does not pass `model_name`
  - `EMBEDDING_DIR`: optional local cache/model directory for Hugging Face downloads or local files
  - `DEVICE`: device passed to `HuggingFaceEmbeddings`, such as `cpu` or `cuda`
+ - `MAX_LOADED_MODELS`: maximum number of embedding model instances kept in memory, default `1`
+ - `MAX_INPUTS_PER_REQUEST`: maximum number of strings accepted in one `/embed` request, default `128`
+ - `EMBEDDING_BATCH_SIZE`: default encoder `batch_size`, default `32`
  - `MODEL_KWARGS`: JSON object merged into `HuggingFaceEmbeddings(..., model_kwargs=...)`
  - `ENCODE_KWARGS`: JSON object passed to `HuggingFaceEmbeddings(..., encode_kwargs=...)`
 
@@ -101,7 +113,7 @@ Client configuration through `RemoteEmbeddings(...)`:
 
  If `EMBEDDING_MODEL_NAME` is configured on the server, the server can preload one shared embedding model instance and let multiple applications reuse it. That is what saves VRAM versus loading the same model separately in each application process.

- `model_kwargs` and `encode_kwargs` become part of the server-side model cache key. That means different combinations can create different loaded embedding instances, which is flexible but can reduce the VRAM-sharing benefit if overused.
+ `model_kwargs` and `encode_kwargs` become part of the server-side model cache key. Different combinations can create different embedding instances. The server evicts older instances once `MAX_LOADED_MODELS` is exceeded, and defaults to keeping one model loaded to protect GPU memory.

  ## Use The Client
 
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
  [project]
  name = "remote-embedding"
- version = "0.2.1"
+ version = "0.3.0"
  description = "A shared FastAPI embedding service and LangChain-compatible remote client for reusing one embedding model across multiple applications and lowering VRAM usage on limited GPUs."
  readme = "README.md"
  requires-python = ">=3.10"
@@ -0,0 +1,12 @@
+ """Public package exports for remote-embedding."""
+
+ from importlib.metadata import PackageNotFoundError, version
+
+ from .remote import RemoteEmbeddings
+
+ __all__ = ["RemoteEmbeddings"]
+
+ try:
+     __version__ = version("remote-embedding")
+ except PackageNotFoundError:
+     __version__ = "0.0.0"
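The hunk above replaces a hard-coded version string with a lookup against installed distribution metadata. A minimal sketch of the same pattern as a reusable function follows; the name `resolve_version` and its `fallback` parameter are illustrative and do not appear in the package.

```python
from importlib.metadata import PackageNotFoundError, version


def resolve_version(dist_name: str, fallback: str = "0.0.0") -> str:
    """Return the installed distribution's version, or `fallback` when it is not installed.

    Hypothetical helper: the package inlines this try/except at module level instead.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # Typical when running from a source checkout that was never pip-installed.
        return fallback
```

With this approach the version lives only in `pyproject.toml`, so the module attribute and the FastAPI app title cannot drift from the released metadata.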
@@ -2,10 +2,13 @@
 
  import asyncio
  import argparse
+ import gc
  import json
  import logging
  import os
+ from collections import OrderedDict
  from contextlib import asynccontextmanager
+ from importlib.metadata import PackageNotFoundError, version
  from typing import Any, Literal, Optional, Union

  import uvicorn
@@ -17,12 +20,23 @@ from pydantic import BaseModel, Field
  load_dotenv()
  logger = logging.getLogger("remote_embedding.server")

+ try:
+     PACKAGE_VERSION = version("remote-embedding")
+ except PackageNotFoundError:
+     PACKAGE_VERSION = "0.0.0"
+

  def _env_int(name: str, default: int) -> int:
      value = os.getenv(name)
      return int(value) if value else default


+ def _positive_int(value: int, *, name: str) -> int:
+     if value < 1:
+         raise ValueError(f"{name} must be greater than 0.")
+     return value
+
+
  def _parse_json_mapping(value: Optional[str], *, source: str) -> dict[str, Any]:
      if not value:
          return {}
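`_env_int` and `_positive_int` together define how the new limits are read: an unset or empty variable falls back to the default, and any value below 1 is rejected. The sketch below combines both steps in one function for illustration. `env_positive_int` and its `environ` parameter are hypothetical names added here so the behaviour can be exercised without touching the real process environment.

```python
import os
from typing import Mapping, Optional


def env_positive_int(
    name: str,
    default: int,
    environ: Optional[Mapping[str, str]] = None,
) -> int:
    """Read an integer setting, falling back to `default`, and require it to be >= 1."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    # Mirrors the diff: unset and empty strings both mean "use the default".
    value = int(raw) if raw else default
    if value < 1:
        raise ValueError(f"{name} must be greater than 0.")
    return value
```

Note that a value such as `"0"` is a non-empty string, so it is parsed and then rejected rather than silently replaced by the default.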
@@ -51,6 +65,15 @@ PORT = _env_int("PORT", 5055)
  EMBEDDING_MODEL_NAME = os.getenv("EMBEDDING_MODEL_NAME")
  EMBEDDING_DIR = os.getenv("EMBEDDING_DIR")
  DEVICE = os.getenv("DEVICE")
+ MAX_LOADED_MODELS = _positive_int(_env_int("MAX_LOADED_MODELS", 1), name="MAX_LOADED_MODELS")
+ MAX_INPUTS_PER_REQUEST = _positive_int(
+     _env_int("MAX_INPUTS_PER_REQUEST", 128),
+     name="MAX_INPUTS_PER_REQUEST",
+ )
+ EMBEDDING_BATCH_SIZE = _positive_int(
+     _env_int("EMBEDDING_BATCH_SIZE", 32),
+     name="EMBEDDING_BATCH_SIZE",
+ )
  MODEL_KWARGS = _parse_json_mapping(os.getenv("MODEL_KWARGS"), source="MODEL_KWARGS")
  ENCODE_KWARGS = _parse_json_mapping(os.getenv("ENCODE_KWARGS"), source="ENCODE_KWARGS")
 
@@ -76,11 +99,15 @@ class HealthResponse(BaseModel):
      status: str
      model: str
      device: Optional[str]
+     loaded_models: int
+     max_loaded_models: int
+     max_inputs_per_request: int
+     embedding_batch_size: int


  class EmbeddingService:
      def __init__(self) -> None:
-         self.embed_models: dict[str, HuggingFaceEmbeddings] = {}
+         self.embed_models: OrderedDict[str, HuggingFaceEmbeddings] = OrderedDict()
          self.lock = asyncio.Lock()

      def _resolve_model_name(self, model_name: Optional[str] = None) -> str:
@@ -109,6 +136,37 @@ class EmbeddingService:
              separators=(",", ":"),
          )

+     def _clear_cuda_cache(self) -> None:
+         try:
+             import torch
+         except ImportError:
+             return
+
+         if torch.cuda.is_available():
+             torch.cuda.empty_cache()
+
+     def _release_model(self, embed_model: HuggingFaceEmbeddings) -> None:
+         client = getattr(embed_model, "client", None)
+         if client is not None and hasattr(client, "to"):
+             try:
+                 client.to("cpu")
+             except Exception:
+                 logger.debug("Failed to move evicted embedding model to CPU.", exc_info=True)
+
+         del embed_model
+         gc.collect()
+         self._clear_cuda_cache()
+
+     def _evict_extra_models(self) -> None:
+         while len(self.embed_models) > MAX_LOADED_MODELS:
+             _, evicted_model = self.embed_models.popitem(last=False)
+             logger.info(
+                 "Evicting embedding model from cache. Loaded models now: %s/%s.",
+                 len(self.embed_models),
+                 MAX_LOADED_MODELS,
+             )
+             self._release_model(evicted_model)
+
      def load(
          self,
          model_name: Optional[str] = None,
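The eviction added in this hunk, together with the `move_to_end` call on cache hits elsewhere in the diff, amounts to a least-recently-used cache built on `OrderedDict`. The following is a standalone sketch of that technique, not the package's code: `LRUModelCache`, `get_or_load`, and `on_evict` are illustrative names, and plain strings stand in for model instances.

```python
from collections import OrderedDict
from typing import Callable, Generic, Hashable, List, Optional, TypeVar

V = TypeVar("V")


class LRUModelCache(Generic[V]):
    """Keep at most `max_items` values, evicting the least recently used first."""

    def __init__(self, max_items: int, on_evict: Optional[Callable[[V], None]] = None) -> None:
        if max_items < 1:
            raise ValueError("max_items must be greater than 0.")
        self.max_items = max_items
        self._items: "OrderedDict[Hashable, V]" = OrderedDict()
        self._on_evict = on_evict

    def get_or_load(self, key: Hashable, loader: Callable[[], V]) -> V:
        if key in self._items:
            # A hit marks the entry as most recently used.
            self._items.move_to_end(key)
            return self._items[key]
        value = loader()
        self._items[key] = value
        # popitem(last=False) removes the oldest entry, i.e. the least recently used.
        while len(self._items) > self.max_items:
            _, evicted = self._items.popitem(last=False)
            if self._on_evict is not None:
                self._on_evict(evicted)
        return value

    def keys(self) -> List[Hashable]:
        return list(self._items)
```

In the server, the role of `on_evict` is played by `_release_model`, which tries to move the evicted model to CPU, runs `gc.collect()`, and empties the CUDA cache when `torch` is importable.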
@@ -127,7 +185,11 @@ class EmbeddingService:
              MODEL_KWARGS,
              model_kwargs,
          )
-         resolved_encode_kwargs = _merge_mappings(ENCODE_KWARGS, encode_kwargs)
+         resolved_encode_kwargs = _merge_mappings(
+             {"batch_size": EMBEDDING_BATCH_SIZE},
+             ENCODE_KWARGS,
+             encode_kwargs,
+         )
          cache_key = self._cache_key(
              resolved_model_name,
              resolved_embedding_dir,
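The body of `_merge_mappings` is not part of this diff. Assuming it applies its arguments left to right so that later mappings override earlier ones, the new call makes `EMBEDDING_BATCH_SIZE` the lowest-priority default, overridden by the server-level `ENCODE_KWARGS` and then by per-request `encode_kwargs`. A sketch of that assumed behaviour, using the hypothetical stand-in `merge_mappings`:

```python
from typing import Any, Mapping, Optional


def merge_mappings(*mappings: Optional[Mapping[str, Any]]) -> dict[str, Any]:
    """Shallow-merge mappings left to right; later keys win. None entries are skipped.

    Hypothetical stand-in for the package's `_merge_mappings`, whose real
    implementation is not shown in the diff.
    """
    merged: dict[str, Any] = {}
    for mapping in mappings:
        if mapping:
            merged.update(mapping)
    return merged


# Default batch size, then server-level settings, then the request's overrides.
resolved = merge_mappings(
    {"batch_size": 32},
    {"normalize_embeddings": True},
    {"batch_size": 8},
)
```

Because the resolved mapping feeds the cache key, a request that overrides `batch_size` would, under this assumption, select a different cache entry than one that uses the default.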
@@ -135,6 +197,7 @@ class EmbeddingService:
              resolved_encode_kwargs,
          )
          if cache_key in self.embed_models:
+             self.embed_models.move_to_end(cache_key)
              return self.embed_models[cache_key]

          logger.info(
@@ -149,6 +212,12 @@ class EmbeddingService:
              cache_folder=resolved_embedding_dir,
          )
          self.embed_models[cache_key] = embed_model
+         logger.info(
+             "Loaded embedding models: %s/%s.",
+             len(self.embed_models),
+             MAX_LOADED_MODELS,
+         )
+         self._evict_extra_models()
          return embed_model

      async def embed_documents(
@@ -159,15 +228,14 @@ class EmbeddingService:
          model_kwargs: Optional[dict[str, Any]] = None,
          encode_kwargs: Optional[dict[str, Any]] = None,
      ) -> list[list[float]]:
-         embed_model = self.load(
-             model_name,
-             embedding_dir=embedding_dir,
-             model_kwargs=model_kwargs,
-             encode_kwargs=encode_kwargs,
-         )
-
-         # Serialize GPU access to avoid VRAM spikes from concurrent requests.
+         # Serialize model loading and GPU access to avoid duplicate loads and VRAM spikes.
          async with self.lock:
+             embed_model = self.load(
+                 model_name,
+                 embedding_dir=embedding_dir,
+                 model_kwargs=model_kwargs,
+                 encode_kwargs=encode_kwargs,
+             )
              return await asyncio.to_thread(embed_model.embed_documents, texts)

      async def embed_query(
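This hunk moves `self.load(...)` inside `async with self.lock`, so the cache lookup, any load or eviction, and the encode call in the worker thread all happen under the same lock. The toy sketch below illustrates the pattern with strings in place of models. `SerializedEmbedder` and its counters are hypothetical and exist only to make the serialization observable.

```python
import asyncio
import time


class SerializedEmbedder:
    """Toy service: load-or-reuse and encode happen under a single asyncio.Lock."""

    def __init__(self) -> None:
        self.lock = asyncio.Lock()
        self.models: dict[str, str] = {}
        self.load_count = 0
        self.in_flight = 0
        self.max_in_flight = 0

    def _load(self, name: str) -> str:
        if name not in self.models:
            self.load_count += 1
            self.models[name] = f"model<{name}>"
        return self.models[name]

    def _encode(self, model: str, text: str) -> str:
        # Runs in a worker thread; the counters show how many encodes overlap.
        self.in_flight += 1
        self.max_in_flight = max(self.max_in_flight, self.in_flight)
        time.sleep(0.01)  # stand-in for GPU work
        self.in_flight -= 1
        return f"{model}:{text}"

    async def embed(self, name: str, text: str) -> str:
        async with self.lock:
            model = self._load(name)
            return await asyncio.to_thread(self._encode, model, text)


async def demo() -> tuple[SerializedEmbedder, list[str]]:
    svc = SerializedEmbedder()
    results = await asyncio.gather(*(svc.embed("m", f"t{i}") for i in range(4)))
    return svc, list(results)
```

Holding the lock across both steps also means an eviction triggered by one request cannot run while another request is still encoding with the model being evicted, at the cost of processing requests strictly one at a time.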
@@ -178,14 +246,13 @@ class EmbeddingService:
          model_kwargs: Optional[dict[str, Any]] = None,
          encode_kwargs: Optional[dict[str, Any]] = None,
      ) -> list[float]:
-         embed_model = self.load(
-             model_name,
-             embedding_dir=embedding_dir,
-             model_kwargs=model_kwargs,
-             encode_kwargs=encode_kwargs,
-         )
-
          async with self.lock:
+             embed_model = self.load(
+                 model_name,
+                 embedding_dir=embedding_dir,
+                 model_kwargs=model_kwargs,
+                 encode_kwargs=encode_kwargs,
+             )
              return await asyncio.to_thread(embed_model.embed_query, text)
 
 
@@ -199,7 +266,7 @@ async def lifespan(_: FastAPI):
      yield


- app = FastAPI(title="Shared Embedding Service", version="0.2.1", lifespan=lifespan)
+ app = FastAPI(title="Shared Embedding Service", version=PACKAGE_VERSION, lifespan=lifespan)


  @app.get("/health", response_model=HealthResponse)
@@ -221,6 +288,10 @@ async def health() -> HealthResponse:
          status="ok",
          model=loaded_model_name,
          device=DEVICE,
+         loaded_models=len(svc.embed_models),
+         max_loaded_models=MAX_LOADED_MODELS,
+         max_inputs_per_request=MAX_INPUTS_PER_REQUEST,
+         embedding_batch_size=EMBEDDING_BATCH_SIZE,
      )
 
 
@@ -231,6 +302,12 @@ async def embed(req: EmbeddingRequest) -> EmbeddingResponse:
      if not texts or any(not isinstance(text, str) or not text.strip() for text in texts):
          raise HTTPException(status_code=400, detail="Input must contain non-empty strings")

+     if len(texts) > MAX_INPUTS_PER_REQUEST:
+         raise HTTPException(
+             status_code=413,
+             detail=f"Too many inputs. Maximum is {MAX_INPUTS_PER_REQUEST} strings per request.",
+         )
+
      resolved_model_name = (req.model_name or EMBEDDING_MODEL_NAME or "").strip()
      if not resolved_model_name:
          raise HTTPException(
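With this check, `/embed` answers HTTP 413 when a request carries more than `MAX_INPUTS_PER_REQUEST` strings. The diff does not show whether `RemoteEmbeddings` splits large inputs on the client side, so callers that post directly may need to chunk their inputs themselves. A minimal sketch of such a helper follows; `chunked` is a hypothetical name, and 128 is used only because it is the server's documented default.

```python
from typing import Iterator, Sequence


def chunked(texts: Sequence[str], max_inputs: int = 128) -> Iterator[list[str]]:
    """Yield consecutive slices of `texts`, each holding at most `max_inputs` items."""
    if max_inputs < 1:
        raise ValueError("max_inputs must be greater than 0.")
    for start in range(0, len(texts), max_inputs):
        yield list(texts[start:start + max_inputs])
```

Each yielded chunk can then be sent as its own request, and the resulting embedding lists concatenated in order. The server's current limit is reported as `max_inputs_per_request` in the `/health` response.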
@@ -283,6 +360,9 @@ def configure_runtime(
      embedding_model_name: Optional[str],
      embedding_dir: Optional[str],
      device: Optional[str],
+     max_loaded_models: int,
+     max_inputs_per_request: int,
+     embedding_batch_size: int,
      model_kwargs: dict[str, Any],
      encode_kwargs: dict[str, Any],
  ) -> None:
@@ -291,6 +371,9 @@ def configure_runtime(
      global EMBEDDING_MODEL_NAME
      global EMBEDDING_DIR
      global DEVICE
+     global MAX_LOADED_MODELS
+     global MAX_INPUTS_PER_REQUEST
+     global EMBEDDING_BATCH_SIZE
      global MODEL_KWARGS
      global ENCODE_KWARGS
 
@@ -299,6 +382,12 @@ def configure_runtime(
      EMBEDDING_MODEL_NAME = embedding_model_name
      EMBEDDING_DIR = embedding_dir
      DEVICE = device
+     MAX_LOADED_MODELS = _positive_int(max_loaded_models, name="max_loaded_models")
+     MAX_INPUTS_PER_REQUEST = _positive_int(
+         max_inputs_per_request,
+         name="max_inputs_per_request",
+     )
+     EMBEDDING_BATCH_SIZE = _positive_int(embedding_batch_size, name="embedding_batch_size")
      MODEL_KWARGS = model_kwargs
      ENCODE_KWARGS = encode_kwargs
 
@@ -328,6 +417,24 @@ def parse_args(argv: Optional[list[str]] = None) -> argparse.Namespace:
          default=DEVICE,
          help="Torch device passed to HuggingFaceEmbeddings, for example cpu or cuda.",
      )
+     parser.add_argument(
+         "--max-loaded-models",
+         type=int,
+         default=MAX_LOADED_MODELS,
+         help="Maximum number of embedding model instances to keep loaded.",
+     )
+     parser.add_argument(
+         "--max-inputs-per-request",
+         type=int,
+         default=MAX_INPUTS_PER_REQUEST,
+         help="Maximum number of strings accepted in one /embed request.",
+     )
+     parser.add_argument(
+         "--embedding-batch-size",
+         type=int,
+         default=EMBEDDING_BATCH_SIZE,
+         help="Default batch_size passed to the embedding model encoder.",
+     )
      parser.add_argument(
          "--model-kwargs",
          default=json.dumps(MODEL_KWARGS) if MODEL_KWARGS else None,
@@ -355,6 +462,9 @@ def main(argv: Optional[list[str]] = None) -> None:
          embedding_model_name=args.model_name,
          embedding_dir=args.embedding_dir,
          device=args.device,
+         max_loaded_models=args.max_loaded_models,
+         max_inputs_per_request=args.max_inputs_per_request,
+         embedding_batch_size=args.embedding_batch_size,
          model_kwargs=model_kwargs,
          encode_kwargs=encode_kwargs,
      )
@@ -1,6 +0,0 @@
- """Public package exports for remote-embedding."""
-
- from .remote import RemoteEmbeddings
-
- __all__ = ["RemoteEmbeddings"]
- __version__ = "0.2.1"