PyPI - remote-embedding - Versions diffs - 0.2.1__tar.gz → 0.3.0__tar.gz - Mend

remote-embedding 0.2.1 (tar.gz) → 0.3.0 (tar.gz)

Files changed (16)

{remote_embedding-0.2.1/src/remote_embedding.egg-info → remote_embedding-0.3.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: remote-embedding
-Version: 0.2.1
+Version: 0.3.0
 Summary: A shared FastAPI embedding service and LangChain-compatible remote client for reusing one embedding model across multiple applications and lowering VRAM usage on limited GPUs.
 Author: Meshkat Shariat Bagheri
 License-Expression: MIT
@@ -60,6 +60,9 @@ PowerShell:
 $env:EMBEDDING_MODEL_NAME="BAAI/bge-base-en-v1.5"
 $env:EMBEDDING_DIR="C:\\path\\to\\model-cache"
 $env:DEVICE="cpu"
+$env:MAX_LOADED_MODELS="1"
+$env:MAX_INPUTS_PER_REQUEST="128"
+$env:EMBEDDING_BATCH_SIZE="32"
 $env:MODEL_KWARGS='{"local_files_only": true, "trust_remote_code": true}'
 $env:ENCODE_KWARGS='{"normalize_embeddings": true}'
 ```
@@ -70,6 +73,9 @@ Bash:
 export EMBEDDING_MODEL_NAME=BAAI/bge-base-en-v1.5
 export EMBEDDING_DIR=/path/to/model-cache
 export DEVICE=cpu
+export MAX_LOADED_MODELS=1
+export MAX_INPUTS_PER_REQUEST=128
+export EMBEDDING_BATCH_SIZE=32
 export MODEL_KWARGS='{"local_files_only": true, "trust_remote_code": true}'
 export ENCODE_KWARGS='{"normalize_embeddings": true}'
 ```
@@ -83,6 +89,9 @@ remote-embedding-server \
   --model-name BAAI/bge-base-en-v1.5 \
   --embedding-dir /path/to/model-cache \
   --device cuda \
+  --max-loaded-models 1 \
+  --max-inputs-per-request 128 \
+  --embedding-batch-size 32 \
   --model-kwargs '{"local_files_only": true, "trust_remote_code": true}' \
   --encode-kwargs '{"normalize_embeddings": true}'
 ```
@@ -115,6 +124,9 @@ Server configuration:
 - `EMBEDDING_MODEL_NAME`: default model to preload and use when a request does not pass `model_name`
 - `EMBEDDING_DIR`: optional local cache/model directory for Hugging Face downloads or local files
 - `DEVICE`: device passed to `HuggingFaceEmbeddings`, such as `cpu` or `cuda`
+- `MAX_LOADED_MODELS`: maximum number of embedding model instances kept in memory, default `1`
+- `MAX_INPUTS_PER_REQUEST`: maximum number of strings accepted in one `/embed` request, default `128`
+- `EMBEDDING_BATCH_SIZE`: default encoder `batch_size`, default `32`
 - `MODEL_KWARGS`: JSON object merged into `HuggingFaceEmbeddings(..., model_kwargs=...)`
 - `ENCODE_KWARGS`: JSON object passed to `HuggingFaceEmbeddings(..., encode_kwargs=...)`
@@ -130,7 +142,7 @@ Client configuration through `RemoteEmbeddings(...)`:
 If `EMBEDDING_MODEL_NAME` is configured on the server, the server can preload one shared embedding model instance and let multiple applications reuse it. That is what saves VRAM versus loading the same model separately in each application process.
-`model_kwargs` and `encode_kwargs` become part of the server-side model cache key. That means different combinations can create different loaded embedding instances, which is flexible but can reduce the VRAM-sharing benefit if overused.
+`model_kwargs` and `encode_kwargs` become part of the server-side model cache key. Different combinations can create different embedding instances. The server evicts older instances once `MAX_LOADED_MODELS` is exceeded, and defaults to keeping one model loaded to protect GPU memory.
 ## Use The Client
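
Note on `MAX_INPUTS_PER_REQUEST`: from 0.3.0 the server rejects `/embed` requests carrying more strings than this limit (default `128`). This diff does not show whether `RemoteEmbeddings` splits large inputs on its own, so a caller may want to chunk inputs before sending them. The sketch below is one way to do that; `chunk_inputs` is a hypothetical helper, not part of the package, and the constant only mirrors the documented server default.

```python
from typing import Iterator, Sequence

# Assumption: mirrors the documented server default for MAX_INPUTS_PER_REQUEST.
MAX_INPUTS_PER_REQUEST = 128


def chunk_inputs(
    texts: Sequence[str],
    limit: int = MAX_INPUTS_PER_REQUEST,
) -> Iterator[list[str]]:
    """Yield consecutive slices of `texts`, each holding at most `limit` items."""
    if limit < 1:
        raise ValueError("limit must be greater than 0.")
    for start in range(0, len(texts), limit):
        yield list(texts[start:start + limit])


if __name__ == "__main__":
    docs = [f"document {i}" for i in range(300)]
    # Each batch would be sent as its own /embed request.
    print([len(batch) for batch in chunk_inputs(docs)])  # [128, 128, 44]
```

If the server is started with a different `--max-inputs-per-request`, the client-side limit has to be changed to match. The `/health` response in 0.3.0 reports the active value as `max_inputs_per_request`.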

{remote_embedding-0.2.1 → remote_embedding-0.3.0}/README.md RENAMED

@@ -31,6 +31,9 @@ PowerShell:
 $env:EMBEDDING_MODEL_NAME="BAAI/bge-base-en-v1.5"
 $env:EMBEDDING_DIR="C:\\path\\to\\model-cache"
 $env:DEVICE="cpu"
+$env:MAX_LOADED_MODELS="1"
+$env:MAX_INPUTS_PER_REQUEST="128"
+$env:EMBEDDING_BATCH_SIZE="32"
 $env:MODEL_KWARGS='{"local_files_only": true, "trust_remote_code": true}'
 $env:ENCODE_KWARGS='{"normalize_embeddings": true}'
 ```
@@ -41,6 +44,9 @@ Bash:
 export EMBEDDING_MODEL_NAME=BAAI/bge-base-en-v1.5
 export EMBEDDING_DIR=/path/to/model-cache
 export DEVICE=cpu
+export MAX_LOADED_MODELS=1
+export MAX_INPUTS_PER_REQUEST=128
+export EMBEDDING_BATCH_SIZE=32
 export MODEL_KWARGS='{"local_files_only": true, "trust_remote_code": true}'
 export ENCODE_KWARGS='{"normalize_embeddings": true}'
 ```
@@ -54,6 +60,9 @@ remote-embedding-server \
   --model-name BAAI/bge-base-en-v1.5 \
   --embedding-dir /path/to/model-cache \
   --device cuda \
+  --max-loaded-models 1 \
+  --max-inputs-per-request 128 \
+  --embedding-batch-size 32 \
   --model-kwargs '{"local_files_only": true, "trust_remote_code": true}' \
   --encode-kwargs '{"normalize_embeddings": true}'
 ```
@@ -86,6 +95,9 @@ Server configuration:
 - `EMBEDDING_MODEL_NAME`: default model to preload and use when a request does not pass `model_name`
 - `EMBEDDING_DIR`: optional local cache/model directory for Hugging Face downloads or local files
 - `DEVICE`: device passed to `HuggingFaceEmbeddings`, such as `cpu` or `cuda`
+- `MAX_LOADED_MODELS`: maximum number of embedding model instances kept in memory, default `1`
+- `MAX_INPUTS_PER_REQUEST`: maximum number of strings accepted in one `/embed` request, default `128`
+- `EMBEDDING_BATCH_SIZE`: default encoder `batch_size`, default `32`
 - `MODEL_KWARGS`: JSON object merged into `HuggingFaceEmbeddings(..., model_kwargs=...)`
 - `ENCODE_KWARGS`: JSON object passed to `HuggingFaceEmbeddings(..., encode_kwargs=...)`
@@ -101,7 +113,7 @@ Client configuration through `RemoteEmbeddings(...)`:
 If `EMBEDDING_MODEL_NAME` is configured on the server, the server can preload one shared embedding model instance and let multiple applications reuse it. That is what saves VRAM versus loading the same model separately in each application process.
-`model_kwargs` and `encode_kwargs` become part of the server-side model cache key. That means different combinations can create different loaded embedding instances, which is flexible but can reduce the VRAM-sharing benefit if overused.
+`model_kwargs` and `encode_kwargs` become part of the server-side model cache key. Different combinations can create different embedding instances. The server evicts older instances once `MAX_LOADED_MODELS` is exceeded, and defaults to keeping one model loaded to protect GPU memory.
 ## Use The Client
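
Note on eviction: the README change above says older instances are evicted once `MAX_LOADED_MODELS` is exceeded. A minimal sketch of that least-recently-used policy, assuming an `OrderedDict` keyed by cache key, follows. `LRUCache` and `get_or_load` are illustrative names, and plain strings stand in for loaded models.

```python
from collections import OrderedDict
from typing import Callable, Generic, Hashable, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


class LRUCache(Generic[K, V]):
    """Keep at most `max_items` values; evict the least recently used first."""

    def __init__(self, max_items: int) -> None:
        if max_items < 1:
            raise ValueError("max_items must be greater than 0.")
        self.max_items = max_items
        self.items: OrderedDict[K, V] = OrderedDict()

    def get_or_load(self, key: K, loader: Callable[[], V]) -> V:
        if key in self.items:
            self.items.move_to_end(key)  # a hit marks the entry as most recent
            return self.items[key]
        value = loader()
        self.items[key] = value
        while len(self.items) > self.max_items:
            self.items.popitem(last=False)  # drop the oldest entry
        return value


cache: LRUCache[str, str] = LRUCache(max_items=2)
cache.get_or_load("a", lambda: "model-a")
cache.get_or_load("b", lambda: "model-b")
cache.get_or_load("a", lambda: "model-a")  # hit: "a" becomes most recent
cache.get_or_load("c", lambda: "model-c")  # overflow: evicts "b"
print(list(cache.items))  # ['a', 'c']
```

In this sketch, as in the `load` method shown in the `app.py` diff, the new value is inserted before the oldest one is dropped. With real models that implies both can be resident for a short time during a swap, even when the limit is `1`.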

{remote_embedding-0.2.1 → remote_embedding-0.3.0}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "remote-embedding"
-version = "0.2.1"
+version = "0.3.0"
 description = "A shared FastAPI embedding service and LangChain-compatible remote client for reusing one embedding model across multiple applications and lowering VRAM usage on limited GPUs."
 readme = "README.md"
 requires-python = ">=3.10"

remote_embedding-0.3.0/src/remote_embedding/__init__.py ADDED

@@ -0,0 +1,12 @@
+"""Public package exports for remote-embedding."""
+from importlib.metadata import PackageNotFoundError, version
+from .remote import RemoteEmbeddings
+__all__ = ["RemoteEmbeddings"]
+try:
+    __version__ = version("remote-embedding")
+except PackageNotFoundError:
+    __version__ = "0.0.0"
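
Note on the version lookup: `__version__` is no longer hard-coded and is read from the installed distribution metadata, with `"0.0.0"` as the fallback when the distribution is not installed (for example, when running from a source checkout). The same pattern as a reusable function, where `resolve_version` is a hypothetical name used only for illustration:

```python
from importlib.metadata import PackageNotFoundError, version


def resolve_version(dist_name: str, default: str = "0.0.0") -> str:
    """Return the installed distribution version, or `default` if not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return default


# A distribution name that is assumed not to be installed falls back to the default.
print(resolve_version("surely-not-an-installed-distribution-name"))  # 0.0.0
```

The practical effect is that `pyproject.toml` becomes the only place where the version number is written.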

{remote_embedding-0.2.1 → remote_embedding-0.3.0}/src/remote_embedding/app.py RENAMED

@@ -2,10 +2,13 @@
 import asyncio
 import argparse
+import gc
 import json
 import logging
 import os
+from collections import OrderedDict
 from contextlib import asynccontextmanager
+from importlib.metadata import PackageNotFoundError, version
 from typing import Any, Literal, Optional, Union
 import uvicorn
@@ -17,12 +20,23 @@ from pydantic import BaseModel, Field
 load_dotenv()
 logger = logging.getLogger("remote_embedding.server")
+try:
+    PACKAGE_VERSION = version("remote-embedding")
+except PackageNotFoundError:
+    PACKAGE_VERSION = "0.0.0"
 def _env_int(name: str, default: int) -> int:
     value = os.getenv(name)
     return int(value) if value else default
+def _positive_int(value: int, *, name: str) -> int:
+    if value < 1:
+        raise ValueError(f"{name} must be greater than 0.")
+    return value
 def _parse_json_mapping(value: Optional[str], *, source: str) -> dict[str, Any]:
     if not value:
         return {}
@@ -51,6 +65,15 @@ PORT = _env_int("PORT", 5055)
 EMBEDDING_MODEL_NAME = os.getenv("EMBEDDING_MODEL_NAME")
 EMBEDDING_DIR = os.getenv("EMBEDDING_DIR")
 DEVICE = os.getenv("DEVICE")
+MAX_LOADED_MODELS = _positive_int(_env_int("MAX_LOADED_MODELS", 1), name="MAX_LOADED_MODELS")
+MAX_INPUTS_PER_REQUEST = _positive_int(
+    _env_int("MAX_INPUTS_PER_REQUEST", 128),
+    name="MAX_INPUTS_PER_REQUEST",
+)
+EMBEDDING_BATCH_SIZE = _positive_int(
+    _env_int("EMBEDDING_BATCH_SIZE", 32),
+    name="EMBEDDING_BATCH_SIZE",
+)
 MODEL_KWARGS = _parse_json_mapping(os.getenv("MODEL_KWARGS"), source="MODEL_KWARGS")
 ENCODE_KWARGS = _parse_json_mapping(os.getenv("ENCODE_KWARGS"), source="ENCODE_KWARGS")
@@ -76,11 +99,15 @@ class HealthResponse(BaseModel):
     status: str
     model: str
     device: Optional[str]
+    loaded_models: int
+    max_loaded_models: int
+    max_inputs_per_request: int
+    embedding_batch_size: int
 class EmbeddingService:
     def __init__(self) -> None:
-        self.embed_models: dict[str, HuggingFaceEmbeddings] = {}
+        self.embed_models: OrderedDict[str, HuggingFaceEmbeddings] = OrderedDict()
         self.lock = asyncio.Lock()
     def _resolve_model_name(self, model_name: Optional[str] = None) -> str:
@@ -109,6 +136,37 @@ class EmbeddingService:
             separators=(",", ":"),
         )
+    def _clear_cuda_cache(self) -> None:
+        try:
+            import torch
+        except ImportError:
+            return
+        if torch.cuda.is_available():
+            torch.cuda.empty_cache()
+    def _release_model(self, embed_model: HuggingFaceEmbeddings) -> None:
+        client = getattr(embed_model, "client", None)
+        if client is not None and hasattr(client, "to"):
+            try:
+                client.to("cpu")
+            except Exception:
+                logger.debug("Failed to move evicted embedding model to CPU.", exc_info=True)
+        del embed_model
+        gc.collect()
+        self._clear_cuda_cache()
+    def _evict_extra_models(self) -> None:
+        while len(self.embed_models) > MAX_LOADED_MODELS:
+            _, evicted_model = self.embed_models.popitem(last=False)
+            logger.info(
+                "Evicting embedding model from cache. Loaded models now: %s/%s.",
+                len(self.embed_models),
+                MAX_LOADED_MODELS,
+            )
+            self._release_model(evicted_model)
     def load(
         self,
         model_name: Optional[str] = None,
@@ -127,7 +185,11 @@ class EmbeddingService:
             MODEL_KWARGS,
             model_kwargs,
         )
-        resolved_encode_kwargs = _merge_mappings(ENCODE_KWARGS, encode_kwargs)
+        resolved_encode_kwargs = _merge_mappings(
+            {"batch_size": EMBEDDING_BATCH_SIZE},
+            ENCODE_KWARGS,
+            encode_kwargs,
+        )
         cache_key = self._cache_key(
             resolved_model_name,
             resolved_embedding_dir,
@@ -135,6 +197,7 @@ class EmbeddingService:
             resolved_encode_kwargs,
         )
         if cache_key in self.embed_models:
+            self.embed_models.move_to_end(cache_key)
             return self.embed_models[cache_key]
         logger.info(
@@ -149,6 +212,12 @@ class EmbeddingService:
             cache_folder=resolved_embedding_dir,
         )
         self.embed_models[cache_key] = embed_model
+        logger.info(
+            "Loaded embedding models: %s/%s.",
+            len(self.embed_models),
+            MAX_LOADED_MODELS,
+        )
+        self._evict_extra_models()
         return embed_model
     async def embed_documents(
@@ -159,15 +228,14 @@ class EmbeddingService:
         model_kwargs: Optional[dict[str, Any]] = None,
         encode_kwargs: Optional[dict[str, Any]] = None,
     ) -> list[list[float]]:
-        embed_model = self.load(
-            model_name,
-            embedding_dir=embedding_dir,
-            model_kwargs=model_kwargs,
-            encode_kwargs=encode_kwargs,
-        )
-        # Serialize GPU access to avoid VRAM spikes from concurrent requests.
+        # Serialize model loading and GPU access to avoid duplicate loads and VRAM spikes.
         async with self.lock:
+            embed_model = self.load(
+                model_name,
+                embedding_dir=embedding_dir,
+                model_kwargs=model_kwargs,
+                encode_kwargs=encode_kwargs,
+            )
             return await asyncio.to_thread(embed_model.embed_documents, texts)
     async def embed_query(
@@ -178,14 +246,13 @@ class EmbeddingService:
         model_kwargs: Optional[dict[str, Any]] = None,
         encode_kwargs: Optional[dict[str, Any]] = None,
     ) -> list[float]:
-        embed_model = self.load(
-            model_name,
-            embedding_dir=embedding_dir,
-            model_kwargs=model_kwargs,
-            encode_kwargs=encode_kwargs,
-        )
         async with self.lock:
+            embed_model = self.load(
+                model_name,
+                embedding_dir=embedding_dir,
+                model_kwargs=model_kwargs,
+                encode_kwargs=encode_kwargs,
+            )
             return await asyncio.to_thread(embed_model.embed_query, text)
@@ -199,7 +266,7 @@ async def lifespan(_: FastAPI):
     yield
-app = FastAPI(title="Shared Embedding Service", version="0.2.1", lifespan=lifespan)
+app = FastAPI(title="Shared Embedding Service", version=PACKAGE_VERSION, lifespan=lifespan)
 @app.get("/health", response_model=HealthResponse)
@@ -221,6 +288,10 @@ async def health() -> HealthResponse:
         status="ok",
         model=loaded_model_name,
         device=DEVICE,
+        loaded_models=len(svc.embed_models),
+        max_loaded_models=MAX_LOADED_MODELS,
+        max_inputs_per_request=MAX_INPUTS_PER_REQUEST,
+        embedding_batch_size=EMBEDDING_BATCH_SIZE,
     )
@@ -231,6 +302,12 @@ async def embed(req: EmbeddingRequest) -> EmbeddingResponse:
     if not texts or any(not isinstance(text, str) or not text.strip() for text in texts):
         raise HTTPException(status_code=400, detail="Input must contain non-empty strings")
+    if len(texts) > MAX_INPUTS_PER_REQUEST:
+        raise HTTPException(
+            status_code=413,
+            detail=f"Too many inputs. Maximum is {MAX_INPUTS_PER_REQUEST} strings per request.",
+        )
     resolved_model_name = (req.model_name or EMBEDDING_MODEL_NAME or "").strip()
     if not resolved_model_name:
         raise HTTPException(
@@ -283,6 +360,9 @@ def configure_runtime(
     embedding_model_name: Optional[str],
     embedding_dir: Optional[str],
     device: Optional[str],
+    max_loaded_models: int,
+    max_inputs_per_request: int,
+    embedding_batch_size: int,
     model_kwargs: dict[str, Any],
     encode_kwargs: dict[str, Any],
 ) -> None:
@@ -291,6 +371,9 @@ def configure_runtime(
     global EMBEDDING_MODEL_NAME
     global EMBEDDING_DIR
     global DEVICE
+    global MAX_LOADED_MODELS
+    global MAX_INPUTS_PER_REQUEST
+    global EMBEDDING_BATCH_SIZE
     global MODEL_KWARGS
     global ENCODE_KWARGS
@@ -299,6 +382,12 @@ def configure_runtime(
     EMBEDDING_MODEL_NAME = embedding_model_name
     EMBEDDING_DIR = embedding_dir
     DEVICE = device
+    MAX_LOADED_MODELS = _positive_int(max_loaded_models, name="max_loaded_models")
+    MAX_INPUTS_PER_REQUEST = _positive_int(
+        max_inputs_per_request,
+        name="max_inputs_per_request",
+    )
+    EMBEDDING_BATCH_SIZE = _positive_int(embedding_batch_size, name="embedding_batch_size")
     MODEL_KWARGS = model_kwargs
     ENCODE_KWARGS = encode_kwargs
@@ -328,6 +417,24 @@ def parse_args(argv: Optional[list[str]] = None) -> argparse.Namespace:
         default=DEVICE,
         help="Torch device passed to HuggingFaceEmbeddings, for example cpu or cuda.",
     )
+    parser.add_argument(
+        "--max-loaded-models",
+        type=int,
+        default=MAX_LOADED_MODELS,
+        help="Maximum number of embedding model instances to keep loaded.",
+    )
+    parser.add_argument(
+        "--max-inputs-per-request",
+        type=int,
+        default=MAX_INPUTS_PER_REQUEST,
+        help="Maximum number of strings accepted in one /embed request.",
+    )
+    parser.add_argument(
+        "--embedding-batch-size",
+        type=int,
+        default=EMBEDDING_BATCH_SIZE,
+        help="Default batch_size passed to the embedding model encoder.",
+    )
     parser.add_argument(
         "--model-kwargs",
         default=json.dumps(MODEL_KWARGS) if MODEL_KWARGS else None,
@@ -355,6 +462,9 @@ def main(argv: Optional[list[str]] = None) -> None:
         embedding_model_name=args.model_name,
         embedding_dir=args.embedding_dir,
         device=args.device,
+        max_loaded_models=args.max_loaded_models,
+        max_inputs_per_request=args.max_inputs_per_request,
+        embedding_batch_size=args.embedding_batch_size,
         model_kwargs=model_kwargs,
         encode_kwargs=encode_kwargs,
     )
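
Note on `encode_kwargs` precedence: `load` now builds the resolved encode kwargs from three layers: `{"batch_size": EMBEDDING_BATCH_SIZE}`, then `ENCODE_KWARGS`, then the per-request `encode_kwargs`. The body of `_merge_mappings` is not part of this diff, so the sketch below assumes the usual left-to-right merge in which later mappings override earlier ones. `merge_mappings` is a stand-in written for this note, not the package's function.

```python
from typing import Any, Mapping, Optional


def merge_mappings(*mappings: Optional[Mapping[str, Any]]) -> dict[str, Any]:
    """Merge left to right; later mappings override earlier ones. None is skipped."""
    merged: dict[str, Any] = {}
    for mapping in mappings:
        if mapping:
            merged.update(mapping)
    return merged


server_default = {"batch_size": 32}                    # from EMBEDDING_BATCH_SIZE
server_encode_kwargs = {"normalize_embeddings": True}  # from ENCODE_KWARGS
request_encode_kwargs = {"batch_size": 8}              # per-request override

resolved = merge_mappings(server_default, server_encode_kwargs, request_encode_kwargs)
print(resolved)  # {'batch_size': 8, 'normalize_embeddings': True}
```

Under that assumption, a `batch_size` set in `ENCODE_KWARGS` or in a request takes priority over `EMBEDDING_BATCH_SIZE`. Because the resolved encode kwargs feed into the cache key, a request with a different `batch_size` would map to a different cached instance and, with `MAX_LOADED_MODELS=1`, would displace the one currently loaded.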

{remote_embedding-0.2.1 → remote_embedding-0.3.0/src/remote_embedding.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: remote-embedding
-Version: 0.2.1
+Version: 0.3.0
 Summary: A shared FastAPI embedding service and LangChain-compatible remote client for reusing one embedding model across multiple applications and lowering VRAM usage on limited GPUs.
 Author: Meshkat Shariat Bagheri
 License-Expression: MIT
@@ -60,6 +60,9 @@ PowerShell:
 $env:EMBEDDING_MODEL_NAME="BAAI/bge-base-en-v1.5"
 $env:EMBEDDING_DIR="C:\\path\\to\\model-cache"
 $env:DEVICE="cpu"
+$env:MAX_LOADED_MODELS="1"
+$env:MAX_INPUTS_PER_REQUEST="128"
+$env:EMBEDDING_BATCH_SIZE="32"
 $env:MODEL_KWARGS='{"local_files_only": true, "trust_remote_code": true}'
 $env:ENCODE_KWARGS='{"normalize_embeddings": true}'
 ```
@@ -70,6 +73,9 @@ Bash:
 export EMBEDDING_MODEL_NAME=BAAI/bge-base-en-v1.5
 export EMBEDDING_DIR=/path/to/model-cache
 export DEVICE=cpu
+export MAX_LOADED_MODELS=1
+export MAX_INPUTS_PER_REQUEST=128
+export EMBEDDING_BATCH_SIZE=32
 export MODEL_KWARGS='{"local_files_only": true, "trust_remote_code": true}'
 export ENCODE_KWARGS='{"normalize_embeddings": true}'
 ```
@@ -83,6 +89,9 @@ remote-embedding-server \
   --model-name BAAI/bge-base-en-v1.5 \
   --embedding-dir /path/to/model-cache \
   --device cuda \
+  --max-loaded-models 1 \
+  --max-inputs-per-request 128 \
+  --embedding-batch-size 32 \
   --model-kwargs '{"local_files_only": true, "trust_remote_code": true}' \
   --encode-kwargs '{"normalize_embeddings": true}'
 ```
@@ -115,6 +124,9 @@ Server configuration:
 - `EMBEDDING_MODEL_NAME`: default model to preload and use when a request does not pass `model_name`
 - `EMBEDDING_DIR`: optional local cache/model directory for Hugging Face downloads or local files
 - `DEVICE`: device passed to `HuggingFaceEmbeddings`, such as `cpu` or `cuda`
+- `MAX_LOADED_MODELS`: maximum number of embedding model instances kept in memory, default `1`
+- `MAX_INPUTS_PER_REQUEST`: maximum number of strings accepted in one `/embed` request, default `128`
+- `EMBEDDING_BATCH_SIZE`: default encoder `batch_size`, default `32`
 - `MODEL_KWARGS`: JSON object merged into `HuggingFaceEmbeddings(..., model_kwargs=...)`
 - `ENCODE_KWARGS`: JSON object passed to `HuggingFaceEmbeddings(..., encode_kwargs=...)`
@@ -130,7 +142,7 @@ Client configuration through `RemoteEmbeddings(...)`:
 If `EMBEDDING_MODEL_NAME` is configured on the server, the server can preload one shared embedding model instance and let multiple applications reuse it. That is what saves VRAM versus loading the same model separately in each application process.
-`model_kwargs` and `encode_kwargs` become part of the server-side model cache key. That means different combinations can create different loaded embedding instances, which is flexible but can reduce the VRAM-sharing benefit if overused.
+`model_kwargs` and `encode_kwargs` become part of the server-side model cache key. Different combinations can create different embedding instances. The server evicts older instances once `MAX_LOADED_MODELS` is exceeded, and defaults to keeping one model loaded to protect GPU memory.
 ## Use The Client

remote_embedding-0.2.1/src/remote_embedding/__init__.py DELETED

@@ -1,6 +0,0 @@
-"""Public package exports for remote-embedding."""
-from .remote import RemoteEmbeddings
-__all__ = ["RemoteEmbeddings"]
-__version__ = "0.2.1"