npuserver 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (7)

npuserver-0.1.0/PKG-INFO +113 -0
npuserver-0.1.0/README.md +97 -0
npuserver-0.1.0/pyproject.toml +21 -0
npuserver-0.1.0/src/npuserver/__init__.py +4 -0
npuserver-0.1.0/src/npuserver/__main__.py +69 -0
npuserver-0.1.0/src/npuserver/compile.py +49 -0
npuserver-0.1.0/src/npuserver/server.py +582 -0

npuserver-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,113 @@
+Metadata-Version: 2.4
+Name: npuserver
+Version: 0.1.0
+Summary:
+Author: Durga Sai
+Author-email: dsainvg.20.12.24@kgpian.iitkgp.ac.in
+Requires-Python: >=3.13
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Programming Language :: Python :: 3.14
+Requires-Dist: flask (>=3.0.0)
+Requires-Dist: huggingface-hub (>=0.20.0)
+Requires-Dist: openvino-genai (>=2024.5.0)
+Description-Content-Type: text/markdown
+# npuserver 🚀
+A lightweight, efficient utility library for compiling and preparing Generative AI LLM models for the **Intel NPU (Neural Processing Unit)** using OpenVINO™ GenAI.
+---
+## Features
+- **Intel NPU Optimization:** Fast, local LLM compilation designed for Intel NPU architectures.
+- **Robust Model Fallbacks:** Automated properties configuration with retry logic for prompt lengths.
+- **Hugging Face Hub Integration:** Resolves models from the Hugging Face Hub and caches downloads locally.
+- **Clean API Design:** Import and use directly in any Python environment.
+---
+## Installation
+### From PyPI
+```bash
+pip install npuserver
+```
+### From Source
+1. Clone the repository:
+   ```bash
+   git clone https://github.com/yourusername/npuserver.git
+   cd npuserver
+   ```
+2. Set up a virtual environment and install:
+   ```bash
+   python -m venv .venv
+   # On Windows:
+   .venv\Scripts\activate
+   # On macOS/Linux:
+   source .venv/bin/activate
+   pip install -e .
+   ```
+---
+## Quick Start
+Compile your favorite Hugging Face LLM model for the Intel NPU:
+```python
+from pathlib import Path
+from npuserver import compile_model
+# Path to store compiled model caches
+cache_dir = Path("./npu_cache")
+cache_dir.mkdir(exist_ok=True)
+# Compile a Hugging Face LLM (e.g., Qwen or Phi)
+compile_model(
+    repo_id="Qwen/Qwen2.5-0.5B-Instruct",
+    cache_dir=cache_dir,
+    prompt_len=8192
+)
+```
+---
+## Development
+### Running with Poetry
+This library uses **Poetry** as its package manager:
+```bash
+poetry install
+poetry run python -c "import npuserver; print(npuserver.__all__)"
+```
+### Directory Structure
+```text
+npuserver/
+├── .github/workflows/   # CI/CD & Automated Publishing
+├── src/
+│   └── npuserver/
+│       ├── __init__.py  # Package entry point
+│       ├── __main__.py  # Command-line interface (compile, serve)
+│       ├── compile.py   # Core compilation functions
+│       └── server.py    # HTTP model server
+├── tests/               # Test suites
+├── pyproject.toml       # Modern packaging configuration
+└── requirements.txt     # Standard pip requirements
+```
+---
+## PyPI Automatic Publishing
+The project includes an automated GitHub Actions CI/CD pipeline (`.github/workflows/publish.yml`) that builds and publishes releases securely using **OIDC Trusted Publishing**:
+1. Tag your release:
+   ```bash
+   git tag v0.1.0
+   git push origin v0.1.0
+   ```
+2. The GitHub Action will trigger, build source/wheel distributions, and push to PyPI.

npuserver-0.1.0/README.md ADDED

@@ -0,0 +1,97 @@
+# npuserver 🚀
+A lightweight, efficient utility library for compiling and preparing Generative AI LLM models for the **Intel NPU (Neural Processing Unit)** using OpenVINO™ GenAI.
+---
+## Features
+- **Intel NPU Optimization:** Fast, local LLM compilation designed for Intel NPU architectures.
+- **Robust Model Fallbacks:** Automated properties configuration with retry logic for prompt lengths.
+- **Hugging Face Hub Integration:** Resolves models from the Hugging Face Hub and caches downloads locally.
+- **Clean API Design:** Import and use directly in any Python environment.
+---
+## Installation
+### From PyPI
+```bash
+pip install npuserver
+```
+### From Source
+1. Clone the repository:
+   ```bash
+   git clone https://github.com/yourusername/npuserver.git
+   cd npuserver
+   ```
+2. Set up a virtual environment and install:
+   ```bash
+   python -m venv .venv
+   # On Windows:
+   .venv\Scripts\activate
+   # On macOS/Linux:
+   source .venv/bin/activate
+   pip install -e .
+   ```
+---
+## Quick Start
+Compile your favorite Hugging Face LLM model for the Intel NPU:
+```python
+from pathlib import Path
+from npuserver import compile_model
+# Path to store compiled model caches
+cache_dir = Path("./npu_cache")
+cache_dir.mkdir(exist_ok=True)
+# Compile a Hugging Face LLM (e.g., Qwen or Phi)
+compile_model(
+    repo_id="Qwen/Qwen2.5-0.5B-Instruct",
+    cache_dir=cache_dir,
+    prompt_len=8192
+)
+```
+---
+## Development
+### Running with Poetry
+This library uses **Poetry** as its package manager:
+```bash
+poetry install
+poetry run python -c "import npuserver; print(npuserver.__all__)"
+```
+### Directory Structure
+```text
+npuserver/
+├── .github/workflows/   # CI/CD & Automated Publishing
+├── src/
+│   └── npuserver/
+│       ├── __init__.py  # Package entry point
+│       ├── __main__.py  # Command-line interface (compile, serve)
+│       ├── compile.py   # Core compilation functions
+│       └── server.py    # HTTP model server
+├── tests/               # Test suites
+├── pyproject.toml       # Modern packaging configuration
+└── requirements.txt     # Standard pip requirements
+```
+---
+## PyPI Automatic Publishing
+The project includes an automated GitHub Actions CI/CD pipeline (`.github/workflows/publish.yml`) that builds and publishes releases securely using **OIDC Trusted Publishing**:
+1. Tag your release:
+   ```bash
+   git tag v0.1.0
+   git push origin v0.1.0
+   ```
+2. The GitHub Action will trigger, build source/wheel distributions, and push to PyPI.

npuserver-0.1.0/pyproject.toml ADDED

@@ -0,0 +1,21 @@
+[project]
+name = "npuserver"
+version = "0.1.0"
+description = ""
+authors = [
+    {name = "Durga Sai", email = "dsainvg.20.12.24@kgpian.iitkgp.ac.in"}
+]
+readme = "README.md"
+requires-python = ">=3.13"
+dependencies = [
+    "openvino-genai>=2024.5.0",
+    "huggingface-hub>=0.20.0",
+    "flask>=3.0.0"
+]
+[tool.poetry]
+packages = [{include = "npuserver", from = "src"}]
+[build-system]
+requires = ["poetry-core>=2.0.0,<3.0.0"]
+build-backend = "poetry.core.masonry.api"

npuserver-0.1.0/src/npuserver/__init__.py ADDED

@@ -0,0 +1,4 @@
+from .compile import compile_model
+from .server import run_server, download_and_compile
+__all__ = ["compile_model", "run_server", "download_and_compile"]

npuserver-0.1.0/src/npuserver/__main__.py ADDED

@@ -0,0 +1,69 @@
+import argparse
+import sys
+from pathlib import Path
+from npuserver import run_server, download_and_compile
+def main():
+    parser = argparse.ArgumentParser(
+        prog="npuserver",
+        description="NPU Server CLI: Download, compile, and run OpenVINO GenAI models on NPU."
+    )
+    subparsers = parser.add_subparsers(dest="command", required=True)
+    # Compile command
+    compile_parser = subparsers.add_parser("compile", help="Download and compile a model for NPU.")
+    compile_parser.add_argument("--model", required=True, help="Hugging Face model repository ID.")
+    compile_parser.add_argument("--genai-cache", help="Directory to store compiled NPU execution blobs. Defaults to ~/.cache/npuserver/compiled")
+    compile_parser.add_argument("--hf-cache", help="Directory containing raw downloaded Hugging Face model files. Defaults to ~/.cache/npuserver/hf")
+    compile_parser.add_argument("--max-prompt-len", type=int, default=16384, help="Maximum prompt length for NPU compilation.")
+    compile_parser.add_argument("--cache-mode", default="OPTIMIZE_SPEED", choices=["OPTIMIZE_SPEED", "OPTIMIZE_SIZE"], help="Compilation cache mode.")
+    compile_parser.add_argument("--disable-download", action="store_true", help="Disable downloading model files if they are not already cached locally.")
+    # Serve command
+    serve_parser = subparsers.add_parser("serve", help="Run the dynamic HTTP model server.")
+    serve_parser.add_argument("--genai-cache", help="Directory containing compiled NPU execution blobs. Defaults to ~/.cache/npuserver/compiled")
+    serve_parser.add_argument("--hf-cache", help="Directory containing raw downloaded Hugging Face model files. Defaults to ~/.cache/npuserver/hf")
+    serve_parser.add_argument("--model", help="Optional model ID to pre-load at startup.")
+    serve_parser.add_argument("--port", type=int, default=8080, help="Port to bind the HTTP server to.")
+    serve_parser.add_argument("--host", default="0.0.0.0", help="Host interface to bind the HTTP server to.")
+    serve_parser.add_argument("--disable-download", action="store_true", help="Disable on-demand model downloading during server runtime.")
+    serve_parser.add_argument("--log-file", help="Path to write server execution logs.")
+    serve_parser.add_argument("--prompt-log-file", help="Path to write raw prompt execution logs.")
+    args = parser.parse_args()
+    # Convert paths to Path objects if provided; otherwise let the functions fall back to ~/.cache/npuserver/
+    g_cache = Path(args.genai_cache) if args.genai_cache else None
+    h_cache = Path(args.hf_cache) if args.hf_cache else None
+    if args.command == "compile":
+        print(f"Starting compilation for model: '{args.model}'")
+        try:
+            download_and_compile(
+                model_name=args.model,
+                genai_cache_root=g_cache,
+                hf_hub_cache=h_cache,
+                allow_download=not args.disable_download,
+                max_prompt_len=args.max_prompt_len,
+                cache_mode=args.cache_mode
+            )
+            print("Compilation successful.")
+        except Exception as e:
+            print(f"Error during compilation: {e}", file=sys.stderr)
+            sys.exit(1)
+    elif args.command == "serve":
+        print(f"Starting server on {args.host}:{args.port}")
+        run_server(
+            genai_cache_root=g_cache,
+            hf_hub_cache=h_cache,
+            model_name=args.model,
+            allow_download=not args.disable_download,
+            port=args.port,
+            host=args.host,
+            log_file=Path(args.log_file) if args.log_file else None,
+            prompt_log_file=Path(args.prompt_log_file) if args.prompt_log_file else None
+        )
+if __name__ == "__main__":
+    main()
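
The subcommand layout above can be tried without NPU hardware. The sketch below rebuilds a trimmed copy of the `compile` and `serve` parsers (only a subset of the flags, copied from `__main__.py`) and parses two sample command lines. `build_parser` is a hypothetical helper name used for illustration; it is not part of the package.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Trimmed copy of the npuserver CLI: same subcommands, fewer flags."""
    parser = argparse.ArgumentParser(prog="npuserver")
    sub = parser.add_subparsers(dest="command", required=True)

    compile_p = sub.add_parser("compile")
    compile_p.add_argument("--model", required=True)
    compile_p.add_argument("--max-prompt-len", type=int, default=16384)
    compile_p.add_argument("--disable-download", action="store_true")

    serve_p = sub.add_parser("serve")
    serve_p.add_argument("--model")
    serve_p.add_argument("--port", type=int, default=8080)
    serve_p.add_argument("--host", default="0.0.0.0")
    return parser


parser = build_parser()

compile_args = parser.parse_args(
    ["compile", "--model", "Qwen/Qwen2.5-0.5B-Instruct", "--max-prompt-len", "4096"]
)
serve_args = parser.parse_args(["serve", "--port", "9000"])

# argparse maps dashes to underscores, which is why main() reads args.max_prompt_len.
print(compile_args.command, compile_args.max_prompt_len, compile_args.disable_download)
print(serve_args.command, serve_args.host, serve_args.port, serve_args.model)
```

Because `--host` defaults to `0.0.0.0`, the `serve` subcommand listens on all interfaces unless a narrower address such as `127.0.0.1` is passed.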

npuserver-0.1.0/src/npuserver/compile.py ADDED

@@ -0,0 +1,49 @@
+import time
+from pathlib import Path
+import openvino_genai as ov_genai
+def compile_model(repo_id, cache_dir, prompt_len=8192):
+    # Resolve the model path, preferring files already in the local Hugging Face cache.
+    # snapshot_download with local_files_only=True returns that path without downloading.
+    try:
+        from huggingface_hub import snapshot_download
+        model_path = snapshot_download(repo_id=repo_id, local_files_only=True)
+    except Exception:
+        # Not cached locally yet: retry once, this time allowing a download
+        try:
+            model_path = snapshot_download(repo_id=repo_id)
+        except Exception as e2:
+            print(f"Failed to resolve {repo_id}: {e2}")
+            return
+    print(f"Model Path: {model_path}")
+    # Set properties for NPU compilation
+    # Note: GenAI properties need correct Python types (int for lengths).
+    # Key names follow server.py in this package (CACHE_DIR, MAX_PROMPT_LEN).
+    config = {
+        "MAX_PROMPT_LEN": int(prompt_len),
+        "CACHE_DIR": str(Path(cache_dir) / repo_id.replace("/", "--"))
+    }
+    print(f"Using properties: {config}")
+    start_time = time.time()
+    try:
+        # Initializing the pipeline with "NPU" triggers compilation
+        pipe = ov_genai.LLMPipeline(model_path, "NPU", **config)
+        print(f"Compilation successful for {repo_id}!")
+    except Exception as e:
+        print(f"Compilation failed for {repo_id} with MAX_PROMPT_LEN: {e}")
+        print("Retrying with 'NPU_MAX_prompt_len'...")
+        try:
+            config["NPU_MAX_prompt_len"] = prompt_len
+            del config["MAX_PROMPT_LEN"]
+            pipe = ov_genai.LLMPipeline(model_path, "NPU", **config)
+            print(f"Compilation successful for {repo_id} with NPU_MAX_prompt_len!")
+        except Exception as e2:
+            print(f"Compilation failed again for {repo_id}: {e2}")
+    end_time = time.time()
+    print(f"Time taken: {end_time - start_time:.2f} seconds")
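
`compile_model` retries with a second property name when the first attempt raises. The same idea can be written as a loop over candidate keys. This is a minimal sketch, assuming a stand-in `fake_pipeline` factory in place of `ov_genai.LLMPipeline` so that it runs without OpenVINO. Both `fake_pipeline` and `build_with_fallback` are hypothetical names, and the set of keys the stand-in accepts is an assumption made for the demonstration.

```python
from typing import Any, Callable


def build_with_fallback(
    factory: Callable[..., Any],
    base_config: dict,
    candidate_keys: list,
    value: int,
):
    """Call factory(**config) once per candidate key until one succeeds.

    Returns (created_object, key_that_worked). Raises RuntimeError if every
    candidate fails, chaining the last underlying error.
    """
    last_error = None
    for key in candidate_keys:
        config = {**base_config, key: value}
        try:
            return factory(**config), key
        except Exception as exc:  # broad on purpose, mirroring compile.py
            last_error = exc
    raise RuntimeError(f"all property names failed: {candidate_keys}") from last_error


def fake_pipeline(**config: Any) -> dict:
    """Stand-in constructor (assumption): rejects any key it does not know."""
    allowed = {"CACHE_DIR", "MAX_PROMPT_LEN"}
    unknown = set(config) - allowed
    if unknown:
        raise ValueError(f"unsupported properties: {sorted(unknown)}")
    return dict(config)


pipe, used_key = build_with_fallback(
    fake_pipeline,
    {"CACHE_DIR": "npu_cache/Qwen--Qwen2.5-0.5B-Instruct"},
    ["MAX_prompt_len", "MAX_PROMPT_LEN"],
    8192,
)
print(used_key, pipe["MAX_PROMPT_LEN"])
```

Unlike `compile_model`, which prints the failure and returns `None`, the sketch raises when every candidate fails, so a caller can distinguish success from failure.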

npuserver-0.1.0/src/npuserver/server.py ADDED

@@ -0,0 +1,582 @@
+import gc
+import json
+import time
+import uuid
+import threading
+import logging
+from pathlib import Path
+def get_default_paths():
+    """
+    Returns the default cache and download paths inside ~/.cache/npuserver/
+    """
+    base = Path.home() / ".cache" / "npuserver"
+    return base / "compiled", base / "hf"
+def setup_logging(log_file: str | Path | None = None, prompt_log_file: str | Path | None = None):
+    """
+    Configures logging dynamically. If `log_file` or `prompt_log_file` is provided,
+    creates a FileHandler to write to the requested file path.
+    By default, logs to the console only (logging.StreamHandler writes to stderr), with no file creation.
+    """
+    logger = logging.getLogger("npu_server")
+    prompt_logger = logging.getLogger("prompt_logger")
+    # Avoid adding duplicate handlers if setup_logging is called multiple times
+    logger.handlers.clear()
+    prompt_logger.handlers.clear()
+    logger.setLevel(logging.INFO)
+    prompt_logger.setLevel(logging.INFO)
+    formatter = logging.Formatter('%(asctime)s [%(levelname)s] %(message)s')
+    # Stream Handler for the console
+    sh = logging.StreamHandler()
+    sh.setFormatter(formatter)
+    logger.addHandler(sh)
+    if log_file:
+        fh = logging.FileHandler(log_file)
+        fh.setFormatter(formatter)
+        logger.addHandler(fh)
+    if prompt_log_file:
+        pfh = logging.FileHandler(prompt_log_file)
+        pfh.setFormatter(logging.Formatter('%(asctime)s\n%(message)s\n' + '-'*80 + '\n'))
+        prompt_logger.addHandler(pfh)
+    else:
+        prompt_logger.addHandler(logging.NullHandler())
+    return logger, prompt_logger
+def find_slot(
+    model_name: str,
+    genai_cache_root: str | Path | None = None,
+    hf_hub_cache: str | Path | None = None
+) -> tuple[Path, dict] | None:
+    """
+    Locates a compiled model slot inside the GenAI cache root.
+    If manifest.json exists, reads metadata from it; otherwise dynamically resolves the
+    raw model directory inside hf_hub_cache.
+    """
+    default_genai, default_hf = get_default_paths()
+    g_root = Path(genai_cache_root) if genai_cache_root else default_genai
+    h_cache = Path(hf_hub_cache) if hf_hub_cache else default_hf
+    if g_root.exists():
+        for slot in g_root.iterdir():
+            if slot.is_dir() and (model_name in slot.name or model_name.replace("/", "--") in slot.name):
+                # A slot is valid if it contains *.blob files or the compilation completion flag
+                if any(slot.glob("*.blob")) or (slot / "compiled.ok").exists():
+                    mp = slot / "manifest.json"
+                    if mp.exists():
+                        try:
+                            return slot, json.loads(mp.read_text())
+                        except Exception:
+                            pass
+                    try:
+                        from huggingface_hub import snapshot_download
+                        repo_id = slot.name.replace("--", "/")
+                        model_dir = snapshot_download(repo_id=repo_id, local_files_only=True, cache_dir=str(h_cache))
+                        return slot, {
+                            "model_name": model_name,
+                            "model_dir": model_dir,
+                            "device": "NPU",
+                            "max_prompt_len": 16384,
+                            "cache_mode": "OPTIMIZE_SPEED"
+                        }
+                    except Exception:
+                        pass
+    return None
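
`find_slot` accepts a slot when the requested name is a substring of the directory name. The sketch below copies that test into `loose_match` and compares it with a stricter `exact_match`; both function names and the sample slot names are hypothetical, chosen only to show how two models with a shared prefix can both match.

```python
def loose_match(model_name: str, slot_name: str) -> bool:
    """The substring test used by find_slot."""
    return model_name in slot_name or model_name.replace("/", "--") in slot_name


def exact_match(model_name: str, slot_name: str) -> bool:
    """Stricter alternative: the slot must be named exactly after the model."""
    return slot_name == model_name.replace("/", "--")


# Two hypothetical slot directories whose names share a prefix.
slots = ["Qwen--Qwen2.5-0.5B-Instruct", "Qwen--Qwen2.5-0.5B"]
requested = "Qwen/Qwen2.5-0.5B"

loose_hits = [s for s in slots if loose_match(requested, s)]
exact_hits = [s for s in slots if exact_match(requested, s)]
print(loose_hits)
print(exact_hits)
```

With the substring test, which slot is returned depends on the order in which `Path.iterdir()` yields the directories, and that order is not guaranteed.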
+def download_and_compile(
+    model_name: str,
+    genai_cache_root: str | Path | None = None,
+    hf_hub_cache: str | Path | None = None,
+    allow_download: bool = True,
+    max_prompt_len: int = 16384,
+    cache_mode: str = "OPTIMIZE_SPEED",
+    logger=None
+) -> Path:
+    """
+    Downloads a model from Hugging Face (if not already local and allow_download=True)
+    and compiles it directly to the GenAI cache directory.
+    Returns the path to the compiled slot directory.
+    """
+    log = logger or logging.getLogger("npu_server")
+    default_genai, default_hf = get_default_paths()
+    g_root = Path(genai_cache_root) if genai_cache_root else default_genai
+    h_cache = Path(hf_hub_cache) if hf_hub_cache else default_hf
+    from huggingface_hub import snapshot_download
+    try:
+        log.info(f"Checking for local Hugging Face files for '{model_name}' in '{h_cache}'...")
+        model_dir = snapshot_download(
+            repo_id=model_name,
+            local_files_only=True,
+            cache_dir=str(h_cache)
+        )
+    except Exception as e:
+        if not allow_download:
+            log.error(f"Model '{model_name}' was not found locally in hf_hub_cache and allow_download=False.")
+            raise ValueError(f"Model '{model_name}' not found locally and downloading is disabled: {e}")
+        log.info(f"Local files not found. Fetching online from Hugging Face Hub: {e}")
+        model_dir = snapshot_download(
+            repo_id=model_name,
+            cache_dir=str(h_cache)
+        )
+    # Setup compiled slot directory (stores compiled NPU blob and manifest)
+    safe_name = model_name.replace("/", "--")
+    slot_dir = g_root / safe_name
+    slot_dir.mkdir(parents=True, exist_ok=True)
+    manifest = {
+        "model_name": model_name,
+        "model_dir": str(model_dir),
+        "device": "NPU",
+        "max_prompt_len": max_prompt_len,
+        "cache_mode": cache_mode,
+        "compiled_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
+    }
+    (slot_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
+    log.info(f"Compiling model '{model_name}' on NPU...")
+    try:
+        import openvino_genai as ov_genai
+    except ImportError as e:
+        raise ImportError("[ERROR] openvino-genai not found. Run: pip install openvino-genai") from e
+    pipeline_config = {
+        "CACHE_DIR": str(slot_dir),
+        "CACHE_MODE": cache_mode,
+        "MAX_PROMPT_LEN": int(max_prompt_len),
+    }
+    # Instantiating the pipeline triggers NPU compilation, saving the compiled .blob to slot_dir
+    ov_genai.LLMPipeline(str(model_dir), "NPU", **pipeline_config)
+    (slot_dir / "compiled.ok").touch()
+    log.info(f"Successfully compiled '{model_name}' directly onto the NPU.")
+    return slot_dir
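
The slot directory that `download_and_compile` produces can be reproduced on its own to see the layout the server later reads. This is a sketch only: `write_slot` and `is_compiled` are hypothetical helpers, the `model_dir` value is a placeholder, and the compilation step is skipped entirely.

```python
import json
import tempfile
import time
from pathlib import Path


def write_slot(root: Path, model_name: str, max_prompt_len: int = 16384) -> Path:
    """Create a slot directory with a manifest and a completion marker."""
    slot_dir = root / model_name.replace("/", "--")
    slot_dir.mkdir(parents=True, exist_ok=True)
    manifest = {
        "model_name": model_name,
        "model_dir": "/placeholder/model/dir",  # placeholder, not a real path
        "device": "NPU",
        "max_prompt_len": max_prompt_len,
        "cache_mode": "OPTIMIZE_SPEED",
        "compiled_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    (slot_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    # The real function compiles at this point; the marker is written afterwards.
    (slot_dir / "compiled.ok").touch()
    return slot_dir


def is_compiled(slot_dir: Path) -> bool:
    """Validity test used by find_slot and list_all_models."""
    return any(slot_dir.glob("*.blob")) or (slot_dir / "compiled.ok").exists()


with tempfile.TemporaryDirectory() as tmp:
    slot = write_slot(Path(tmp), "Qwen/Qwen2.5-0.5B-Instruct")
    slot_name = slot.name
    loaded = json.loads((slot / "manifest.json").read_text())
    compiled = is_compiled(slot)

print(slot_name, loaded["max_prompt_len"], compiled)
```

Because the manifest is written before compilation and `compiled.ok` only after it, a slot left behind by a failed compile has a manifest but, unless a `.blob` file was already written, does not pass the validity test.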
+class NPUPipeline:
+    """
+    Wrapper around OpenVINO GenAI LLMPipeline.
+    """
+    def __init__(self, slot_dir: Path, manifest: dict, logger=None):
+        try:
+            import openvino_genai as ov_genai
+        except ImportError as e:
+            raise ImportError("[ERROR] openvino-genai not found. Run: pip install openvino-genai") from e
+        self.logger = logger or logging.getLogger("npu_server")
+        model_dir = Path(manifest["model_dir"])
+        device = manifest.get("device", "NPU")
+        pipeline_config = {
+            "CACHE_DIR": str(slot_dir),
+            "CACHE_MODE": manifest.get("cache_mode", "OPTIMIZE_SPEED"),
+            "MAX_PROMPT_LEN": int(manifest.get("max_prompt_len", 16384)),
+        }
+        self.logger.info("Loading model from compiled blob ...")
+        self.logger.info(f"  device : {device}")
+        self.logger.info(f"  blob   : {slot_dir}")
+        t0 = time.time()
+        self._pipe = ov_genai.LLMPipeline(str(model_dir), device, **pipeline_config)
+        self._genai = ov_genai
+        self.load_ms = round((time.time() - t0) * 1000)
+        self.model_name = manifest["model_name"]
+        self.device = device
+        self.logger.info(f"Pipeline ready in {self.load_ms}ms")
+    def generate(self, prompt: str, max_new_tokens: int = 512, temperature: float = 0.7,
+                 stream_cb=None) -> str:
+        cfg = self._genai.GenerationConfig()
+        cfg.max_new_tokens = max_new_tokens
+        cfg.do_sample = temperature > 0.0
+        if temperature > 0.0:
+            cfg.temperature = temperature
+            cfg.top_p = 0.95
+        if stream_cb:
+            self._pipe.generate(prompt, cfg, stream_cb)
+            return ""
+        return self._pipe.generate(prompt, cfg)
+    def apply_chat_template(self, messages: list[dict]) -> str:
+        return self._pipe.get_tokenizer().apply_chat_template(messages, add_generation_prompt=True)
+class NPUPipelineManager:
+    """
+    Manages active NPU pipeline state, lists models dynamically, and
+    lazy-loads or compiles models on request in a thread-safe, queued manner.
+    """
+    def __init__(
+        self,
+        genai_cache_root: str | Path | None = None,
+        hf_hub_cache: str | Path | None = None,
+        allow_download: bool = True,
+        logger=None
+    ):
+        default_genai, default_hf = get_default_paths()
+        self.genai_cache_root = Path(genai_cache_root) if genai_cache_root else default_genai
+        self.hf_hub_cache = Path(hf_hub_cache) if hf_hub_cache else default_hf
+        self.allow_download = allow_download
+        self.logger = logger or logging.getLogger("npu_server")
+        self.active_pipeline = None
+        self.active_model_name = None
+        self.is_loading = False
+        # Thread condition variable to queue and sequence loading/generation requests
+        self.lock = threading.Lock()
+        self.load_condition = threading.Condition(self.lock)
+    def list_all_models(self) -> list[dict]:
+        """
+        Lists all compiled models inside the GenAI cache and raw HF models
+        available in the Hugging Face hub cache.
+        """
+        models_dict = {}
+        # 1. Identify current active model
+        if self.active_model_name:
+            models_dict[self.active_model_name] = "active"
+        # 2. Check GenAI compiled cache directory
+        if self.genai_cache_root.exists():
+            for slot in self.genai_cache_root.iterdir():
+                if slot.is_dir():
+                    mp = slot / "manifest.json"
+                    ok = slot / "compiled.ok"
+                    model_name = None
+                    if mp.exists():
+                        try:
+                            m = json.loads(mp.read_text())
+                            model_name = m.get("model_name")
+                        except Exception:
+                            pass
+                    if not model_name:
+                        model_name = slot.name.replace("--", "/")
+                    if any(slot.glob("*.blob")) or ok.exists():
+                        if model_name not in models_dict:
+                            models_dict[model_name] = "compiled"
+        # 3. Check local raw HF hub cache
+        if self.hf_hub_cache.exists():
+            for entry in self.hf_hub_cache.iterdir():
+                if entry.is_dir() and entry.name.startswith("models--"):
+                    parts = entry.name.split("--")[1:]
+                    if parts:
+                        repo_id = "/".join(parts)
+                        if repo_id not in models_dict:
+                            models_dict[repo_id] = "raw"
+        return [{"id": name, "status": status} for name, status in models_dict.items()]
+    def load_model(self, model_name: str) -> NPUPipeline:
+        """
+        Loads the requested model. If another model is currently in the process of loading,
+        blocks until loading is complete, then returns the active pipeline.
+        """
+        with self.load_condition:
+            # Queue request if server is currently compiling/loading
+            while self.is_loading:
+                self.logger.info(f"Model load in progress. Queuing request for model '{model_name}'...")
+                self.load_condition.wait()
+            # If active model matches requested one, serve immediately
+            if self.active_pipeline and self.active_model_name == model_name:
+                return self.active_pipeline
+            # Set load state to block any concurrent requests
+            self.is_loading = True
+        try:
+            self.logger.info(f"Initiating loading/compilation process for '{model_name}'...")
+            # 1. Search for compiled slot
+            result = find_slot(model_name, self.genai_cache_root, self.hf_hub_cache)
+            if result:
+                slot_dir, manifest = result
+                self.logger.info(f"Found compiled slot at {slot_dir}. Loading...")
+                # Unload old pipeline to free NPU memory
+                with self.load_condition:
+                    if self.active_pipeline:
+                        self.logger.info(f"Unloading active model '{self.active_model_name}' to free NPU resources...")
+                        self.active_pipeline = None
+                        gc.collect()
+                        time.sleep(0.5)
+                pipe = NPUPipeline(slot_dir, manifest, logger=self.logger)
+            else:
+                self.logger.info(f"Model '{model_name}' is not compiled. Running download & NPU compile...")
+                # Unload active pipeline to free NPU resources for compilation
+                with self.load_condition:
+                    if self.active_pipeline:
+                        self.logger.info(f"Unloading active model '{self.active_model_name}' before compilation...")
+                        self.active_pipeline = None
+                        gc.collect()
+                        time.sleep(0.5)
+                slot_dir = download_and_compile(
+                    model_name=model_name,
+                    genai_cache_root=self.genai_cache_root,
+                    hf_hub_cache=self.hf_hub_cache,
+                    allow_download=self.allow_download,
+                    logger=self.logger
+                )
+                manifest = json.loads((slot_dir / "manifest.json").read_text())
+                pipe = NPUPipeline(slot_dir, manifest, logger=self.logger)
+            # Store the newly active pipeline
+            with self.load_condition:
+                self.active_pipeline = pipe
+                self.active_model_name = model_name
+                self.logger.info(f"Successfully activated and loaded '{model_name}'!")
+                return self.active_pipeline
+        finally:
+            with self.load_condition:
+                # Release loading lock state and wake up all waiting threads
+                self.is_loading = False
+                self.load_condition.notify_all()
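
The locking scheme in `load_model` lets one thread do the slow load while the others wait on a condition variable and then reuse the result. The sketch below isolates that pattern. `LoadGate` and `slow_loader` are hypothetical names, and the loader is a short sleep standing in for compilation or blob loading.

```python
import threading
import time


class LoadGate:
    """Let one thread load at a time; waiting threads reuse the result."""

    def __init__(self, loader):
        self._loader = loader            # callable: name -> loaded object
        self._cond = threading.Condition()
        self._loading = False
        self._active_name = None
        self._active = None
        self.load_calls = 0              # counts real loads, for the demo

    def get(self, name):
        with self._cond:
            while self._loading:         # another thread is loading: wait
                self._cond.wait()
            if self._active is not None and self._active_name == name:
                return self._active      # already loaded: reuse it
            self._loading = True         # claim the load
        try:
            obj = self._loader(name)     # slow work runs outside the lock
            with self._cond:
                self.load_calls += 1
                self._active, self._active_name = obj, name
            return obj
        finally:
            with self._cond:
                self._loading = False
                self._cond.notify_all()  # wake every waiting thread


def slow_loader(name):
    time.sleep(0.05)                     # stands in for compile or load time
    return f"pipeline:{name}"


gate = LoadGate(slow_loader)
results = []
threads = [
    threading.Thread(target=lambda: results.append(gate.get("model-a")))
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results, gate.load_calls)
```

Four threads request the same model, but the loader runs once; the other three either wait for the flag to clear or arrive after the result is stored.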
+def build_flask_app(manager: NPUPipelineManager, logger=None, prompt_logger=None):
+    """
+    Builds and returns the Flask WSGI application instance.
+    """
+    try:
+        from flask import Flask, request, jsonify, Response
+    except ImportError as e:
+        raise ImportError("[ERROR] flask not found.") from e
+    app = Flask(__name__)
+    log = logger or logging.getLogger("npu_server")
+    plog = prompt_logger or logging.getLogger("prompt_logger")
+    @app.route("/", methods=["GET"])
+    def index():
+        return jsonify({
+            "status": "running",
+            "active_model": manager.active_model_name or "none",
+            "info": "NPU Dynamic Model Server"
+        })
+    @app.route("/health", methods=["GET"])
+    def health():
+        if manager.active_pipeline:
+            return jsonify({
+                "status": "ok",
+                "model": manager.active_model_name,
+                "device": manager.active_pipeline.device
+            })
+        return jsonify({"status": "idle", "info": "No active model loaded"})
+    @app.route("/currentmodel", methods=["GET"])
+    def currentmodel():
+        """
+        Returns info about the currently loaded active model.
+        """
+        if manager.active_pipeline:
+            return jsonify({
+                "model": manager.active_model_name,
+                "device": manager.active_pipeline.device,
+                "status": "loaded"
+            })
+        return jsonify({
+            "model": "none",
+            "device": "NPU",
+            "status": "idle"
+        })
+    @app.route("/v1/models/load", methods=["POST"])
+    @app.route("/load", methods=["POST"])
+    def load_model_endpoint():
+        """
+        Receives a POST request to load a model.
+        While this compilation/load runs, other requests will block in queue.
+        """
+        data = request.get_json(force=True) if request.data else {}
+        model_name = data.get("model")
+        if not model_name:
+            return jsonify({"error": "No 'model' parameter specified in JSON body"}), 400
+        try:
+            npu = manager.load_model(model_name)
+            return jsonify({
+                "status": "success",
+                "model": npu.model_name,
+                "device": npu.device,
+                "message": f"Successfully loaded model '{model_name}'"
+            })
+        except Exception as e:
+            log.error(f"Failed to load model '{model_name}': {e}")
+            return jsonify({"error": f"Failed to load model '{model_name}': {str(e)}"}), 500
+    @app.route("/v1/models", methods=["GET"])
+    def v1_models():
+        available = manager.list_all_models()
+        return jsonify({
+            "object": "list",
+            "data": [
+                {
+                    "id": m["id"],
+                    "object": "model",
+                    "created": int(time.time()),
+                    "owned_by": "npu-local",
+                    "status": m["status"]
+                }
+                for m in available
+            ]
+        })
+    @app.after_request
+    def add_cors_headers(response):
+        response.headers["Access-Control-Allow-Origin"] = "*"
+        response.headers["Access-Control-Allow-Methods"] = "GET, POST, OPTIONS"
+        response.headers["Access-Control-Allow-Headers"] = "Content-Type, Authorization"
+        return response
+    @app.route("/v1/chat/completions", methods=["POST", "OPTIONS"])
+    def chat_completions():
+        if request.method == "OPTIONS":
+            return "", 200
+        raw_data = request.get_data(as_text=True)
+        plog.info(f"RAW REQUEST RECEIVED:\n{raw_data}")
+        data = request.get_json(force=True)
+        messages = data.get("messages", [])
+        max_tokens = int(data.get("max_tokens", 512))
+        temperature = float(data.get("temperature", 0.7))
+        stream = data.get("stream", False)
+        req_id = f"chatcmpl-{uuid.uuid4().hex[:8]}"
+        created = int(time.time())
+        # Dynamic model selection
+        model_name = data.get("model")
+        if not model_name:
+            if manager.active_model_name:
+                model_name = manager.active_model_name
+            else:
+                return jsonify({"error": "No model is currently loaded and no 'model' parameter was specified in the request."}), 400
+        try:
+            npu = manager.load_model(model_name)
+        except Exception as e:
+            log.error(f"Failed to load or compile model '{model_name}': {e}")
+            return jsonify({"error": f"Failed to load or compile model '{model_name}': {str(e)}"}), 500
+        try:
+            prompt = npu.apply_chat_template(messages)
+            log.debug(f"Generated Prompt:\n{prompt}")
+            plog.info(prompt)
+        except Exception as e:
+            log.error(f"Template failed: {e}")
+            prompt = messages[-1]["content"] if messages else ""
+            plog.info(f"FALLBACK PROMPT:\n{prompt}")
+        if stream:
+            def event_stream():
+                import queue
+                q = queue.Queue()
+                done = threading.Event()
+                def cb(token: str) -> bool:
+                    q.put(token)
+                    return False
+                def run_gen():
+                    try:
+                        npu.generate(prompt, max_new_tokens=max_tokens, temperature=temperature, stream_cb=cb)
+                    except Exception as ex:
+                        log.error(f"Stream generation error: {ex}")
+                    finally:
+                        done.set()
+                threading.Thread(target=run_gen).start()
+                while not (done.is_set() and q.empty()):
+                    try:
+                        tok = q.get(timeout=0.1)
+                        yield f"data: {json.dumps({'choices': [{'delta': {'content': tok}}]})}\n\n"
+                    except queue.Empty:
+                        continue
+                yield "data: [DONE]\n\n"
+            return Response(event_stream(), mimetype="text/event-stream")
+        t0 = time.time()
+        result = npu.generate(prompt, max_new_tokens=max_tokens, temperature=temperature)
+        log.debug(f"Generation Result: '{result}'")
+        gen_ms = round((time.time() - t0) * 1000)
+        return jsonify({
+            "id": req_id,
+            "object": "chat.completion",
+            "created": created,
+            "model": npu.model_name,
+            "choices": [{
+                "index": 0,
+                "message": {"role": "assistant", "content": result},
+                "finish_reason": "stop",
+            }],
+            "timings": {"generation_ms": gen_ms},
+        })
+    return app
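
On the client side, a streamed reply from `/v1/chat/completions` arrives as `data: ...` lines that end with `data: [DONE]`. The sketch below builds a request body with the fields `chat_completions` reads and reassembles text from sample event lines. `build_chat_request` and `collect_stream_text` are hypothetical helpers, and the sample lines are hand-written to match the shape `event_stream` yields rather than captured from a running server.

```python
import json


def build_chat_request(model: str, user_text: str, stream: bool = True) -> dict:
    """JSON body with the fields chat_completions() reads."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": 64,
        "temperature": 0.0,  # the server only enables sampling when this is > 0
        "stream": stream,
    }


def collect_stream_text(lines) -> str:
    """Join the delta contents from 'data: ...' lines until [DONE]."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data: "):
            continue  # skip the blank separator lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        parts.append(chunk["choices"][0]["delta"]["content"])
    return "".join(parts)


# Hand-written sample events in the shape event_stream() yields.
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "",
    "data: [DONE]",
    "",
]

body = build_chat_request("Qwen/Qwen2.5-0.5B-Instruct", "Say hello")
text = collect_stream_text(sample)
print(text)
```

The streamed chunks carry only `choices[0].delta.content`, so a client that expects `id` or `model` fields on each chunk would need to tolerate their absence.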
+def run_server(
+    genai_cache_root: str | Path | None = None,
+    hf_hub_cache: str | Path | None = None,
+    model_name: str | None = None,
+    allow_download: bool = True,
+    port: int = 8080,
+    host: str = "0.0.0.0",
+    log_file: str | Path | None = None,
+    prompt_log_file: str | Path | None = None,
+    threaded: bool = True
+):
+    """
+    Starts the NPU model server.
+      - If `model_name` is provided, pre-loads the model immediately at startup.
+      - If `model_name` is not provided, starts in idle mode and lazy-loads/compiles
+        models dynamically based on request criteria.
+    All paths are optional and fall back to subfolders inside ~/.cache/npuserver/ by default.
+    """
+    logger, prompt_logger = setup_logging(log_file=log_file, prompt_log_file=prompt_log_file)
+    manager = NPUPipelineManager(
+        genai_cache_root=genai_cache_root,
+        hf_hub_cache=hf_hub_cache,
+        allow_download=allow_download,
+        logger=logger
+    )
+    if model_name:
+        logger.info(f"Pre-loading model '{model_name}' during server startup...")
+        try:
+            manager.load_model(model_name)
+        except Exception as e:
+            logger.error(f"Failed to pre-load model '{model_name}': {e}. Server starting in idle mode.")
+    app = build_flask_app(manager, logger=logger, prompt_logger=prompt_logger)
+    logger.info(f"Running NPU model server on {host}:{port}")
+    app.run(host=host, port=port, threaded=threaded)