xinference 0.1.2.tar.gz → 0.1.3.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.
Files changed (58)
  1. {xinference-0.1.2/xinference.egg-info → xinference-0.1.3}/PKG-INFO +105 -1
  2. {xinference-0.1.2 → xinference-0.1.3}/README.md +104 -0
  3. {xinference-0.1.2 → xinference-0.1.3}/setup.cfg +0 -2
  4. {xinference-0.1.2 → xinference-0.1.3}/xinference/_version.py +3 -3
  5. {xinference-0.1.2 → xinference-0.1.3}/xinference/core/worker.py +1 -0
  6. {xinference-0.1.2 → xinference-0.1.3}/xinference/deploy/cmdline.py +35 -8
  7. {xinference-0.1.2 → xinference-0.1.3}/xinference/deploy/worker.py +2 -2
  8. {xinference-0.1.2 → xinference-0.1.3}/xinference/model/llm/ggml/llamacpp.py +1 -0
  9. {xinference-0.1.2 → xinference-0.1.3}/xinference/model/llm/llm_family.json +30 -15
  10. {xinference-0.1.2 → xinference-0.1.3}/xinference/model/llm/llm_family.py +8 -6
  11. {xinference-0.1.2 → xinference-0.1.3}/xinference/model/llm/pytorch/core.py +63 -40
  12. {xinference-0.1.2 → xinference-0.1.3/xinference.egg-info}/PKG-INFO +105 -1
  13. {xinference-0.1.2 → xinference-0.1.3}/xinference.egg-info/requires.txt +0 -2
  14. {xinference-0.1.2 → xinference-0.1.3}/LICENSE +0 -0
  15. {xinference-0.1.2 → xinference-0.1.3}/MANIFEST.in +0 -0
  16. {xinference-0.1.2 → xinference-0.1.3}/pyproject.toml +0 -0
  17. {xinference-0.1.2 → xinference-0.1.3}/setup.py +0 -0
  18. {xinference-0.1.2 → xinference-0.1.3}/versioneer.py +0 -0
  19. {xinference-0.1.2 → xinference-0.1.3}/xinference/__init__.py +0 -0
  20. {xinference-0.1.2 → xinference-0.1.3}/xinference/client.py +0 -0
  21. {xinference-0.1.2 → xinference-0.1.3}/xinference/constants.py +0 -0
  22. {xinference-0.1.2 → xinference-0.1.3}/xinference/core/__init__.py +0 -0
  23. {xinference-0.1.2 → xinference-0.1.3}/xinference/core/api.py +0 -0
  24. {xinference-0.1.2 → xinference-0.1.3}/xinference/core/gradio.py +0 -0
  25. {xinference-0.1.2 → xinference-0.1.3}/xinference/core/model.py +0 -0
  26. {xinference-0.1.2 → xinference-0.1.3}/xinference/core/resource.py +0 -0
  27. {xinference-0.1.2 → xinference-0.1.3}/xinference/core/restful_api.py +0 -0
  28. {xinference-0.1.2 → xinference-0.1.3}/xinference/core/supervisor.py +0 -0
  29. {xinference-0.1.2 → xinference-0.1.3}/xinference/core/utils.py +0 -0
  30. {xinference-0.1.2 → xinference-0.1.3}/xinference/deploy/__init__.py +0 -0
  31. {xinference-0.1.2 → xinference-0.1.3}/xinference/deploy/local.py +0 -0
  32. {xinference-0.1.2 → xinference-0.1.3}/xinference/deploy/supervisor.py +0 -0
  33. {xinference-0.1.2 → xinference-0.1.3}/xinference/deploy/test/__init__.py +0 -0
  34. {xinference-0.1.2 → xinference-0.1.3}/xinference/deploy/utils.py +0 -0
  35. {xinference-0.1.2 → xinference-0.1.3}/xinference/isolation.py +0 -0
  36. {xinference-0.1.2 → xinference-0.1.3}/xinference/locale/__init__.py +0 -0
  37. {xinference-0.1.2 → xinference-0.1.3}/xinference/locale/utils.py +0 -0
  38. {xinference-0.1.2 → xinference-0.1.3}/xinference/locale/zh_CN.json +0 -0
  39. {xinference-0.1.2 → xinference-0.1.3}/xinference/model/__init__.py +0 -0
  40. {xinference-0.1.2 → xinference-0.1.3}/xinference/model/core.py +0 -0
  41. {xinference-0.1.2 → xinference-0.1.3}/xinference/model/llm/__init__.py +0 -0
  42. {xinference-0.1.2 → xinference-0.1.3}/xinference/model/llm/core.py +0 -0
  43. {xinference-0.1.2 → xinference-0.1.3}/xinference/model/llm/ggml/__init__.py +0 -0
  44. {xinference-0.1.2 → xinference-0.1.3}/xinference/model/llm/ggml/chatglm.py +0 -0
  45. {xinference-0.1.2 → xinference-0.1.3}/xinference/model/llm/pytorch/__init__.py +0 -0
  46. {xinference-0.1.2 → xinference-0.1.3}/xinference/model/llm/pytorch/baichuan.py +0 -0
  47. {xinference-0.1.2 → xinference-0.1.3}/xinference/model/llm/pytorch/chatglm.py +0 -0
  48. {xinference-0.1.2 → xinference-0.1.3}/xinference/model/llm/pytorch/compression.py +0 -0
  49. {xinference-0.1.2 → xinference-0.1.3}/xinference/model/llm/pytorch/falcon.py +0 -0
  50. {xinference-0.1.2 → xinference-0.1.3}/xinference/model/llm/pytorch/utils.py +0 -0
  51. {xinference-0.1.2 → xinference-0.1.3}/xinference/model/llm/pytorch/vicuna.py +0 -0
  52. {xinference-0.1.2 → xinference-0.1.3}/xinference/model/llm/utils.py +0 -0
  53. {xinference-0.1.2 → xinference-0.1.3}/xinference/types.py +0 -0
  54. {xinference-0.1.2 → xinference-0.1.3}/xinference.egg-info/SOURCES.txt +0 -0
  55. {xinference-0.1.2 → xinference-0.1.3}/xinference.egg-info/dependency_links.txt +0 -0
  56. {xinference-0.1.2 → xinference-0.1.3}/xinference.egg-info/entry_points.txt +0 -0
  57. {xinference-0.1.2 → xinference-0.1.3}/xinference.egg-info/not-zip-safe +0 -0
  58. {xinference-0.1.2 → xinference-0.1.3}/xinference.egg-info/top_level.txt +0 -0
{xinference-0.1.2/xinference.egg-info → xinference-0.1.3}/PKG-INFO

@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: xinference
- Version: 0.1.2
+ Version: 0.1.3
  Summary: Model Serving Made Easy
  Home-page: https://github.com/xorbitsai/inference
  Author: Qin Xuye
@@ -238,6 +238,110 @@ $ xinference list --all
  - If you want to use Apple Metal GPU for acceleration, please choose the q4_0 and q4_1 quantization methods.
  - `llama-2-chat` 70B ggmlv3 model only supports q4_0 quantization currently.

+ ## Custom models \[Experimental\]
+ Custom models are currently an experimental feature and are expected to be officially released in version v0.2.0.
+
+ Define a custom model based on the following template:
+ ```python
+ custom_model = {
+     "version": 1,
+     # model name. must start with a letter or a
+     # digit, and can only contain letters, digits,
+     # underscores, or dashes.
+     "model_name": "nsql-2B",
+     # supported languages
+     "model_lang": [
+         "en"
+     ],
+     # model abilities. could be "embed", "generate"
+     # and "chat".
+     "model_ability": [
+         "generate"
+     ],
+     # model specifications.
+     "model_specs": [
+         {
+             # model format.
+             "model_format": "pytorch",
+             "model_size_in_billions": 2,
+             # quantizations.
+             "quantizations": [
+                 "4-bit",
+                 "8-bit",
+                 "none"
+             ],
+             # hugging face model ID.
+             "model_id": "NumbersStation/nsql-2B"
+         }
+     ],
+     # prompt style, required by chat models.
+     # for more details, see: xinference/model/llm/tests/test_utils.py
+     "prompt_style": None
+ }
+ ```
+
+ Register the custom model:
+ ```python
+ import json
+
+ from xinference.client import Client
+
+ # replace with real xinference endpoint
+ endpoint = "http://localhost:9997"
+ client = Client(endpoint)
+ client.register_model(model_type="LLM", model=json.dumps(custom_model), persist=False)
+ ```
+
+ Load the custom model:
+ ```python
+ uid = client.launch_model(model_name='nsql-2B')
+ ```
+
+ Run the custom model:
+ ```python
+ text = """CREATE TABLE work_orders (
+     ID NUMBER,
+     CREATED_AT TEXT,
+     COST FLOAT,
+     INVOICE_AMOUNT FLOAT,
+     IS_DUE BOOLEAN,
+     IS_OPEN BOOLEAN,
+     IS_OVERDUE BOOLEAN,
+     COUNTRY_NAME TEXT,
+ )
+
+ -- Using valid SQLite, answer the following questions for the tables provided above.
+
+ -- how many work orders are open?
+
+ SELECT"""
+
+ model = client.get_model(model_uid=uid)
+ model.generate(prompt=text)
+ ```
+
+ Result:
+ ```json
+ {
+     "id":"aeb5c87a-352e-11ee-89ad-9af9f16816c5",
+     "object":"text_completion",
+     "created":1691418511,
+     "model":"3b912fc4-352e-11ee-8e66-9af9f16816c5",
+     "choices":[
+         {
+             "text":" COUNT(*) FROM work_orders WHERE IS_OPEN = '1';",
+             "index":0,
+             "logprobs":"None",
+             "finish_reason":"stop"
+         }
+     ],
+     "usage":{
+         "prompt_tokens":117,
+         "completion_tokens":17,
+         "total_tokens":134
+     }
+ }
+ ```

  ## Pytorch Model Best Practices

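Taken together, the snippets in the section above form a complete register-launch-generate workflow. The helper below is a minimal sketch of that workflow using only the client calls shown in the diff (`register_model`, `launch_model`, `get_model`, `generate`); the wrapper function itself and the `persist=True` flag are illustrative assumptions, not part of the documented example.

```python
import json

from xinference.client import Client


def register_and_generate(endpoint: str, model_def: dict, prompt: str) -> dict:
    """Register a custom model definition, launch it, and run one completion.

    A sketch assuming an xinference endpoint is already running at `endpoint`.
    """
    client = Client(endpoint)
    # persist=True is an assumption; the README example uses persist=False.
    client.register_model(model_type="LLM", model=json.dumps(model_def), persist=True)
    uid = client.launch_model(model_name=model_def["model_name"])
    model = client.get_model(model_uid=uid)
    return model.generate(prompt=prompt)
```

With the objects defined above, this would be called as `register_and_generate("http://localhost:9997", custom_model, text)` and should return a completion dict shaped like the `Result` JSON shown earlier.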
{xinference-0.1.2 → xinference-0.1.3}/README.md

@@ -210,6 +210,110 @@ $ xinference list --all
  - If you want to use Apple Metal GPU for acceleration, please choose the q4_0 and q4_1 quantization methods.
  - `llama-2-chat` 70B ggmlv3 model only supports q4_0 quantization currently.

+ ## Custom models \[Experimental\]
+ (The 104 added lines are identical to the "Custom models" section shown in the PKG-INFO diff above.)

  ## Pytorch Model Best Practices

{xinference-0.1.2 → xinference-0.1.3}/setup.cfg

@@ -60,7 +60,6 @@ dev =
      flake8>=3.8.0
      black
  all =
-     chatglm-cpp
      llama-cpp-python>=0.1.77
      transformers>=4.31.0
      torch
@@ -72,7 +71,6 @@ all =
      einops
      tiktoken
  ggml =
-     chatglm-cpp
      llama-cpp-python>=0.1.77
  pytorch =
      transformers>=4.31.0
{xinference-0.1.2 → xinference-0.1.3}/xinference/_version.py

@@ -8,11 +8,11 @@ import json

  version_json = '''
  {
-  "date": "2023-08-04T18:35:56+0800",
+  "date": "2023-08-09T18:43:41+0800",
   "dirty": false,
   "error": null,
-  "full-revisionid": "98765f249b05b51514078cc97b88e92ce40e6948",
-  "version": "0.1.2"
+  "full-revisionid": "4d2f61cb6591ac94624f035b37259a89002abefd",
+  "version": "0.1.3"
  }
  '''  # END VERSION_JSON

{xinference-0.1.2 → xinference-0.1.3}/xinference/core/worker.py

@@ -108,6 +108,7 @@ class WorkerActor(xo.Actor):
              "model_format": llm_spec.model_format,
              "model_size_in_billions": llm_spec.model_size_in_billions,
              "quantization": quantization,
+             "revision": llm_spec.model_revision,
          }

      @log_sync(logger=logger)
{xinference-0.1.2 → xinference-0.1.3}/xinference/deploy/cmdline.py

@@ -11,8 +11,7 @@
  # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  # See the License for the specific language governing permissions and
  # limitations under the License.
-
-
+ import configparser
  import logging
  import os
  import sys
@@ -31,6 +30,32 @@ from ..constants import (
  )


+ def get_config_string(log_level: str) -> str:
+     return f"""
+ [loggers]
+ keys=root
+
+ [handlers]
+ keys=stream_handler
+
+ [formatters]
+ keys=formatter
+
+ [logger_root]
+ level={log_level.upper()}
+ handlers=stream_handler
+
+ [handler_stream_handler]
+ class=StreamHandler
+ formatter=formatter
+ level={log_level.upper()}
+ args=(sys.stderr,)
+
+ [formatter_formatter]
+ format=%(asctime)s %(name)-12s %(process)d %(levelname)-8s %(message)s
+ """
+
+
  def get_endpoint(endpoint: Optional[str]) -> str:
      # user didn't specify the endpoint.
      if endpoint is None:
@@ -58,9 +83,10 @@ def cli(
      if ctx.invoked_subcommand is None:
          from .local import main

-         if log_level:
-             logging.basicConfig(level=logging.getLevelName(log_level.upper()))
-             logging_conf = dict(level=log_level.upper())
+         logging_conf = configparser.RawConfigParser()
+         logger_config_string = get_config_string(log_level)
+         logging_conf.read_string(logger_config_string)
+         logging.config.fileConfig(logging_conf)  # type: ignore

          address = f"{host}:{get_next_port()}"

@@ -103,9 +129,10 @@ def supervisor(
  def worker(log_level: str, endpoint: Optional[str], host: str):
      from ..deploy.worker import main

-     if log_level:
-         logging.basicConfig(level=logging.getLevelName(log_level.upper()))
-         logging_conf = dict(level=log_level.upper())
+     logging_conf = configparser.RawConfigParser()
+     logger_config_string = get_config_string(log_level)
+     logging_conf.read_string(logger_config_string)
+     logging.config.fileConfig(level=logging.getLevelName(log_level.upper()))  # type: ignore

      endpoint = get_endpoint(endpoint)

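Both the default `xinference` command and `xinference worker` now build their logging setup from a declarative config string instead of `logging.basicConfig`. The standalone sketch below shows how such a string is consumed: `RawConfigParser` (which does not interpolate the `%` placeholders in the format line) parses it, and `logging.config.fileConfig` accepts the parser object directly. The config text mirrors `get_config_string` from the diff; the wrapper function and the `disable_existing_loggers` flag are illustrative choices, not xinference's exact code.

```python
import configparser
import logging
import logging.config


def configure_logging(log_level: str = "INFO") -> None:
    """Apply a file-style logging config built from a string.

    A standalone sketch of the pattern introduced in cmdline.py; the
    config text mirrors get_config_string() in the diff above.
    """
    config_string = f"""
[loggers]
keys=root

[handlers]
keys=stream_handler

[formatters]
keys=formatter

[logger_root]
level={log_level.upper()}
handlers=stream_handler

[handler_stream_handler]
class=StreamHandler
formatter=formatter
level={log_level.upper()}
args=(sys.stderr,)

[formatter_formatter]
format=%(asctime)s %(name)-12s %(process)d %(levelname)-8s %(message)s
"""
    parser = configparser.RawConfigParser()
    parser.read_string(config_string)
    # fileConfig accepts a RawConfigParser instance as well as a file path.
    logging.config.fileConfig(parser, disable_existing_loggers=False)


configure_logging("debug")
logging.getLogger(__name__).debug("logging configured")
```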
{xinference-0.1.2 → xinference-0.1.3}/xinference/deploy/worker.py

@@ -14,7 +14,7 @@

  import asyncio
  import logging
- from typing import Dict, Optional
+ from typing import Any, Dict, Optional

  import xoscar as xo

@@ -53,7 +53,7 @@ async def _start_worker(
      await pool.join()


- def main(address: str, supervisor_address: str, logging_conf: Optional[Dict] = None):
+ def main(address: str, supervisor_address: str, logging_conf: Any = None):
      loop = asyncio.get_event_loop()
      task = loop.create_task(_start_worker(address, supervisor_address, logging_conf))

{xinference-0.1.2 → xinference-0.1.3}/xinference/model/llm/ggml/llamacpp.py

@@ -139,6 +139,7 @@ class LlamaCppModel(LLM):
              llamacpp_model_config["n_gqa"] = 8

          if self._is_darwin_and_apple_silicon() and self._can_apply_metal():
+             # TODO: platform.processor() is not safe, need to be replaced to other method.
              llamacpp_model_config.setdefault("n_gpu_layers", 1)
          elif self._is_linux() and self._can_apply_cublas():
              llamacpp_model_config.setdefault("n_gpu_layers", self._gpu_layers)
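The only change here is a TODO noting that `platform.processor()` (used by `_is_darwin_and_apple_silicon`) is not a reliable way to detect Apple Silicon. One commonly suggested alternative, sketched below as an assumption rather than the project's actual fix, is to check `platform.machine()` instead; the `llamacpp_model_config` usage mirrors the `n_gpu_layers` default set in the diff.

```python
import platform
import sys


def is_apple_silicon() -> bool:
    """One possible replacement for the platform.processor() check the TODO
    refers to: platform.machine() reports "arm64" on Apple Silicon, whereas an
    x86_64 interpreter running under Rosetta reports "x86_64"."""
    return sys.platform == "darwin" and platform.machine() == "arm64"


# Hypothetical usage mirroring the diff: offload one layer to Metal on
# Apple Silicon, otherwise leave n_gpu_layers for the CUDA branch.
llamacpp_model_config: dict = {}
if is_apple_silicon():
    llamacpp_model_config.setdefault("n_gpu_layers", 1)
```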
{xinference-0.1.2 → xinference-0.1.3}/xinference/model/llm/llm_family.json

@@ -41,7 +41,8 @@
        "8-bit",
        "none"
      ],
-     "model_id": "baichuan-inc/Baichuan-7B"
+     "model_id": "baichuan-inc/Baichuan-7B",
+     "model_revision": "c1a5c7d5b7f50ecc51bb0e08150a9f12e5656756"
    },
    {
      "model_format": "pytorch",
@@ -51,7 +52,8 @@
        "8-bit",
        "none"
      ],
-     "model_id": "baichuan-inc/Baichuan-13B-Base"
+     "model_id": "baichuan-inc/Baichuan-13B-Base",
+     "model_revision": "0ef0739c7bdd34df954003ef76d80f3dabca2ff9"
    }
  ],
  "prompt_style": null
@@ -98,7 +100,8 @@
        "8-bit",
        "none"
      ],
-     "model_id": "baichuan-inc/Baichuan-13B-Chat"
+     "model_id": "baichuan-inc/Baichuan-13B-Chat",
+     "model_revision": "19ef51ba5bad8935b03acd20ff04a269210983bc"
    }
  ],
  "prompt_style": {
@@ -267,7 +270,8 @@
        "8-bit",
        "none"
      ],
-     "model_id": "lmsys/vicuna-33b-v1.3"
+     "model_id": "lmsys/vicuna-33b-v1.3",
+     "model_revision": "ef8d6becf883fb3ce52e3706885f761819477ab4"
    },
    {
      "model_format": "pytorch",
@@ -277,7 +281,8 @@
        "8-bit",
        "none"
      ],
-     "model_id": "lmsys/vicuna-13b-v1.3"
+     "model_id": "lmsys/vicuna-13b-v1.3",
+     "model_revision": "6566e9cb1787585d1147dcf4f9bc48f29e1328d2"
    },
    {
      "model_format": "pytorch",
@@ -287,7 +292,8 @@
        "8-bit",
        "none"
      ],
-     "model_id": "lmsys/vicuna-7b-v1.3"
+     "model_id": "lmsys/vicuna-7b-v1.3",
+     "model_revision": "236eeeab96f0dc2e463f2bebb7bb49809279c6d6"
    }
  ],
  "prompt_style": {
@@ -395,7 +401,8 @@
        "8-bit",
        "none"
      ],
-     "model_id": "THUDM/chatglm-6b"
+     "model_id": "THUDM/chatglm-6b",
+     "model_revision": "b1502f4f75c71499a3d566b14463edd62620ce9f"
    }
  ],
  "prompt_style": {
@@ -441,7 +448,8 @@
        "8-bit",
        "none"
      ],
-     "model_id": "THUDM/chatglm2-6b"
+     "model_id": "THUDM/chatglm2-6b",
+     "model_revision": "b1502f4f75c71499a3d566b14463edd62620ce9f"
    }
  ],
  "prompt_style": {
@@ -474,7 +482,8 @@
        "8-bit",
        "none"
      ],
-     "model_id": "THUDM/chatglm2-6b-32k"
+     "model_id": "THUDM/chatglm2-6b-32k",
+     "model_revision": "455746d4706479a1cbbd07179db39eb2741dc692"
    }
  ],
  "prompt_style": {
@@ -643,7 +652,8 @@
        "8-bit",
        "none"
      ],
-     "model_id": "facebook/opt-125m"
+     "model_id": "facebook/opt-125m",
+     "model_revision": "3d2b5f275bdf882b8775f902e1bfdb790e2cfc32"
    }
  ],
  "prompt_style": null
@@ -667,7 +677,8 @@
        "8-bit",
        "none"
      ],
-     "model_id": "tiiuae/falcon-40b"
+     "model_id": "tiiuae/falcon-40b",
+     "model_revision": "561820f7eef0cc56a31ea38af15ca1acb07fab5d"
    },
    {
      "model_format": "pytorch",
@@ -677,7 +688,8 @@
        "8-bit",
        "none"
      ],
-     "model_id": "tiiuae/falcon-7b"
+     "model_id": "tiiuae/falcon-7b",
+     "model_revision": "378337427557d1df3e742264a2901a49f25d4eb1"
    }
  ],
  "prompt_style": null
@@ -701,7 +713,8 @@
        "8-bit",
        "none"
      ],
-     "model_id": "tiiuae/falcon-7b-instruct"
+     "model_id": "tiiuae/falcon-7b-instruct",
+     "model_revision": "eb410fb6ffa9028e97adb801f0d6ec46d02f8b07"
    },
    {
      "model_format": "pytorch",
@@ -711,7 +724,8 @@
        "8-bit",
        "none"
      ],
-     "model_id": "tiiuae/falcon-40b-instruct"
+     "model_id": "tiiuae/falcon-40b-instruct",
+     "model_revision": "ca78eac0ed45bf64445ff0687fabba1598daebf3"
    }
  ],
  "prompt_style": {
@@ -759,7 +773,8 @@
        "8-bit",
        "none"
      ],
-     "model_id": "Qwen/Qwen-7B-Chat"
+     "model_id": "Qwen/Qwen-7B-Chat",
+     "model_revision": "5c611a5cde5769440581f91e8b4bba050f62b1af"
    }
  ],
  "prompt_style": {
{xinference-0.1.2 → xinference-0.1.3}/xinference/model/llm/llm_family.py

@@ -34,6 +34,7 @@ class GgmlLLMSpecV1(BaseModel):
      model_id: str
      model_file_name_template: str
      model_uri: Optional[str]
+     model_revision: Optional[str]


  class PytorchLLMSpecV1(BaseModel):
@@ -42,6 +43,7 @@ class PytorchLLMSpecV1(BaseModel):
      quantizations: List[str]
      model_id: str
      model_uri: Optional[str]
+     model_revision: Optional[str]


  class PromptStyleV1(BaseModel):
@@ -139,6 +141,7 @@ def cache_from_huggingface(
          assert isinstance(llm_spec, PytorchLLMSpecV1)
          huggingface_hub.snapshot_download(
              llm_spec.model_id,
+             revision=llm_spec.model_revision,
              local_dir=cache_dir,
              local_dir_use_symlinks=True,
          )
@@ -147,6 +150,7 @@
          file_name = llm_spec.model_file_name_template.format(quantization=quantization)
          huggingface_hub.hf_hub_download(
              llm_spec.model_id,
+             revision=llm_spec.model_revision,
              filename=file_name,
              local_dir=cache_dir,
              local_dir_use_symlinks=True,
@@ -160,13 +164,11 @@ def _is_linux():


  def _has_cuda_device():
-     cuda_visible_devices = os.environ.get("CUDA_VISIBLE_DEVICES")
-     if cuda_visible_devices:
-         return True
-     else:
-         from xorbits._mars.resource import cuda_count
+     # `cuda_count` method already contains the logic for the
+     # number of GPUs specified by `CUDA_VISIBLE_DEVICES`.
+     from xorbits._mars.resource import cuda_count

-         return cuda_count() > 0
+     return cuda_count() > 0


  def get_user_defined_llm_families():
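In this file the new `model_revision` field is passed straight through to `huggingface_hub`, so cached weights are pinned to a specific commit instead of whatever the default branch currently points at. A minimal standalone sketch of the same pinning, using the revision recorded for `baichuan-inc/Baichuan-7B` in `llm_family.json` above; the cache path is a placeholder, not xinference's real cache layout.

```python
from huggingface_hub import snapshot_download

# Pin the download to the exact commit recorded in llm_family.json so a
# cached model cannot silently change when the upstream repo moves its branch.
cache_dir = "/tmp/xinference/baichuan-7b-pytorch"  # illustrative path
snapshot_download(
    "baichuan-inc/Baichuan-7B",
    revision="c1a5c7d5b7f50ecc51bb0e08150a9f12e5656756",
    local_dir=cache_dir,
    local_dir_use_symlinks=True,
)
# With revision=None (the default when model_revision is absent), the latest
# commit on the default branch is fetched, matching the old behavior.
```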
{xinference-0.1.2 → xinference-0.1.3}/xinference/model/llm/pytorch/core.py

@@ -47,7 +47,7 @@ class PytorchGenerateConfig(TypedDict, total=False):


  class PytorchModelConfig(TypedDict, total=False):
-     revision: str
+     revision: Optional[str]
      device: str
      gpus: Optional[str]
      num_gpus: int
@@ -79,17 +79,14 @@ class PytorchModel(LLM):
      ) -> PytorchModelConfig:
          if pytorch_model_config is None:
              pytorch_model_config = PytorchModelConfig()
-         pytorch_model_config.setdefault("revision", "main")
+         pytorch_model_config.setdefault("revision", self.model_spec.model_revision)
          pytorch_model_config.setdefault("gpus", None)
          pytorch_model_config.setdefault("num_gpus", 1)
          pytorch_model_config.setdefault("gptq_ckpt", None)
          pytorch_model_config.setdefault("gptq_wbits", 16)
          pytorch_model_config.setdefault("gptq_groupsize", -1)
          pytorch_model_config.setdefault("gptq_act_order", False)
-         if self._is_darwin_and_apple_silicon():
-             pytorch_model_config.setdefault("device", "mps")
-         else:
-             pytorch_model_config.setdefault("device", "cuda")
+         pytorch_model_config.setdefault("device", "auto")
          return pytorch_model_config

      def _sanitize_generate_config(
@@ -142,26 +139,35 @@ class PytorchModel(LLM):

          quantization = self.quantization
          num_gpus = self._pytorch_model_config.get("num_gpus", 1)
-         if self._is_darwin_and_apple_silicon():
-             device = self._pytorch_model_config.get("device", "mps")
-         else:
-             device = self._pytorch_model_config.get("device", "cuda")
+         device = self._pytorch_model_config.get("device", "auto")
+         self._pytorch_model_config["device"] = self._select_device(device)
+         self._device = self._pytorch_model_config["device"]

-         if device == "cpu":
+         if self._device == "cpu":
              kwargs = {"torch_dtype": torch.float32}
-         elif device == "cuda":
+         elif self._device == "cuda":
              kwargs = {"torch_dtype": torch.float16}
-         elif device == "mps":
+         elif self._device == "mps":
              kwargs = {"torch_dtype": torch.float16}
          else:
-             raise ValueError(f"Device {device} is not supported in temporary")
-         kwargs["revision"] = self._pytorch_model_config.get("revision", "main")
+             raise ValueError(f"Device {self._device} is not supported in temporary")
+
+         kwargs["revision"] = self._pytorch_model_config.get(
+             "revision", self.model_spec.model_revision
+         )

          if quantization != "none":
-             if device == "cuda" and self._is_linux():
+             if self._device == "cuda" and self._is_linux():
                  kwargs["device_map"] = "auto"
                  if quantization == "4-bit":
                      kwargs["load_in_4bit"] = True
+                     kwargs["bnb_4bit_compute_dtype"] = torch.float16
+                     kwargs["bnb_4bit_use_double_quant"] = True
+                     kwargs["llm_int8_skip_modules"] = [
+                         "lm_head",
+                         "encoder",
+                         "EncDecAttention",
+                     ]
                  elif quantization == "8-bit":
                      kwargs["load_in_8bit"] = True
                  else:
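The extra 4-bit keyword arguments above are forwarded to the transformers loading path together with `load_in_4bit`. The sketch below shows a roughly equivalent explicit form using `BitsAndBytesConfig` (available in transformers>=4.31, which setup.cfg already requires); it assumes a Linux machine with CUDA and bitsandbytes installed, and the model path is a placeholder rather than xinference's real cache location.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# An explicit, roughly equivalent form of the 4-bit options added above.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # run matmuls in fp16
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    llm_int8_skip_modules=["lm_head", "encoder", "EncDecAttention"],
)
model = AutoModelForCausalLM.from_pretrained(
    "/path/to/cached/model",  # placeholder for the locally cached weights
    device_map="auto",
    quantization_config=quant_config,
)
```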
@@ -178,7 +184,7 @@ class PytorchModel(LLM):
          else:
              self._model, self._tokenizer = load_compress_model(
                  model_path=self.model_path,
-                 device=device,
+                 device=self._device,
                  torch_dtype=kwargs["torch_dtype"],
                  use_fast=self._use_fast_tokenizer,
                  revision=kwargs["revision"],
@@ -189,11 +195,37 @@ class PytorchModel(LLM):
          self._model, self._tokenizer = self._load_model(kwargs)

          if (
-             device == "cuda" and num_gpus == 1 and quantization == "none"
-         ) or device == "mps":
-             self._model.to(device)
+             self._device == "cuda" and num_gpus == 1 and quantization == "none"
+         ) or self._device == "mps":
+             self._model.to(self._device)
          logger.debug(f"Model Memory: {self._model.get_memory_footprint()}")

+     def _select_device(self, device: str) -> str:
+         try:
+             import torch
+         except ImportError:
+             raise ImportError(
+                 f"Failed to import module 'torch'. Please make sure 'torch' is installed.\n\n"
+             )
+
+         if device == "auto":
+             if torch.cuda.is_available():
+                 return "cuda"
+             elif torch.backends.mps.is_available():
+                 return "mps"
+             return "cpu"
+         elif device == "cuda":
+             if not torch.cuda.is_available():
+                 raise ValueError("cuda is unavailable in your environment")
+         elif device == "mps":
+             if not torch.backends.mps.is_available():
+                 raise ValueError("mps is unavailable in your environment")
+         elif device == "cpu":
+             pass
+         else:
+             raise ValueError(f"Device {device} is not supported in temporary")
+         return device
+
      @classmethod
      def match(cls, llm_family: "LLMFamilyV1", llm_spec: "LLMSpecV1") -> bool:
          if llm_spec.model_format != "pytorch":
@@ -222,21 +254,21 @@ class PytorchModel(LLM):
          )

          def generator_wrapper(
-             prompt: str, device: str, generate_config: PytorchGenerateConfig
+             prompt: str, generate_config: PytorchGenerateConfig
          ) -> Iterator[CompletionChunk]:
              if "falcon" in self.model_family.model_name:
                  for completion_chunk, _ in generate_stream_falcon(
-                     self._model, self._tokenizer, prompt, device, generate_config
+                     self._model, self._tokenizer, prompt, self._device, generate_config
                  ):
                      yield completion_chunk
              elif "chatglm" in self.model_family.model_name:
                  for completion_chunk, _ in generate_stream_chatglm(
-                     self._model, self._tokenizer, prompt, device, generate_config
+                     self._model, self._tokenizer, prompt, self._device, generate_config
                  ):
                      yield completion_chunk
              else:
                  for completion_chunk, _ in generate_stream(
-                     self._model, self._tokenizer, prompt, device, generate_config
+                     self._model, self._tokenizer, prompt, self._device, generate_config
                  ):
                      yield completion_chunk

@@ -250,24 +282,20 @@ class PytorchModel(LLM):
          assert self._tokenizer is not None

          stream = generate_config.get("stream", False)
-         if self._is_darwin_and_apple_silicon():
-             device = self._pytorch_model_config.get("device", "mps")
-         else:
-             device = self._pytorch_model_config.get("device", "cuda")
          if not stream:
              if "falcon" in self.model_family.model_name:
                  for completion_chunk, completion_usage in generate_stream_falcon(
-                     self._model, self._tokenizer, prompt, device, generate_config
+                     self._model, self._tokenizer, prompt, self._device, generate_config
                  ):
                      pass
              elif "chatglm" in self.model_family.model_name:
                  for completion_chunk, completion_usage in generate_stream_chatglm(
-                     self._model, self._tokenizer, prompt, device, generate_config
+                     self._model, self._tokenizer, prompt, self._device, generate_config
                  ):
                      pass
              else:
                  for completion_chunk, completion_usage in generate_stream(
-                     self._model, self._tokenizer, prompt, device, generate_config
+                     self._model, self._tokenizer, prompt, self._device, generate_config
                  ):
                      pass
              completion = Completion(
@@ -280,7 +308,7 @@ class PytorchModel(LLM):
              )
              return completion
          else:
-             return generator_wrapper(prompt, device, generate_config)
+             return generator_wrapper(prompt, generate_config)

      def create_embedding(self, input: Union[str, List[str]]) -> Embedding:
          try:
@@ -291,11 +319,6 @@ class PytorchModel(LLM):
                  "Could not import torch. Please install it with `pip install torch`."
              ) from e

-         if self._is_darwin_and_apple_silicon():
-             device = self._pytorch_model_config.get("device", "mps")
-         else:
-             device = self._pytorch_model_config.get("device", "cuda")
-
          if isinstance(input, str):
              inputs = [input]
          else:
@@ -308,8 +331,8 @@ class PytorchModel(LLM):
          encoding = tokenizer.batch_encode_plus(
              inputs, padding=True, return_tensors="pt"
          )
-         input_ids = encoding["input_ids"].to(device)
-         attention_mask = encoding["attention_mask"].to(device)
+         input_ids = encoding["input_ids"].to(self._device)
+         attention_mask = encoding["attention_mask"].to(self._device)
          model_output = self._model(
              input_ids, attention_mask, output_hidden_states=True
          )
@@ -342,7 +365,7 @@ class PytorchModel(LLM):
          embedding = []
          token_num = 0
          for index, text in enumerate(inputs):
-             input_ids = tokenizer.encode(text, return_tensors="pt").to(device)
+             input_ids = tokenizer.encode(text, return_tensors="pt").to(self._device)
              model_output = self._model(input_ids, output_hidden_states=True)
              if is_chatglm:
                  data = (model_output.hidden_states[-1].transpose(0, 1))[0]
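The new `_select_device` helper centralizes device resolution: `"auto"` prefers CUDA, then Apple MPS, then falls back to CPU, while explicitly requested devices are validated against availability. Below is a standalone sketch of the same policy, handy for checking what a given machine would resolve to; the function name and the prints are illustrative, not part of xinference's API.

```python
import torch


def select_device(device: str = "auto") -> str:
    """Standalone version of the resolution order used by _select_device:
    prefer CUDA, then Apple MPS, then fall back to CPU."""
    if device == "auto":
        if torch.cuda.is_available():
            return "cuda"
        if torch.backends.mps.is_available():
            return "mps"
        return "cpu"
    available = {
        "cuda": torch.cuda.is_available(),
        "mps": torch.backends.mps.is_available(),
        "cpu": True,
    }
    if device not in available:
        raise ValueError(f"Device {device} is not supported")
    if not available[device]:
        raise ValueError(f"{device} is unavailable in your environment")
    return device


print(select_device())       # e.g. "cuda" on a GPU box, "mps" on an M1/M2 Mac
print(select_device("cpu"))  # explicit devices are validated, not overridden
```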
{xinference-0.1.2 → xinference-0.1.3/xinference.egg-info}/PKG-INFO

@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: xinference
- Version: 0.1.2
+ Version: 0.1.3
  Summary: Model Serving Made Easy
  Home-page: https://github.com/xorbitsai/inference
  Author: Qin Xuye
@@ -238,6 +238,110 @@ $ xinference list --all
  - If you want to use Apple Metal GPU for acceleration, please choose the q4_0 and q4_1 quantization methods.
  - `llama-2-chat` 70B ggmlv3 model only supports q4_0 quantization currently.

+ ## Custom models \[Experimental\]
+ (The 104 added lines are identical to the "Custom models" section shown in the first PKG-INFO diff above.)

  ## Pytorch Model Best Practices

{xinference-0.1.2 → xinference-0.1.3}/xinference.egg-info/requires.txt

@@ -13,7 +13,6 @@ huggingface-hub<1.0,>=0.14.1
  typing_extensions

  [all]
- chatglm-cpp
  llama-cpp-python>=0.1.77
  transformers>=4.31.0
  torch
@@ -51,7 +50,6 @@ pydata-sphinx-theme>=0.3.0
  sphinx-intl>=0.9.9

  [ggml]
- chatglm-cpp
  llama-cpp-python>=0.1.77

  [pytorch]