vec-inf 0.5.0__py3-none-any.whl → 0.6.1__py3-none-any.whl

This diff shows the changes between publicly available package versions as they appear in their respective public registries, and is provided for informational purposes only.
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: vec-inf
3
- Version: 0.5.0
3
+ Version: 0.6.1
4
4
  Summary: Efficient LLM inference on Slurm clusters using vLLM.
5
5
  Author-email: Marshall Wang <marshall.wang@vectorinstitute.ai>
6
6
  License-Expression: MIT
@@ -25,12 +25,14 @@ Description-Content-Type: text/markdown
25
25
  ----------------------------------------------------
26
26
 
27
27
  [![PyPI](https://img.shields.io/pypi/v/vec-inf)](https://pypi.org/project/vec-inf)
28
+ [![downloads](https://img.shields.io/pypi/dm/vec-inf)](https://pypistats.org/packages/vec-inf)
28
29
  [![code checks](https://github.com/VectorInstitute/vector-inference/actions/workflows/code_checks.yml/badge.svg)](https://github.com/VectorInstitute/vector-inference/actions/workflows/code_checks.yml)
29
- [![docs](https://github.com/VectorInstitute/vector-inference/actions/workflows/docs_deploy.yml/badge.svg)](https://github.com/VectorInstitute/vector-inference/actions/workflows/docs_deploy.yml)
30
- [![codecov](https://codecov.io/github/VectorInstitute/vector-inference/branch/develop/graph/badge.svg?token=NI88QSIGAC)](https://app.codecov.io/github/VectorInstitute/vector-inference/tree/develop)
30
+ [![docs](https://github.com/VectorInstitute/vector-inference/actions/workflows/docs.yml/badge.svg)](https://github.com/VectorInstitute/vector-inference/actions/workflows/docs.yml)
31
+ [![codecov](https://codecov.io/github/VectorInstitute/vector-inference/branch/main/graph/badge.svg?token=NI88QSIGAC)](https://app.codecov.io/github/VectorInstitute/vector-inference/tree/main)
32
+ [![vLLM](https://img.shields.io/badge/vllm-0.8.5.post1-blue)](https://docs.vllm.ai/en/v0.8.5.post1/index.html)
31
33
  ![GitHub License](https://img.shields.io/github/license/VectorInstitute/vector-inference)
32
34
 
33
- This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository runs natively on the Vector Institute cluster environment**. To adapt to other environments, update the environment variables in [`cli/_helper.py`](vec_inf/cli/_helper.py), [`cli/_config.py`](vec_inf/cli/_config.py), [`vllm.slurm`](vec_inf/vllm.slurm), [`multinode_vllm.slurm`](vec_inf/multinode_vllm.slurm) and [`models.yaml`](vec_inf/config/models.yaml) accordingly.
35
+ This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository run natively on the Vector Institute cluster environment**. To adapt to other environments, update the environment variables in [`vec_inf/client/slurm_vars.py`](vec_inf/client/slurm_vars.py) and the cached model weight configuration in [`vec_inf/config/models.yaml`](vec_inf/config/models.yaml) accordingly.
34
36
 
35
37
  ## Installation
36
38
  If you are using the Vector cluster environment and don't need any customization to the inference server environment, run the following to install the package:
@@ -38,11 +40,13 @@ If you are using the Vector cluster environment, and you don't need any customiz
38
40
  ```bash
39
41
  pip install vec-inf
40
42
  ```
41
- Otherwise, we recommend using the provided [`Dockerfile`](Dockerfile) to set up your own environment with the package
43
+ Otherwise, we recommend using the provided [`Dockerfile`](Dockerfile) to set up your own environment with the package. The latest image uses `vLLM` version `0.8.5.post1`.
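+
+ As a minimal sketch (the image tag below is an arbitrary assumption), you could build the image from the repository root where the `Dockerfile` lives:
+
+ ```bash
+ # Build a local image from the provided Dockerfile; "vec-inf-env" is an arbitrary tag.
+ docker build -t vec-inf-env .
+ ```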
42
44
 
43
45
  ## Usage
44
46
 
45
- ### `launch` command
47
+ Vector Inference provides two user interfaces: a CLI and an API.
48
+
49
+ ### CLI
46
50
 
47
51
  The `launch` command allows users to deploy a model as a Slurm job. If the job launches successfully, a URL endpoint is exposed for the user to send inference requests.
48
52
 
@@ -53,18 +57,26 @@ vec-inf launch Meta-Llama-3.1-8B-Instruct
53
57
  ```
54
58
  You should see an output like the following:
55
59
 
56
- <img width="600" alt="launch_img" src="https://github.com/user-attachments/assets/883e6a5b-8016-4837-8fdf-39097dfb18bf">
60
+ <img width="600" alt="launch_image" src="https://github.com/user-attachments/assets/a72a99fd-4bf2-408e-8850-359761d96c4f">
57
61
 
58
62
 
59
63
  #### Overrides
60
64
 
61
- Models that are already supported by `vec-inf` would be launched using the cached configuration or [default configuration](vec_inf/config/models.yaml). You can override these values by providing additional parameters. Use `vec-inf launch --help` to see the full list of parameters that can be
65
+ Models that are already supported by `vec-inf` will be launched using the cached configuration (set in [slurm_vars.py](vec_inf/client/slurm_vars.py)) or the [default configuration](vec_inf/config/models.yaml). You can override these values by providing additional parameters. Use `vec-inf launch --help` to see the full list of parameters that can be
62
66
  overridden. For example, to override `qos`:
63
67
 
64
68
  ```bash
65
69
  vec-inf launch Meta-Llama-3.1-8B-Instruct --qos <new_qos>
66
70
  ```
67
71
 
72
+ To override the default vLLM engine arguments, specify them as a comma-separated string:
73
+
74
+ ```bash
75
+ vec-inf launch Meta-Llama-3.1-8B-Instruct --vllm-args '--max-model-len=65536,--compilation-config=3'
76
+ ```
77
+
78
+ The full list of vLLM engine arguments can be found [here](https://docs.vllm.ai/en/stable/serving/engine_args.html); make sure you select the correct vLLM version.
79
+
68
80
  #### Custom models
69
81
 
70
82
  You can also launch your own custom model as long as the model architecture is [supported by vLLM](https://docs.vllm.ai/en/stable/models/supported_models.html); make sure to follow the instructions below:
@@ -89,14 +101,14 @@ models:
89
101
  gpus_per_node: 1
90
102
  num_nodes: 1
91
103
  vocab_size: 152064
92
- max_model_len: 1010000
93
- max_num_seqs: 256
94
- pipeline_parallelism: true
95
- enforce_eager: false
96
104
  qos: m2
97
105
  time: 08:00:00
98
106
  partition: a40
99
107
  model_weights_parent_dir: /h/<username>/model-weights
108
+ vllm_args:
109
+ --max-model-len: 1010000
110
+ --max-num-seqs: 256
111
+ --compilation-config: 3
100
112
  ```
101
113
 
102
114
  You would then set the `VEC_INF_CONFIG` path using:
@@ -105,68 +117,44 @@ You would then set the `VEC_INF_CONFIG` path using:
105
117
  export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml
106
118
  ```
107
119
 
108
- Note that there are other parameters that can also be added to the config but not shown in this example, such as `data_type` and `log_dir`.
120
+ **NOTE**
121
+ * Other parameters can also be added to the config beyond those shown in this example; check [`ModelConfig`](vec_inf/client/config.py) for details.
122
+ * Check [vLLM Engine Arguments](https://docs.vllm.ai/en/stable/serving/engine_args.html) for the full list of available vLLM engine arguments; the default size for each parallelization dimension is 1, so no parallel sizes are set explicitly in this example.
123
+ * GPU partitions with non-Ampere architectures, e.g. `rtx6000` and `t4v2`, don't support BF16. For models that default to BF16, use FP16 on these partitions instead, i.e. `--dtype: float16`.
124
+ * Setting `--compilation-config` to `3` currently breaks multi-node model launches, so it isn't set for models that require multiple nodes of GPUs.
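+
+ Putting the pieces together, a minimal sketch (the model name below is a placeholder and must match an entry defined in your config file):
+
+ ```bash
+ # Point vec-inf at the custom config, then launch the model it defines.
+ export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml
+ vec-inf launch <custom-model-name>
+ ```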
109
125
 
110
- ### `status` command
111
- You can check the inference server status by providing the Slurm job ID to the `status` command:
112
- ```bash
113
- vec-inf status 15373800
114
- ```
115
-
116
- If the server is pending for resources, you should see an output like this:
117
-
118
- <img width="400" alt="status_pending_img" src="https://github.com/user-attachments/assets/b659c302-eae1-4560-b7a9-14eb3a822a2f">
126
+ #### Other commands
119
127
 
120
- When the server is ready, you should see an output like this:
128
+ * `status`: Check a model's status by providing its Slurm job ID; `--json-mode` supported.
129
+ * `metrics`: Stream performance metrics to the console by providing the model's Slurm job ID.
130
+ * `shutdown`: Shut down a model by providing its Slurm job ID.
131
+ * `list`: List all available model names, or view the default/cached configuration of a specific model; `--json-mode` supported.
121
132
 
122
- <img width="400" alt="status_ready_img" src="https://github.com/user-attachments/assets/672986c2-736c-41ce-ac7c-1fb585cdcb0d">
133
+ For more details on the usage of these commands, refer to the [User Guide](https://vectorinstitute.github.io/vector-inference/user_guide/).
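+
+ A quick usage sketch (the Slurm job ID `15373800` is a hypothetical example; use the ID returned by `vec-inf launch`):
+
+ ```bash
+ vec-inf status 15373800 --json-mode      # check server status, output as JSON
+ vec-inf metrics 15373800                 # stream performance metrics to the console
+ vec-inf list Meta-Llama-3.1-8B-Instruct  # view a model's default/cached configuration
+ vec-inf shutdown 15373800                # shut the model down when finished
+ ```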
123
134
 
124
- There are 5 possible states:
135
+ ### API
125
136
 
126
- * **PENDING**: Job submitted to Slurm, but not executed yet. Job pending reason will be shown.
127
- * **LAUNCHING**: Job is running but the server is not ready yet.
128
- * **READY**: Inference server running and ready to take requests.
129
- * **FAILED**: Inference server in an unhealthy state. Job failed reason will be shown.
130
- * **SHUTDOWN**: Inference server is shutdown/cancelled.
137
+ Example:
131
138
 
132
- Note that the base URL is only available when model is in `READY` state, and if you've changed the Slurm log directory path, you also need to specify it when using the `status` command.
133
-
134
- ### `metrics` command
135
- Once your server is ready, you can check performance metrics by providing the Slurm job ID to the `metrics` command:
136
- ```bash
137
- vec-inf metrics 15373800
139
+ ```python
140
+ >>> from vec_inf.api import VecInfClient, ModelStatus  # ModelStatus assumed to be exported from the same module
141
+ >>> client = VecInfClient()
142
+ >>> response = client.launch_model("Meta-Llama-3.1-8B-Instruct")
143
+ >>> job_id = response.slurm_job_id
144
+ >>> status = client.get_status(job_id)
145
+ >>> if status.status == ModelStatus.READY:
146
+ ... print(f"Model is ready at {status.base_url}")
147
+ >>> client.shutdown_model(job_id)
138
148
  ```
139
149
 
140
- And you will see the performance metrics streamed to your console, note that the metrics are updated with a 2-second interval.
150
+ For details on the usage of the API, refer to the [API Reference](https://vectorinstitute.github.io/vector-inference/api/).
141
151
 
142
- <img width="400" alt="metrics_img" src="https://github.com/user-attachments/assets/3ee143d0-1a71-4944-bbd7-4c3299bf0339">
152
+ ## Check Job Configuration
143
153
 
144
- ### `shutdown` command
145
- Finally, when you're finished using a model, you can shut it down by providing the Slurm job ID:
146
- ```bash
147
- vec-inf shutdown 15373800
148
-
149
- > Shutting down model with Slurm Job ID: 15373800
150
- ```
151
-
152
- ### `list` command
153
- You call view the full list of available models by running the `list` command:
154
- ```bash
155
- vec-inf list
156
- ```
157
- <img width="940" alt="list_img" src="https://github.com/user-attachments/assets/8cf901c4-404c-4398-a52f-0486f00747a3">
158
-
159
- NOTE: The above screenshot does not represent the full list of models supported.
160
-
161
- You can also view the default setup for a specific supported model by providing the model name, for example `Meta-Llama-3.1-70B-Instruct`:
162
- ```bash
163
- vec-inf list Meta-Llama-3.1-70B-Instruct
164
- ```
165
- <img width="500" alt="list_model_img" src="https://github.com/user-attachments/assets/34e53937-2d86-443e-85f6-34e408653ddb">
166
-
167
- `launch`, `list`, and `status` command supports `--json-mode`, where the command output would be structured as a JSON string.
154
+ With every model launch, a Slurm script is generated dynamically based on the job and model configuration. Once the Slurm job is queued, the generated script is moved to the log directory for reproducibility, located at `$log_dir/$model_family/$model_name.$slurm_job_id/$model_name.$slurm_job_id.slurm`. In the same directory you will also find a JSON file with the same name that captures the launch configuration and records the server URL once the server is ready.
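+
+ For example, assuming the default log directory `~/.vec-inf-logs`, a model family directory of `Meta-Llama-3.1`, and a hypothetical Slurm job ID `15373800`, the layout would look roughly like this:
+
+ ```bash
+ ls ~/.vec-inf-logs/Meta-Llama-3.1/Meta-Llama-3.1-8B-Instruct.15373800/
+ # Meta-Llama-3.1-8B-Instruct.15373800.slurm  -> generated Slurm script
+ # Meta-Llama-3.1-8B-Instruct.15373800.json   -> launch configuration (server URL added once ready)
+ ```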
168
155
 
169
156
  ## Send inference requests
157
+
170
158
  Once the inference server is ready, you can start sending inference requests. We provide example scripts for sending inference requests in the [`examples`](examples) folder; a raw `curl` sketch is also shown after the example output below. Make sure to update the model server URL and the model weights location in the scripts. For example, you can run `python examples/inference/llm/chat_completions.py`, and you should expect to see an output like the following:
171
159
 
172
160
  ```json
@@ -199,8 +187,9 @@ Once the inference server is ready, you can start sending in inference requests.
199
187
  },
200
188
  "prompt_logprobs":null
201
189
  }
190
+
202
191
  ```
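+
+ Since the server exposes an OpenAI-compatible API, you can also send a request directly with `curl`. This is a minimal sketch; the host and port are placeholders, so substitute the base URL reported by `vec-inf status`:
+
+ ```bash
+ curl http://<node>:<port>/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+         "model": "Meta-Llama-3.1-8B-Instruct",
+         "messages": [{"role": "user", "content": "What is the capital of Canada?"}]
+       }'
+ ```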
203
- **NOTE**: For multimodal models, currently only `ChatCompletion` is available, and only one image can be provided for each prompt.
192
+ **NOTE**: Certain models don't adhere to OpenAI's chat template, e.g. the Mistral family. For these models, either change your prompt to follow the model's default chat template or provide your own chat template via `--chat-template: TEMPLATE_PATH`.
204
193
 
205
194
  ## SSH tunnel from your local device
206
195
  If you want to run inference from your local device, you can open an SSH tunnel to your cluster environment like the following:
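+
+ A generic sketch (hostnames, port, and username are placeholders; use the values reported for your own job):
+
+ ```bash
+ # Forward a local port to the node and port where the inference server is running.
+ ssh -L 8080:<compute_node>:<vllm_port> <username>@<cluster_login_node> -N
+ ```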
@@ -0,0 +1,25 @@
1
+ vec_inf/README.md,sha256=3ocJHfV3kRftXFUCdHw3B-p4QQlXuNqkHnjPPNkCgfM,543
2
+ vec_inf/__init__.py,sha256=bHwSIz9lebYuxIemni-lP0h3gwJHVbJnwExQKGJWw_Q,23
3
+ vec_inf/find_port.sh,sha256=bGQ6LYSFVSsfDIGatrSg5YvddbZfaPL0R-Bjo4KYD6I,1088
4
+ vec_inf/cli/__init__.py,sha256=5XIvGQCOnaGl73XMkwetjC-Ul3xuXGrWDXdYJ3aUzvU,27
5
+ vec_inf/cli/_cli.py,sha256=pqZeQr5WxAsV7KSYcUnx_mRL7RnHWk1zf9CcW_ct5uI,10663
6
+ vec_inf/cli/_helper.py,sha256=i1QvJeIT3z7me6bv2Vot5c3NY555Dgo3q8iRlxhOlZ4,13047
7
+ vec_inf/cli/_utils.py,sha256=23vSbmvNOWY1-W1aOAwYqNDkDDmx-5UVlCiXAtxUZ8A,1057
8
+ vec_inf/cli/_vars.py,sha256=V6DrJs_BuUa4yNcbBSSnMwpcyXwEBsizy3D0ubIg2fA,777
9
+ vec_inf/client/__init__.py,sha256=OLlUJ4kL1R-Kh-nXNbvKlAZ3mtHcnozHprVufkVCNWk,739
10
+ vec_inf/client/_client_vars.py,sha256=KG-xImVIzJH3aj5nMUzT9w9LpH-7YGrOew6N77Fj0Js,7638
11
+ vec_inf/client/_exceptions.py,sha256=94Nx_5k1SriJNXzbdnwyXFZolyMutydU08Gsikawzzo,749
12
+ vec_inf/client/_helper.py,sha256=DcEFogbrSb4A8Kc2zixNZNL4nt4iswPk2n5blZgwEWQ,22338
13
+ vec_inf/client/_slurm_script_generator.py,sha256=XYCsadCLDEu9KrrjrNCNgoc0ITmjys9u7yWR9PkFAos,6376
14
+ vec_inf/client/_utils.py,sha256=1dB2O1neEhZNk6MJbBybLQm42vsmEevA2TI0F_kGi0o,8796
15
+ vec_inf/client/api.py,sha256=TYn4lP5Ene8MEuXWYo6ZbGYw9aPnaMlT32SH7jLCifM,9605
16
+ vec_inf/client/config.py,sha256=lPVHwiaGZjKd5M9G7vcsk3DMausFP_telq3JQngBkH8,5080
17
+ vec_inf/client/models.py,sha256=qjocUa5egJTVeVF3962kYOecs1dTaEb2e6TswkYFXM0,6141
18
+ vec_inf/client/slurm_vars.py,sha256=lroK41L4gEVVZNxxE3bEpbKsdMwnH79-7iCKd4zWEa4,1069
19
+ vec_inf/config/README.md,sha256=OlgnD_Ojei_xLkNyS7dGvYMFUzQFqjVRVw0V-QMk_3g,17863
20
+ vec_inf/config/models.yaml,sha256=xImSOjG9yL6LqqYkSLL7_wBZhqKM10-eFaQJ82gP4ig,29420
21
+ vec_inf-0.6.1.dist-info/METADATA,sha256=0YHT8rhEZINfmMF1hQBqU0HBpRbwX-1IeqY_Mla4g28,10682
22
+ vec_inf-0.6.1.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
23
+ vec_inf-0.6.1.dist-info/entry_points.txt,sha256=uNRXjCuJSR2nveEqD3IeMznI9oVI9YLZh5a24cZg6B0,49
24
+ vec_inf-0.6.1.dist-info/licenses/LICENSE,sha256=mq8zeqpvVSF1EsxmydeXcokt8XnEIfSofYn66S2-cJI,1073
25
+ vec_inf-0.6.1.dist-info/RECORD,,
vec_inf/cli/_config.py DELETED
@@ -1,87 +0,0 @@
1
- """Model configuration."""
2
-
3
- from pathlib import Path
4
- from typing import Optional, Union
5
-
6
- from pydantic import BaseModel, ConfigDict, Field
7
- from typing_extensions import Literal
8
-
9
-
10
- QOS = Literal[
11
- "normal",
12
- "m",
13
- "m2",
14
- "m3",
15
- "m4",
16
- "m5",
17
- "long",
18
- "deadline",
19
- "high",
20
- "scavenger",
21
- "llm",
22
- "a100",
23
- ]
24
-
25
- PARTITION = Literal["a40", "a100", "t4v1", "t4v2", "rtx6000"]
26
-
27
- DATA_TYPE = Literal["auto", "float16", "bfloat16", "float32"]
28
-
29
-
30
- class ModelConfig(BaseModel):
31
- """Pydantic model for validating and managing model deployment configurations."""
32
-
33
- model_name: str = Field(..., min_length=3, pattern=r"^[a-zA-Z0-9\-_\.]+$")
34
- model_family: str = Field(..., min_length=2)
35
- model_variant: Optional[str] = Field(
36
- default=None, description="Specific variant/version of the model family"
37
- )
38
- model_type: Literal["LLM", "VLM", "Text_Embedding", "Reward_Modeling"] = Field(
39
- ..., description="Type of model architecture"
40
- )
41
- gpus_per_node: int = Field(..., gt=0, le=8, description="GPUs per node")
42
- num_nodes: int = Field(..., gt=0, le=16, description="Number of nodes")
43
- vocab_size: int = Field(..., gt=0, le=1_000_000)
44
- max_model_len: int = Field(
45
- ..., gt=0, le=1_010_000, description="Maximum context length supported"
46
- )
47
- max_num_seqs: int = Field(
48
- default=256, gt=0, le=1024, description="Maximum concurrent request sequences"
49
- )
50
- compilation_config: int = Field(
51
- default=0,
52
- gt=-1,
53
- le=4,
54
- description="torch.compile optimization level",
55
- )
56
- gpu_memory_utilization: float = Field(
57
- default=0.9, gt=0.0, le=1.0, description="GPU memory utilization"
58
- )
59
- pipeline_parallelism: bool = Field(
60
- default=True, description="Enable pipeline parallelism"
61
- )
62
- enforce_eager: bool = Field(default=False, description="Force eager mode execution")
63
- qos: Union[QOS, str] = Field(default="m2", description="Quality of Service tier")
64
- time: str = Field(
65
- default="08:00:00",
66
- pattern=r"^\d{2}:\d{2}:\d{2}$",
67
- description="HH:MM:SS time limit",
68
- )
69
- partition: Union[PARTITION, str] = Field(
70
- default="a40", description="GPU partition type"
71
- )
72
- data_type: Union[DATA_TYPE, str] = Field(
73
- default="auto", description="Model precision format"
74
- )
75
- venv: str = Field(
76
- default="singularity", description="Virtual environment/container system"
77
- )
78
- log_dir: Path = Field(
79
- default=Path("~/.vec-inf-logs").expanduser(), description="Log directory path"
80
- )
81
- model_weights_parent_dir: Path = Field(
82
- default=Path("/model-weights"), description="Base directory for model weights"
83
- )
84
-
85
- model_config = ConfigDict(
86
- extra="forbid", str_strip_whitespace=True, validate_default=True, frozen=True
87
- )
@@ -1,154 +0,0 @@
1
- #!/bin/bash
2
- #SBATCH --cpus-per-task=16
3
- #SBATCH --mem=64G
4
- #SBATCH --exclusive
5
- #SBATCH --tasks-per-node=1
6
-
7
- source ${SRC_DIR}/find_port.sh
8
-
9
- if [ "$VENV_BASE" = "singularity" ]; then
10
- export SINGULARITY_IMAGE=/model-weights/vec-inf-shared/vector-inference_latest.sif
11
- export VLLM_NCCL_SO_PATH=/vec-inf/nccl/libnccl.so.2.18.1
12
- module load singularity-ce/3.8.2
13
- singularity exec $SINGULARITY_IMAGE ray stop
14
- fi
15
-
16
- # Getting the node names
17
- nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
18
- nodes_array=($nodes)
19
-
20
- head_node=${nodes_array[0]}
21
- head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
22
-
23
- # Find port for head node
24
- head_node_port=$(find_available_port $head_node_ip 8080 65535)
25
-
26
- # Starting the Ray head node
27
- ip_head=$head_node_ip:$head_node_port
28
- export ip_head
29
- echo "IP Head: $ip_head"
30
-
31
- echo "Starting HEAD at $head_node"
32
- if [ "$VENV_BASE" = "singularity" ]; then
33
- srun --nodes=1 --ntasks=1 -w "$head_node" \
34
- singularity exec --nv --bind ${MODEL_WEIGHTS}:${MODEL_WEIGHTS} $SINGULARITY_IMAGE \
35
- ray start --head --node-ip-address="$head_node_ip" --port=$head_node_port \
36
- --num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_NODE}" --block &
37
- else
38
- srun --nodes=1 --ntasks=1 -w "$head_node" \
39
- ray start --head --node-ip-address="$head_node_ip" --port=$head_node_port \
40
- --num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_NODE}" --block &
41
- fi
42
-
43
- # Starting the Ray worker nodes
44
- # Optional, though may be useful in certain versions of Ray < 1.0.
45
- sleep 10
46
-
47
- # number of nodes other than the head node
48
- worker_num=$((SLURM_JOB_NUM_NODES - 1))
49
-
50
- for ((i = 1; i <= worker_num; i++)); do
51
- node_i=${nodes_array[$i]}
52
- echo "Starting WORKER $i at $node_i"
53
- if [ "$VENV_BASE" = "singularity" ]; then
54
- srun --nodes=1 --ntasks=1 -w "$node_i" \
55
- singularity exec --nv --bind ${MODEL_WEIGHTS}:${MODEL_WEIGHTS} $SINGULARITY_IMAGE \
56
- ray start --address "$ip_head" \
57
- --num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_NODE}" --block &
58
- else
59
- srun --nodes=1 --ntasks=1 -w "$node_i" \
60
- ray start --address "$ip_head" \
61
- --num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_NODE}" --block &
62
- fi
63
-
64
- sleep 5
65
- done
66
-
67
-
68
- vllm_port_number=$(find_available_port $head_node_ip 8080 65535)
69
-
70
- SERVER_ADDR="http://${head_node_ip}:${vllm_port_number}/v1"
71
- echo "Server address: $SERVER_ADDR"
72
-
73
- jq --arg server_addr "$SERVER_ADDR" \
74
- '. + {"server_address": $server_addr}' \
75
- "$LOG_DIR/$MODEL_NAME.$SLURM_JOB_ID/$MODEL_NAME.$SLURM_JOB_ID.json" > temp.json \
76
- && mv temp.json "$LOG_DIR/$MODEL_NAME.$SLURM_JOB_ID/$MODEL_NAME.$SLURM_JOB_ID.json" \
77
- && rm temp.json
78
-
79
- if [ "$PIPELINE_PARALLELISM" = "True" ]; then
80
- export PIPELINE_PARALLEL_SIZE=$SLURM_JOB_NUM_NODES
81
- export TENSOR_PARALLEL_SIZE=$SLURM_GPUS_PER_NODE
82
- else
83
- export PIPELINE_PARALLEL_SIZE=1
84
- export TENSOR_PARALLEL_SIZE=$((SLURM_JOB_NUM_NODES*SLURM_GPUS_PER_NODE))
85
- fi
86
-
87
- if [ "$ENFORCE_EAGER" = "True" ]; then
88
- export ENFORCE_EAGER="--enforce-eager"
89
- else
90
- export ENFORCE_EAGER=""
91
- fi
92
-
93
- if [ "$ENABLE_PREFIX_CACHING" = "True" ]; then
94
- export ENABLE_PREFIX_CACHING="--enable-prefix-caching"
95
- else
96
- export ENABLE_PREFIX_CACHING=""
97
- fi
98
-
99
- if [ "$ENABLE_CHUNKED_PREFILL" = "True" ]; then
100
- export ENABLE_CHUNKED_PREFILL="--enable-chunked-prefill"
101
- else
102
- export ENABLE_CHUNKED_PREFILL=""
103
- fi
104
-
105
- if [ -z "$MAX_NUM_BATCHED_TOKENS" ]; then
106
- export MAX_NUM_BATCHED_TOKENS=""
107
- else
108
- export MAX_NUM_BATCHED_TOKENS="--max-num-batched-tokens=$MAX_NUM_BATCHED_TOKENS"
109
- fi
110
-
111
- # Activate vllm venv
112
- if [ "$VENV_BASE" = "singularity" ]; then
113
- singularity exec --nv --bind ${MODEL_WEIGHTS}:${MODEL_WEIGHTS} $SINGULARITY_IMAGE \
114
- python3.10 -m vllm.entrypoints.openai.api_server \
115
- --model ${MODEL_WEIGHTS} \
116
- --served-model-name ${MODEL_NAME} \
117
- --host "0.0.0.0" \
118
- --port ${vllm_port_number} \
119
- --pipeline-parallel-size ${PIPELINE_PARALLEL_SIZE} \
120
- --tensor-parallel-size ${TENSOR_PARALLEL_SIZE} \
121
- --dtype ${DATA_TYPE} \
122
- --trust-remote-code \
123
- --max-logprobs ${MAX_LOGPROBS} \
124
- --max-model-len ${MAX_MODEL_LEN} \
125
- --max-num-seqs ${MAX_NUM_SEQS} \
126
- --gpu-memory-utilization ${GPU_MEMORY_UTILIZATION} \
127
- --compilation-config ${COMPILATION_CONFIG} \
128
- --task ${TASK} \
129
- ${MAX_NUM_BATCHED_TOKENS} \
130
- ${ENABLE_PREFIX_CACHING} \
131
- ${ENABLE_CHUNKED_PREFILL} \
132
- ${ENFORCE_EAGER}
133
- else
134
- source ${VENV_BASE}/bin/activate
135
- python3 -m vllm.entrypoints.openai.api_server \
136
- --model ${MODEL_WEIGHTS} \
137
- --served-model-name ${MODEL_NAME} \
138
- --host "0.0.0.0" \
139
- --port ${vllm_port_number} \
140
- --pipeline-parallel-size ${PIPELINE_PARALLEL_SIZE} \
141
- --tensor-parallel-size ${TENSOR_PARALLEL_SIZE} \
142
- --dtype ${DATA_TYPE} \
143
- --trust-remote-code \
144
- --max-logprobs ${MAX_LOGPROBS} \
145
- --max-model-len ${MAX_MODEL_LEN} \
146
- --max-num-seqs ${MAX_NUM_SEQS} \
147
- --gpu-memory-utilization ${GPU_MEMORY_UTILIZATION} \
148
- --compilation-config ${COMPILATION_CONFIG} \
149
- --task ${TASK} \
150
- ${MAX_NUM_BATCHED_TOKENS} \
151
- ${ENABLE_PREFIX_CACHING} \
152
- ${ENABLE_CHUNKED_PREFILL} \
153
- ${ENFORCE_EAGER}
154
- fi
vec_inf/vllm.slurm DELETED
@@ -1,90 +0,0 @@
1
- #!/bin/bash
2
- #SBATCH --cpus-per-task=16
3
- #SBATCH --mem=64G
4
-
5
- source ${SRC_DIR}/find_port.sh
6
-
7
- # Write server url to file
8
- hostname=${SLURMD_NODENAME}
9
- vllm_port_number=$(find_available_port $hostname 8080 65535)
10
-
11
- SERVER_ADDR="http://${hostname}:${vllm_port_number}/v1"
12
- echo "Server address: $SERVER_ADDR"
13
-
14
- jq --arg server_addr "$SERVER_ADDR" \
15
- '. + {"server_address": $server_addr}' \
16
- "$LOG_DIR/$MODEL_NAME.$SLURM_JOB_ID/$MODEL_NAME.$SLURM_JOB_ID.json" > temp.json \
17
- && mv temp.json "$LOG_DIR/$MODEL_NAME.$SLURM_JOB_ID/$MODEL_NAME.$SLURM_JOB_ID.json" \
18
- && rm temp.json
19
-
20
- if [ "$ENFORCE_EAGER" = "True" ]; then
21
- export ENFORCE_EAGER="--enforce-eager"
22
- else
23
- export ENFORCE_EAGER=""
24
- fi
25
-
26
- if [ "$ENABLE_PREFIX_CACHING" = "True" ]; then
27
- export ENABLE_PREFIX_CACHING="--enable-prefix-caching"
28
- else
29
- export ENABLE_PREFIX_CACHING=""
30
- fi
31
-
32
- if [ "$ENABLE_CHUNKED_PREFILL" = "True" ]; then
33
- export ENABLE_CHUNKED_PREFILL="--enable-chunked-prefill"
34
- else
35
- export ENABLE_CHUNKED_PREFILL=""
36
- fi
37
-
38
- if [ -z "$MAX_NUM_BATCHED_TOKENS" ]; then
39
- export MAX_NUM_BATCHED_TOKENS=""
40
- else
41
- export MAX_NUM_BATCHED_TOKENS="--max-num-batched-tokens=$MAX_NUM_BATCHED_TOKENS"
42
- fi
43
-
44
- # Activate vllm venv
45
- if [ "$VENV_BASE" = "singularity" ]; then
46
- export SINGULARITY_IMAGE=/model-weights/vec-inf-shared/vector-inference_latest.sif
47
- export VLLM_NCCL_SO_PATH=/vec-inf/nccl/libnccl.so.2.18.1
48
- module load singularity-ce/3.8.2
49
- singularity exec $SINGULARITY_IMAGE ray stop
50
- singularity exec --nv --bind ${MODEL_WEIGHTS}:${MODEL_WEIGHTS} $SINGULARITY_IMAGE \
51
- python3.10 -m vllm.entrypoints.openai.api_server \
52
- --model ${MODEL_WEIGHTS} \
53
- --served-model-name ${MODEL_NAME} \
54
- --host "0.0.0.0" \
55
- --port ${vllm_port_number} \
56
- --tensor-parallel-size ${SLURM_GPUS_PER_NODE} \
57
- --dtype ${DATA_TYPE} \
58
- --max-logprobs ${MAX_LOGPROBS} \
59
- --trust-remote-code \
60
- --max-model-len ${MAX_MODEL_LEN} \
61
- --max-num-seqs ${MAX_NUM_SEQS} \
62
- --gpu-memory-utilization ${GPU_MEMORY_UTILIZATION} \
63
- --compilation-config ${COMPILATION_CONFIG} \
64
- --task ${TASK} \
65
- ${MAX_NUM_BATCHED_TOKENS} \
66
- ${ENABLE_PREFIX_CACHING} \
67
- ${ENABLE_CHUNKED_PREFILL} \
68
- ${ENFORCE_EAGER}
69
-
70
- else
71
- source ${VENV_BASE}/bin/activate
72
- python3 -m vllm.entrypoints.openai.api_server \
73
- --model ${MODEL_WEIGHTS} \
74
- --served-model-name ${MODEL_NAME} \
75
- --host "0.0.0.0" \
76
- --port ${vllm_port_number} \
77
- --tensor-parallel-size ${SLURM_GPUS_PER_NODE} \
78
- --dtype ${DATA_TYPE} \
79
- --max-logprobs ${MAX_LOGPROBS} \
80
- --trust-remote-code \
81
- --max-model-len ${MAX_MODEL_LEN} \
82
- --max-num-seqs ${MAX_NUM_SEQS} \
83
- --gpu-memory-utilization ${GPU_MEMORY_UTILIZATION} \
84
- --compilation-config ${COMPILATION_CONFIG} \
85
- --task ${TASK} \
86
- ${MAX_NUM_BATCHED_TOKENS} \
87
- ${ENABLE_PREFIX_CACHING} \
88
- ${ENABLE_CHUNKED_PREFILL} \
89
- ${ENFORCE_EAGER}
90
- fi
@@ -1,17 +0,0 @@
1
- vec_inf/README.md,sha256=dxX0xKfwLioG0mJ2YFv5JJ5q1m5NlWBrVBOap1wuHfQ,624
2
- vec_inf/__init__.py,sha256=bHwSIz9lebYuxIemni-lP0h3gwJHVbJnwExQKGJWw_Q,23
3
- vec_inf/find_port.sh,sha256=bGQ6LYSFVSsfDIGatrSg5YvddbZfaPL0R-Bjo4KYD6I,1088
4
- vec_inf/multinode_vllm.slurm,sha256=V01ayfgObPdxbQqhYvCbNIx0zqpLurDxZhS0UHYNFi0,5210
5
- vec_inf/vllm.slurm,sha256=VMMTdVUOtX4-Yv43yzgKiEpE56fMwuR0KOLf3Dar_S0,2884
6
- vec_inf/cli/__init__.py,sha256=5XIvGQCOnaGl73XMkwetjC-Ul3xuXGrWDXdYJ3aUzvU,27
7
- vec_inf/cli/_cli.py,sha256=8Gk4NRbrY2-3EX0S8_-1UOmGfahzqX0A2sJNVcL7OL8,6525
8
- vec_inf/cli/_config.py,sha256=pb2ERbxZoRZBa9Ie7-jlzyQiiXYZgqUetbw13Blryho,2841
9
- vec_inf/cli/_helper.py,sha256=9niBuFoaJfeP2yRHKcrkia4rwCZbAfoz-4s2MVCwA9w,26871
10
- vec_inf/cli/_utils.py,sha256=i3mffIJ-wBVpe3pz0mYa2W5J42yNRmUG2tQwABhQJDQ,5365
11
- vec_inf/config/README.md,sha256=3MYYY3hGw7jiMR5A8CMdBkhGFobcdH3Kip5E2saq_T4,18609
12
- vec_inf/config/models.yaml,sha256=p__omZEmoF93BtVQXQia43xvPX88ALrBT1tMiY2Bdhk,28787
13
- vec_inf-0.5.0.dist-info/METADATA,sha256=khBhIsW5hFjrp-ZQFU3wa5htLVkp4CH3JnrhLXeHfss,10228
14
- vec_inf-0.5.0.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
15
- vec_inf-0.5.0.dist-info/entry_points.txt,sha256=uNRXjCuJSR2nveEqD3IeMznI9oVI9YLZh5a24cZg6B0,49
16
- vec_inf-0.5.0.dist-info/licenses/LICENSE,sha256=mq8zeqpvVSF1EsxmydeXcokt8XnEIfSofYn66S2-cJI,1073
17
- vec_inf-0.5.0.dist-info/RECORD,,