vec-inf 0.5.0__py3-none-any.whl → 0.6.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: vec-inf
- Version: 0.5.0
+ Version: 0.6.0
  Summary: Efficient LLM inference on Slurm clusters using vLLM.
  Author-email: Marshall Wang <marshall.wang@vectorinstitute.ai>
  License-Expression: MIT
@@ -25,12 +25,13 @@ Description-Content-Type: text/markdown
  ----------------------------------------------------

  [![PyPI](https://img.shields.io/pypi/v/vec-inf)](https://pypi.org/project/vec-inf)
+ [![downloads](https://img.shields.io/pypi/dm/vec-inf)](https://pypistats.org/packages/vec-inf)
  [![code checks](https://github.com/VectorInstitute/vector-inference/actions/workflows/code_checks.yml/badge.svg)](https://github.com/VectorInstitute/vector-inference/actions/workflows/code_checks.yml)
- [![docs](https://github.com/VectorInstitute/vector-inference/actions/workflows/docs_deploy.yml/badge.svg)](https://github.com/VectorInstitute/vector-inference/actions/workflows/docs_deploy.yml)
- [![codecov](https://codecov.io/github/VectorInstitute/vector-inference/branch/develop/graph/badge.svg?token=NI88QSIGAC)](https://app.codecov.io/github/VectorInstitute/vector-inference/tree/develop)
+ [![docs](https://github.com/VectorInstitute/vector-inference/actions/workflows/docs.yml/badge.svg)](https://github.com/VectorInstitute/vector-inference/actions/workflows/docs.yml)
+ [![codecov](https://codecov.io/github/VectorInstitute/vector-inference/branch/main/graph/badge.svg?token=NI88QSIGAC)](https://app.codecov.io/github/VectorInstitute/vector-inference/tree/main)
  ![GitHub License](https://img.shields.io/github/license/VectorInstitute/vector-inference)

- This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository runs natively on the Vector Institute cluster environment**. To adapt to other environments, update the environment variables in [`cli/_helper.py`](vec_inf/cli/_helper.py), [`cli/_config.py`](vec_inf/cli/_config.py), [`vllm.slurm`](vec_inf/vllm.slurm), [`multinode_vllm.slurm`](vec_inf/multinode_vllm.slurm) and [`models.yaml`](vec_inf/config/models.yaml) accordingly.
+ This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository run natively on the Vector Institute cluster environment**. To adapt to other environments, update the environment variables in [`vec_inf/client/slurm_vars.py`](vec_inf/client/slurm_vars.py) and the model config for cached model weights in [`vec_inf/config/models.yaml`](vec_inf/config/models.yaml) accordingly.

  ## Installation
  If you are using the Vector cluster environment, and you don't need any customization to the inference server environment, run the following to install the package:
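The install command itself falls outside this hunk. As a minimal sketch, assuming the package is installed from PyPI under the name shown in the metadata above (`vec-inf`), and pinning the release this diff describes:

```bash
# Minimal sketch: install the published wheel from PyPI.
# The package name is taken from the Name field in the metadata above;
# drop the version pin to get the latest release instead.
pip install 'vec-inf==0.6.0'
```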
@@ -42,7 +43,9 @@ Otherwise, we recommend using the provided [`Dockerfile`](Dockerfile) to set up

  ## Usage

- ### `launch` command
+ Vector Inference provides two user interfaces: a CLI and an API.
+
+ ### CLI

  The `launch` command allows users to deploy a model as a slurm job. If the job successfully launches, a URL endpoint is exposed for the user to send requests for inference.

@@ -53,18 +56,26 @@ vec-inf launch Meta-Llama-3.1-8B-Instruct
  ```
  You should see an output like the following:

- <img width="600" alt="launch_img" src="https://github.com/user-attachments/assets/883e6a5b-8016-4837-8fdf-39097dfb18bf">
+ <img width="600" alt="launch_image" src="https://github.com/user-attachments/assets/a72a99fd-4bf2-408e-8850-359761d96c4f">


  #### Overrides

- Models that are already supported by `vec-inf` would be launched using the cached configuration or [default configuration](vec_inf/config/models.yaml). You can override these values by providing additional parameters. Use `vec-inf launch --help` to see the full list of parameters that can be
+ Models that are already supported by `vec-inf` will be launched using the cached configuration (set in [slurm_vars.py](vec_inf/client/slurm_vars.py)) or the [default configuration](vec_inf/config/models.yaml). You can override these values by providing additional parameters. Use `vec-inf launch --help` to see the full list of parameters that can be
  overridden. For example, if `qos` is to be overridden:

  ```bash
  vec-inf launch Meta-Llama-3.1-8B-Instruct --qos <new_qos>
  ```

+ To override the default vLLM engine arguments, you can specify the engine arguments in a comma-separated string:
+
+ ```bash
+ vec-inf launch Meta-Llama-3.1-8B-Instruct --vllm-args '--max-model-len=65536,--compilation-config=3'
+ ```
+
+ For the full list of vLLM engine arguments, see the [vLLM documentation](https://docs.vllm.ai/en/stable/serving/engine_args.html); make sure to select the correct vLLM version.
+
  #### Custom models

  You can also launch your own custom model as long as the model architecture is [supported by vLLM](https://docs.vllm.ai/en/stable/models/supported_models.html); make sure to follow the instructions below:
@@ -89,14 +100,14 @@ models:
  gpus_per_node: 1
  num_nodes: 1
  vocab_size: 152064
- max_model_len: 1010000
- max_num_seqs: 256
- pipeline_parallelism: true
- enforce_eager: false
  qos: m2
  time: 08:00:00
  partition: a40
  model_weights_parent_dir: /h/<username>/model-weights
+ vllm_args:
+ --max-model-len: 1010000
+ --max-num-seqs: 256
+ --compilation-config: 3
  ```

  You would then set the `VEC_INF_CONFIG` path using:
@@ -105,68 +116,40 @@ You would then set the `VEC_INF_CONFIG` path using:
  export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml
  ```

- Note that there are other parameters that can also be added to the config but not shown in this example, such as `data_type` and `log_dir`.
-
- ### `status` command
- You can check the inference server status by providing the Slurm job ID to the `status` command:
- ```bash
- vec-inf status 15373800
- ```
-
- If the server is pending for resources, you should see an output like this:
-
- <img width="400" alt="status_pending_img" src="https://github.com/user-attachments/assets/b659c302-eae1-4560-b7a9-14eb3a822a2f">
-
- When the server is ready, you should see an output like this:
+ Note that there are other parameters that can also be added to the config but are not shown in this example; check the [`ModelConfig`](vec_inf/client/config.py) for details.
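Putting the custom-model steps above together, a launch might look like the following sketch; the model key `my-model` is a placeholder for whatever entry you define under `models:` in your YAML, not a name taken from this package.

```bash
# Point vec-inf at the custom model configuration file
export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml

# Launch the model key defined in that file ("my-model" is a placeholder),
# optionally overriding vLLM engine arguments as shown earlier
vec-inf launch my-model --vllm-args '--max-model-len=4096'
```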

- <img width="400" alt="status_ready_img" src="https://github.com/user-attachments/assets/672986c2-736c-41ce-ac7c-1fb585cdcb0d">
+ #### Other commands

- There are 5 possible states:
+ * `status`: Check the model status by providing its Slurm job ID, `--json-mode` supported.
+ * `metrics`: Stream performance metrics to the console.
+ * `shutdown`: Shut down a model by providing its Slurm job ID.
+ * `list`: List all available model names, or view the default/cached configuration of a specific model, `--json-mode` supported.

- * **PENDING**: Job submitted to Slurm, but not executed yet. Job pending reason will be shown.
- * **LAUNCHING**: Job is running but the server is not ready yet.
- * **READY**: Inference server running and ready to take requests.
- * **FAILED**: Inference server in an unhealthy state. Job failed reason will be shown.
- * **SHUTDOWN**: Inference server is shutdown/cancelled.
+ For more details on the usage of these commands, refer to the [User Guide](https://vectorinstitute.github.io/vector-inference/user_guide/).
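As a quick sketch of the command forms summarized in the list above, using the placeholder Slurm job ID `15373800` that appears elsewhere in this README:

```bash
# List available models, or show the default/cached config of a specific model
# (`--json-mode` can be appended for machine-readable output)
vec-inf list
vec-inf list Meta-Llama-3.1-70B-Instruct

# Check status and stream metrics for a running job (the job ID is a placeholder)
vec-inf status 15373800
vec-inf metrics 15373800

# Shut the server down when finished
vec-inf shutdown 15373800
```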

- Note that the base URL is only available when model is in `READY` state, and if you've changed the Slurm log directory path, you also need to specify it when using the `status` command.
+ ### API

- ### `metrics` command
- Once your server is ready, you can check performance metrics by providing the Slurm job ID to the `metrics` command:
- ```bash
- vec-inf metrics 15373800
- ```
-
- And you will see the performance metrics streamed to your console, note that the metrics are updated with a 2-second interval.
-
- <img width="400" alt="metrics_img" src="https://github.com/user-attachments/assets/3ee143d0-1a71-4944-bbd7-4c3299bf0339">
-
- ### `shutdown` command
- Finally, when you're finished using a model, you can shut it down by providing the Slurm job ID:
- ```bash
- vec-inf shutdown 15373800
+ Example:

- > Shutting down model with Slurm Job ID: 15373800
+ ```python
+ >>> from vec_inf.api import VecInfClient, ModelStatus
+ >>> client = VecInfClient()
+ >>> response = client.launch_model("Meta-Llama-3.1-8B-Instruct")
+ >>> job_id = response.slurm_job_id
+ >>> status = client.get_status(job_id)
+ >>> if status.status == ModelStatus.READY:
+ ...     print(f"Model is ready at {status.base_url}")
+ >>> client.shutdown_model(job_id)
  ```

- ### `list` command
- You call view the full list of available models by running the `list` command:
- ```bash
- vec-inf list
- ```
- <img width="940" alt="list_img" src="https://github.com/user-attachments/assets/8cf901c4-404c-4398-a52f-0486f00747a3">
-
- NOTE: The above screenshot does not represent the full list of models supported.
+ For details on the usage of the API, refer to the [API Reference](https://vectorinstitute.github.io/vector-inference/api/).

- You can also view the default setup for a specific supported model by providing the model name, for example `Meta-Llama-3.1-70B-Instruct`:
- ```bash
- vec-inf list Meta-Llama-3.1-70B-Instruct
- ```
- <img width="500" alt="list_model_img" src="https://github.com/user-attachments/assets/34e53937-2d86-443e-85f6-34e408653ddb">
+ ## Check Job Configuration

- `launch`, `list`, and `status` command supports `--json-mode`, where the command output would be structured as a JSON string.
+ With every model launch, a Slurm script will be generated dynamically based on the job and model configuration. Once the Slurm job is queued, the generated Slurm script will be moved to the log directory for reproducibility, located at `$log_dir/$model_family/$model_name.$slurm_job_id/$model_name.$slurm_job_id.slurm`. In the same directory you can also find a JSON file with the same name that captures the launch configuration and will have an entry for the server URL once the server is ready.
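As a hypothetical illustration of the layout described above (the log directory, model family, model name, and job ID below are placeholders, not values taken from this package):

```bash
# Inspect the generated Slurm script and launch-configuration JSON.
# Paths follow $log_dir/$model_family/$model_name.$slurm_job_id/ as described above;
# all concrete values here are placeholders.
LOG_DIR=~/.vec-inf-logs
JOB=Meta-Llama-3.1-8B-Instruct.15373800
cat "$LOG_DIR/Meta-Llama-3.1/$JOB/$JOB.slurm"
cat "$LOG_DIR/Meta-Llama-3.1/$JOB/$JOB.json"   # gains a server URL entry once the server is ready
```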

  ## Send inference requests
+
  Once the inference server is ready, you can start sending inference requests. We provide example scripts for sending inference requests in the [`examples`](examples) folder. Make sure to update the model server URL and the model weights location in the scripts. For example, you can run `python examples/inference/llm/chat_completions.py`, and you should expect to see an output like the following:

  ```json
vec_inf-0.6.0.dist-info/RECORD ADDED
@@ -0,0 +1,25 @@
+ vec_inf/README.md,sha256=3ocJHfV3kRftXFUCdHw3B-p4QQlXuNqkHnjPPNkCgfM,543
+ vec_inf/__init__.py,sha256=bHwSIz9lebYuxIemni-lP0h3gwJHVbJnwExQKGJWw_Q,23
+ vec_inf/find_port.sh,sha256=bGQ6LYSFVSsfDIGatrSg5YvddbZfaPL0R-Bjo4KYD6I,1088
+ vec_inf/cli/__init__.py,sha256=5XIvGQCOnaGl73XMkwetjC-Ul3xuXGrWDXdYJ3aUzvU,27
+ vec_inf/cli/_cli.py,sha256=bqyLvFK4Vqoh-wAaUPg50_qYbrW-c9Cl_-YySgVk5_M,9871
+ vec_inf/cli/_helper.py,sha256=i1QvJeIT3z7me6bv2Vot5c3NY555Dgo3q8iRlxhOlZ4,13047
+ vec_inf/cli/_utils.py,sha256=23vSbmvNOWY1-W1aOAwYqNDkDDmx-5UVlCiXAtxUZ8A,1057
+ vec_inf/cli/_vars.py,sha256=V6DrJs_BuUa4yNcbBSSnMwpcyXwEBsizy3D0ubIg2fA,777
+ vec_inf/client/__init__.py,sha256=OLlUJ4kL1R-Kh-nXNbvKlAZ3mtHcnozHprVufkVCNWk,739
+ vec_inf/client/_client_vars.py,sha256=eVQjpuASd8beBjAeAbQnMRZM8nCLZMHx-x62BcXVnYA,7163
+ vec_inf/client/_exceptions.py,sha256=94Nx_5k1SriJNXzbdnwyXFZolyMutydU08Gsikawzzo,749
+ vec_inf/client/_helper.py,sha256=76OTCroNR5e3e7T2qSV_tkexDaUQsJrs8bFiMJ5NaxU,22718
+ vec_inf/client/_slurm_script_generator.py,sha256=jFgr2Pu7b_Uqli3DBvxUr9MI1-3TA6wwxg07O2rTwPs,6299
+ vec_inf/client/_utils.py,sha256=1dB2O1neEhZNk6MJbBybLQm42vsmEevA2TI0F_kGi0o,8796
+ vec_inf/client/api.py,sha256=TYn4lP5Ene8MEuXWYo6ZbGYw9aPnaMlT32SH7jLCifM,9605
+ vec_inf/client/config.py,sha256=kOhxoepsvArxRFNlwq1sLDHsxDewLwxRV1VwsL0MrGU,4683
+ vec_inf/client/models.py,sha256=JZDUMBX3XKOClaq_yJUpDUSgiDy42nT5Dq5bxQWiO2I,5778
+ vec_inf/client/slurm_vars.py,sha256=lroK41L4gEVVZNxxE3bEpbKsdMwnH79-7iCKd4zWEa4,1069
+ vec_inf/config/README.md,sha256=OlgnD_Ojei_xLkNyS7dGvYMFUzQFqjVRVw0V-QMk_3g,17863
+ vec_inf/config/models.yaml,sha256=PR91vOzINVOkAco9S-_VIXQ5Un6ekeoWz2Pj8DMR8LQ,29630
+ vec_inf-0.6.0.dist-info/METADATA,sha256=-xadTsrAR3tOfPyxTdGB9DLuhWMu_mnp_JF5Aa-1-08,9755
+ vec_inf-0.6.0.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
+ vec_inf-0.6.0.dist-info/entry_points.txt,sha256=uNRXjCuJSR2nveEqD3IeMznI9oVI9YLZh5a24cZg6B0,49
+ vec_inf-0.6.0.dist-info/licenses/LICENSE,sha256=mq8zeqpvVSF1EsxmydeXcokt8XnEIfSofYn66S2-cJI,1073
+ vec_inf-0.6.0.dist-info/RECORD,,
vec_inf/cli/_config.py DELETED
@@ -1,87 +0,0 @@
- """Model configuration."""
-
- from pathlib import Path
- from typing import Optional, Union
-
- from pydantic import BaseModel, ConfigDict, Field
- from typing_extensions import Literal
-
-
- QOS = Literal[
- "normal",
- "m",
- "m2",
- "m3",
- "m4",
- "m5",
- "long",
- "deadline",
- "high",
- "scavenger",
- "llm",
- "a100",
- ]
-
- PARTITION = Literal["a40", "a100", "t4v1", "t4v2", "rtx6000"]
-
- DATA_TYPE = Literal["auto", "float16", "bfloat16", "float32"]
-
-
- class ModelConfig(BaseModel):
- """Pydantic model for validating and managing model deployment configurations."""
-
- model_name: str = Field(..., min_length=3, pattern=r"^[a-zA-Z0-9\-_\.]+$")
- model_family: str = Field(..., min_length=2)
- model_variant: Optional[str] = Field(
- default=None, description="Specific variant/version of the model family"
- )
- model_type: Literal["LLM", "VLM", "Text_Embedding", "Reward_Modeling"] = Field(
- ..., description="Type of model architecture"
- )
- gpus_per_node: int = Field(..., gt=0, le=8, description="GPUs per node")
- num_nodes: int = Field(..., gt=0, le=16, description="Number of nodes")
- vocab_size: int = Field(..., gt=0, le=1_000_000)
- max_model_len: int = Field(
- ..., gt=0, le=1_010_000, description="Maximum context length supported"
- )
- max_num_seqs: int = Field(
- default=256, gt=0, le=1024, description="Maximum concurrent request sequences"
- )
- compilation_config: int = Field(
- default=0,
- gt=-1,
- le=4,
- description="torch.compile optimization level",
- )
- gpu_memory_utilization: float = Field(
- default=0.9, gt=0.0, le=1.0, description="GPU memory utilization"
- )
- pipeline_parallelism: bool = Field(
- default=True, description="Enable pipeline parallelism"
- )
- enforce_eager: bool = Field(default=False, description="Force eager mode execution")
- qos: Union[QOS, str] = Field(default="m2", description="Quality of Service tier")
- time: str = Field(
- default="08:00:00",
- pattern=r"^\d{2}:\d{2}:\d{2}$",
- description="HH:MM:SS time limit",
- )
- partition: Union[PARTITION, str] = Field(
- default="a40", description="GPU partition type"
- )
- data_type: Union[DATA_TYPE, str] = Field(
- default="auto", description="Model precision format"
- )
- venv: str = Field(
- default="singularity", description="Virtual environment/container system"
- )
- log_dir: Path = Field(
- default=Path("~/.vec-inf-logs").expanduser(), description="Log directory path"
- )
- model_weights_parent_dir: Path = Field(
- default=Path("/model-weights"), description="Base directory for model weights"
- )
-
- model_config = ConfigDict(
- extra="forbid", str_strip_whitespace=True, validate_default=True, frozen=True
- )
vec_inf/multinode_vllm.slurm DELETED
@@ -1,154 +0,0 @@
- #!/bin/bash
- #SBATCH --cpus-per-task=16
- #SBATCH --mem=64G
- #SBATCH --exclusive
- #SBATCH --tasks-per-node=1
-
- source ${SRC_DIR}/find_port.sh
-
- if [ "$VENV_BASE" = "singularity" ]; then
- export SINGULARITY_IMAGE=/model-weights/vec-inf-shared/vector-inference_latest.sif
- export VLLM_NCCL_SO_PATH=/vec-inf/nccl/libnccl.so.2.18.1
- module load singularity-ce/3.8.2
- singularity exec $SINGULARITY_IMAGE ray stop
- fi
-
- # Getting the node names
- nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
- nodes_array=($nodes)
-
- head_node=${nodes_array[0]}
- head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
-
- # Find port for head node
- head_node_port=$(find_available_port $head_node_ip 8080 65535)
-
- # Starting the Ray head node
- ip_head=$head_node_ip:$head_node_port
- export ip_head
- echo "IP Head: $ip_head"
-
- echo "Starting HEAD at $head_node"
- if [ "$VENV_BASE" = "singularity" ]; then
- srun --nodes=1 --ntasks=1 -w "$head_node" \
- singularity exec --nv --bind ${MODEL_WEIGHTS}:${MODEL_WEIGHTS} $SINGULARITY_IMAGE \
- ray start --head --node-ip-address="$head_node_ip" --port=$head_node_port \
- --num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_NODE}" --block &
- else
- srun --nodes=1 --ntasks=1 -w "$head_node" \
- ray start --head --node-ip-address="$head_node_ip" --port=$head_node_port \
- --num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_NODE}" --block &
- fi
-
- # Starting the Ray worker nodes
- # Optional, though may be useful in certain versions of Ray < 1.0.
- sleep 10
-
- # number of nodes other than the head node
- worker_num=$((SLURM_JOB_NUM_NODES - 1))
-
- for ((i = 1; i <= worker_num; i++)); do
- node_i=${nodes_array[$i]}
- echo "Starting WORKER $i at $node_i"
- if [ "$VENV_BASE" = "singularity" ]; then
- srun --nodes=1 --ntasks=1 -w "$node_i" \
- singularity exec --nv --bind ${MODEL_WEIGHTS}:${MODEL_WEIGHTS} $SINGULARITY_IMAGE \
- ray start --address "$ip_head" \
- --num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_NODE}" --block &
- else
- srun --nodes=1 --ntasks=1 -w "$node_i" \
- ray start --address "$ip_head" \
- --num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_NODE}" --block &
- fi
-
- sleep 5
- done
-
-
- vllm_port_number=$(find_available_port $head_node_ip 8080 65535)
-
- SERVER_ADDR="http://${head_node_ip}:${vllm_port_number}/v1"
- echo "Server address: $SERVER_ADDR"
-
- jq --arg server_addr "$SERVER_ADDR" \
- '. + {"server_address": $server_addr}' \
- "$LOG_DIR/$MODEL_NAME.$SLURM_JOB_ID/$MODEL_NAME.$SLURM_JOB_ID.json" > temp.json \
- && mv temp.json "$LOG_DIR/$MODEL_NAME.$SLURM_JOB_ID/$MODEL_NAME.$SLURM_JOB_ID.json" \
- && rm temp.json
-
- if [ "$PIPELINE_PARALLELISM" = "True" ]; then
- export PIPELINE_PARALLEL_SIZE=$SLURM_JOB_NUM_NODES
- export TENSOR_PARALLEL_SIZE=$SLURM_GPUS_PER_NODE
- else
- export PIPELINE_PARALLEL_SIZE=1
- export TENSOR_PARALLEL_SIZE=$((SLURM_JOB_NUM_NODES*SLURM_GPUS_PER_NODE))
- fi
-
- if [ "$ENFORCE_EAGER" = "True" ]; then
- export ENFORCE_EAGER="--enforce-eager"
- else
- export ENFORCE_EAGER=""
- fi
-
- if [ "$ENABLE_PREFIX_CACHING" = "True" ]; then
- export ENABLE_PREFIX_CACHING="--enable-prefix-caching"
- else
- export ENABLE_PREFIX_CACHING=""
- fi
-
- if [ "$ENABLE_CHUNKED_PREFILL" = "True" ]; then
- export ENABLE_CHUNKED_PREFILL="--enable-chunked-prefill"
- else
- export ENABLE_CHUNKED_PREFILL=""
- fi
-
- if [ -z "$MAX_NUM_BATCHED_TOKENS" ]; then
- export MAX_NUM_BATCHED_TOKENS=""
- else
- export MAX_NUM_BATCHED_TOKENS="--max-num-batched-tokens=$MAX_NUM_BATCHED_TOKENS"
- fi
-
- # Activate vllm venv
- if [ "$VENV_BASE" = "singularity" ]; then
- singularity exec --nv --bind ${MODEL_WEIGHTS}:${MODEL_WEIGHTS} $SINGULARITY_IMAGE \
- python3.10 -m vllm.entrypoints.openai.api_server \
- --model ${MODEL_WEIGHTS} \
- --served-model-name ${MODEL_NAME} \
- --host "0.0.0.0" \
- --port ${vllm_port_number} \
- --pipeline-parallel-size ${PIPELINE_PARALLEL_SIZE} \
- --tensor-parallel-size ${TENSOR_PARALLEL_SIZE} \
- --dtype ${DATA_TYPE} \
- --trust-remote-code \
- --max-logprobs ${MAX_LOGPROBS} \
- --max-model-len ${MAX_MODEL_LEN} \
- --max-num-seqs ${MAX_NUM_SEQS} \
- --gpu-memory-utilization ${GPU_MEMORY_UTILIZATION} \
- --compilation-config ${COMPILATION_CONFIG} \
- --task ${TASK} \
- ${MAX_NUM_BATCHED_TOKENS} \
- ${ENABLE_PREFIX_CACHING} \
- ${ENABLE_CHUNKED_PREFILL} \
- ${ENFORCE_EAGER}
- else
- source ${VENV_BASE}/bin/activate
- python3 -m vllm.entrypoints.openai.api_server \
- --model ${MODEL_WEIGHTS} \
- --served-model-name ${MODEL_NAME} \
- --host "0.0.0.0" \
- --port ${vllm_port_number} \
- --pipeline-parallel-size ${PIPELINE_PARALLEL_SIZE} \
- --tensor-parallel-size ${TENSOR_PARALLEL_SIZE} \
- --dtype ${DATA_TYPE} \
- --trust-remote-code \
- --max-logprobs ${MAX_LOGPROBS} \
- --max-model-len ${MAX_MODEL_LEN} \
- --max-num-seqs ${MAX_NUM_SEQS} \
- --gpu-memory-utilization ${GPU_MEMORY_UTILIZATION} \
- --compilation-config ${COMPILATION_CONFIG} \
- --task ${TASK} \
- ${MAX_NUM_BATCHED_TOKENS} \
- ${ENABLE_PREFIX_CACHING} \
- ${ENABLE_CHUNKED_PREFILL} \
- ${ENFORCE_EAGER}
- fi
vec_inf/vllm.slurm DELETED
@@ -1,90 +0,0 @@
- #!/bin/bash
- #SBATCH --cpus-per-task=16
- #SBATCH --mem=64G
-
- source ${SRC_DIR}/find_port.sh
-
- # Write server url to file
- hostname=${SLURMD_NODENAME}
- vllm_port_number=$(find_available_port $hostname 8080 65535)
-
- SERVER_ADDR="http://${hostname}:${vllm_port_number}/v1"
- echo "Server address: $SERVER_ADDR"
-
- jq --arg server_addr "$SERVER_ADDR" \
- '. + {"server_address": $server_addr}' \
- "$LOG_DIR/$MODEL_NAME.$SLURM_JOB_ID/$MODEL_NAME.$SLURM_JOB_ID.json" > temp.json \
- && mv temp.json "$LOG_DIR/$MODEL_NAME.$SLURM_JOB_ID/$MODEL_NAME.$SLURM_JOB_ID.json" \
- && rm temp.json
-
- if [ "$ENFORCE_EAGER" = "True" ]; then
- export ENFORCE_EAGER="--enforce-eager"
- else
- export ENFORCE_EAGER=""
- fi
-
- if [ "$ENABLE_PREFIX_CACHING" = "True" ]; then
- export ENABLE_PREFIX_CACHING="--enable-prefix-caching"
- else
- export ENABLE_PREFIX_CACHING=""
- fi
-
- if [ "$ENABLE_CHUNKED_PREFILL" = "True" ]; then
- export ENABLE_CHUNKED_PREFILL="--enable-chunked-prefill"
- else
- export ENABLE_CHUNKED_PREFILL=""
- fi
-
- if [ -z "$MAX_NUM_BATCHED_TOKENS" ]; then
- export MAX_NUM_BATCHED_TOKENS=""
- else
- export MAX_NUM_BATCHED_TOKENS="--max-num-batched-tokens=$MAX_NUM_BATCHED_TOKENS"
- fi
-
- # Activate vllm venv
- if [ "$VENV_BASE" = "singularity" ]; then
- export SINGULARITY_IMAGE=/model-weights/vec-inf-shared/vector-inference_latest.sif
- export VLLM_NCCL_SO_PATH=/vec-inf/nccl/libnccl.so.2.18.1
- module load singularity-ce/3.8.2
- singularity exec $SINGULARITY_IMAGE ray stop
- singularity exec --nv --bind ${MODEL_WEIGHTS}:${MODEL_WEIGHTS} $SINGULARITY_IMAGE \
- python3.10 -m vllm.entrypoints.openai.api_server \
- --model ${MODEL_WEIGHTS} \
- --served-model-name ${MODEL_NAME} \
- --host "0.0.0.0" \
- --port ${vllm_port_number} \
- --tensor-parallel-size ${SLURM_GPUS_PER_NODE} \
- --dtype ${DATA_TYPE} \
- --max-logprobs ${MAX_LOGPROBS} \
- --trust-remote-code \
- --max-model-len ${MAX_MODEL_LEN} \
- --max-num-seqs ${MAX_NUM_SEQS} \
- --gpu-memory-utilization ${GPU_MEMORY_UTILIZATION} \
- --compilation-config ${COMPILATION_CONFIG} \
- --task ${TASK} \
- ${MAX_NUM_BATCHED_TOKENS} \
- ${ENABLE_PREFIX_CACHING} \
- ${ENABLE_CHUNKED_PREFILL} \
- ${ENFORCE_EAGER}
-
- else
- source ${VENV_BASE}/bin/activate
- python3 -m vllm.entrypoints.openai.api_server \
- --model ${MODEL_WEIGHTS} \
- --served-model-name ${MODEL_NAME} \
- --host "0.0.0.0" \
- --port ${vllm_port_number} \
- --tensor-parallel-size ${SLURM_GPUS_PER_NODE} \
- --dtype ${DATA_TYPE} \
- --max-logprobs ${MAX_LOGPROBS} \
- --trust-remote-code \
- --max-model-len ${MAX_MODEL_LEN} \
- --max-num-seqs ${MAX_NUM_SEQS} \
- --gpu-memory-utilization ${GPU_MEMORY_UTILIZATION} \
- --compilation-config ${COMPILATION_CONFIG} \
- --task ${TASK} \
- ${MAX_NUM_BATCHED_TOKENS} \
- ${ENABLE_PREFIX_CACHING} \
- ${ENABLE_CHUNKED_PREFILL} \
- ${ENFORCE_EAGER}
- fi
vec_inf-0.5.0.dist-info/RECORD DELETED
@@ -1,17 +0,0 @@
- vec_inf/README.md,sha256=dxX0xKfwLioG0mJ2YFv5JJ5q1m5NlWBrVBOap1wuHfQ,624
- vec_inf/__init__.py,sha256=bHwSIz9lebYuxIemni-lP0h3gwJHVbJnwExQKGJWw_Q,23
- vec_inf/find_port.sh,sha256=bGQ6LYSFVSsfDIGatrSg5YvddbZfaPL0R-Bjo4KYD6I,1088
- vec_inf/multinode_vllm.slurm,sha256=V01ayfgObPdxbQqhYvCbNIx0zqpLurDxZhS0UHYNFi0,5210
- vec_inf/vllm.slurm,sha256=VMMTdVUOtX4-Yv43yzgKiEpE56fMwuR0KOLf3Dar_S0,2884
- vec_inf/cli/__init__.py,sha256=5XIvGQCOnaGl73XMkwetjC-Ul3xuXGrWDXdYJ3aUzvU,27
- vec_inf/cli/_cli.py,sha256=8Gk4NRbrY2-3EX0S8_-1UOmGfahzqX0A2sJNVcL7OL8,6525
- vec_inf/cli/_config.py,sha256=pb2ERbxZoRZBa9Ie7-jlzyQiiXYZgqUetbw13Blryho,2841
- vec_inf/cli/_helper.py,sha256=9niBuFoaJfeP2yRHKcrkia4rwCZbAfoz-4s2MVCwA9w,26871
- vec_inf/cli/_utils.py,sha256=i3mffIJ-wBVpe3pz0mYa2W5J42yNRmUG2tQwABhQJDQ,5365
- vec_inf/config/README.md,sha256=3MYYY3hGw7jiMR5A8CMdBkhGFobcdH3Kip5E2saq_T4,18609
- vec_inf/config/models.yaml,sha256=p__omZEmoF93BtVQXQia43xvPX88ALrBT1tMiY2Bdhk,28787
- vec_inf-0.5.0.dist-info/METADATA,sha256=khBhIsW5hFjrp-ZQFU3wa5htLVkp4CH3JnrhLXeHfss,10228
- vec_inf-0.5.0.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
- vec_inf-0.5.0.dist-info/entry_points.txt,sha256=uNRXjCuJSR2nveEqD3IeMznI9oVI9YLZh5a24cZg6B0,49
- vec_inf-0.5.0.dist-info/licenses/LICENSE,sha256=mq8zeqpvVSF1EsxmydeXcokt8XnEIfSofYn66S2-cJI,1073
- vec_inf-0.5.0.dist-info/RECORD,,