vLLM is an open-source library designed to optimize the inference and serving of large language models (LLMs). Originally developed at UC Berkeley's Sky Computing Lab, it has evolved into a community-driven project. The library is built around the innovative PagedAttention algorithm, which significantly improves memory management by reducing waste in Key-Value (KV) cache memory.
Refer to Getting Started on Aurora for additional information. In particular, you need to set the environment variables that provide access to the proxy host.
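For example, the following proxy settings (the same ones used in the cluster setup script later on this page) enable outbound access from Aurora compute nodes:

# ALCF proxy settings for outbound network access from compute nodes
export HTTP_PROXY="http://proxy.alcf.anl.gov:3128"
export HTTPS_PROXY="http://proxy.alcf.anl.gov:3128"
export http_proxy="http://proxy.alcf.anl.gov:3128"
export https_proxy="http://proxy.alcf.anl.gov:3128"
export ftp_proxy="http://proxy.alcf.anl.gov:3128"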
Note
The instructions below should be run directly from a compute node. For example, to request an interactive job (from aurora-uan):
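A minimal sketch of such a request is shown below; the queue name, walltime, filesystems, and project name are placeholders that you should adjust for your own allocation:

# Request one interactive node for one hour (adjust queue, walltime, filesystems, and project)
qsub -l select=1 -l walltime=01:00:00 -l filesystems=home:flare -A <ProjectName> -q debug -I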
To ensure your workflows utilize the preloaded model weights and datasets, update the following environment variables in your session. Some models hosted on Hugging Face may be gated, requiring additional authentication. To access these gated models, you will need a Hugging Face authentication token.
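For example, to point the Hugging Face caches at the preloaded model-weights directory and supply your token (replace YOUR_HF_TOKEN with your own token):

# Use the preloaded model weights and datasets, and authenticate to Hugging Face
export HF_HOME="/flare/datascience/model-weights/hub"
export HF_DATASETS_CACHE="/flare/datascience/model-weights/hub"
export HF_TOKEN="YOUR_HF_TOKEN"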
For small models that fit within a single tile's memory (64 GB), no additional configuration is required to serve the model. Simply set TP=1 (Tensor Parallelism). This configuration ensures the model is run on a single tile without the need for distributed setup. Models with fewer than 7 billion parameters typically fit within a single tile. To utilize multiple tiles for larger models (TP>1), a more advanced setup is necessary. This involves configuring a Ray cluster and setting the ZE_FLAT_DEVICE_HIERARCHY environment variable:
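A sketch of this single-node, multi-tile setup is shown below; the vllm serve flags are illustrative, and --tensor-parallel-size must match the number of tiles you intend to use:

# Expose each tile as a separate device
export ZE_FLAT_DEVICE_HIERARCHY=FLAT

# Start a single-node Ray cluster (8 tiles, 64 CPU cores)
ray start --head --num-gpus=8 --num-cpus=64

# Serve the model across 8 tiles with tensor parallelism
vllm serve meta-llama/Llama-3.3-70B-Instruct \
    --tensor-parallel-size 8 \
    --distributed-executor-backend ray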
These commands set up a Ray cluster and serve meta-llama/Llama-3.3-70B-Instruct on 8 tiles of a single node. Models with up to 70 billion parameters can usually fit within a single node, utilizing multiple tiles.
The following example serves the meta-llama/Llama-3.1-405B-Instruct model using 2 nodes with TP=8 and PP=2. Models exceeding 70 billion parameters generally require more than one Aurora node. First, use the setup_ray_cluster.sh script to set up a Ray cluster across the nodes:
#!/bin/bash

########################################################################
# FUNCTIONS
########################################################################

# Setup environment and variables needed to set up Ray and vLLM
setup_environment() {
    echo "[$(hostname)] Setting up the environment..."

    # Set proxy configurations
    export HTTP_PROXY="http://proxy.alcf.anl.gov:3128"
    export HTTPS_PROXY="http://proxy.alcf.anl.gov:3128"
    export http_proxy="http://proxy.alcf.anl.gov:3128"
    export https_proxy="http://proxy.alcf.anl.gov:3128"
    export ftp_proxy="http://proxy.alcf.anl.gov:3128"

    # Define the common setup script path (make sure this file is accessible on all nodes)
    export COMMON_SETUP_SCRIPT="/path/to/setup_ray_cluster.sh"

    # Load modules and activate your conda environment
    module load frameworks
    conda activate vllm_0125
    module unload oneapi/eng-compiler/2024.07.30.002
    module use /opt/aurora/24.180.3/spack/unified/0.8.0/install/modulefiles/oneapi/2024.07.30.002
    module use /soft/preview/pe/24.347.0-RC2/modulefiles
    module add oneapi/release

    # Collective-communication and device-hierarchy settings
    export TORCH_LLM_ALLREDUCE=1
    export CCL_ZE_IPC_EXCHANGE=drmfd
    export ZE_FLAT_DEVICE_HIERARCHY=FLAT

    # Hugging Face cache, token, and temporary-directory settings
    export HF_TOKEN="YOUR_HF_TOKEN"
    export HF_HOME="/flare/datascience/model-weights/hub"
    export HF_DATASETS_CACHE="/flare/datascience/model-weights/hub"
    export TMPDIR="/tmp"
    export RAY_TMPDIR="/tmp"
    export VLLM_IMAGE_FETCH_TIMEOUT=60
    ulimit -c unlimited

    # Derive the node's HSN IP address (modify the getent command as needed)
    export HSN_IP_ADDRESS=$(getent hosts "$(hostname).hsn.cm.aurora.alcf.anl.gov" | awk '{ print $1 }' | sort | head -n 1)
    export VLLM_HOST_IP="$HSN_IP_ADDRESS"

    echo "[$(hostname)] Environment setup complete. HSN_IP_ADDRESS is $HSN_IP_ADDRESS"
}

# Stop any running Ray processes
stop_ray() {
    echo "[$(hostname)] Stopping Ray (if running)..."
    ray stop -f
}

# Start Ray head node
start_ray_head() {
    echo "[$(hostname)] Starting Ray head..."
    ray start --num-gpus=8 --num-cpus=64 --head --node-ip-address="$HSN_IP_ADDRESS" --temp-dir=/tmp

    # Wait until Ray reports that the head node is up
    echo "[$(hostname)] Waiting for Ray head to be up..."
    until ray status &> /dev/null; do
        sleep 5
        echo "[$(hostname)] Waiting for Ray head..."
    done
    echo "[$(hostname)] ray status: $(ray status)"
    echo "[$(hostname)] Ray head node is up."
}

# Start Ray worker node
start_ray_worker() {
    echo "[$(hostname)] Starting Ray worker, connecting to head at $RAY_HEAD_IP..."
    echo "HSN IP Address : $HSN_IP_ADDRESS"
    ray start --num-gpus=8 --num-cpus=64 --address="$RAY_HEAD_IP:6379" --node-ip-address="$HSN_IP_ADDRESS" --temp-dir=/tmp

    echo "[$(hostname)] Waiting for Ray worker to be up..."
    until ray status &> /dev/null; do
        sleep 5
        echo "[$(hostname)] Waiting for Ray worker..."
    done
    echo "[$(hostname)] ray status: $(ray status)"
    echo "[$(hostname)] Ray worker node is up."
}

########################################################################
# MAIN SCRIPT LOGIC
########################################################################
main() {
    # Ensure that the script is being run within a PBS job
    if [ -z "$PBS_NODEFILE" ]; then
        echo "Error: PBS_NODEFILE not set. This script must be run within a PBS job allocation."
        exit 1
    fi

    # Read all nodes from the PBS_NODEFILE into an array.
    mapfile -t nodes_full < "$PBS_NODEFILE"
    num_nodes=${#nodes_full[@]}
    echo "Allocated nodes ($num_nodes):"
    printf " - %s\n" "${nodes_full[@]}"

    # Require at least 2 nodes (one head + one worker)
    if [ "$num_nodes" -lt 2 ]; then
        echo "Error: Need at least 2 nodes to launch the Ray cluster."
        exit 1
    fi

    # The first node will be our Ray head.
    head_node_full="${nodes_full[0]}"
    # All remaining nodes will be the workers.
    worker_nodes_full=("${nodes_full[@]:1}")

    # It is a good idea to run this master script on the designated head node.
    current_node=$(hostname -f)
    echo "[$(hostname)] Running on head node."

    # --- Setup and start the head node ---
    setup_environment
    stop_ray
    start_ray_head

    # Export the head node's IP so that workers can join.
    export RAY_HEAD_IP="$HSN_IP_ADDRESS"
    echo "[$(hostname)] RAY_HEAD_IP exported as $RAY_HEAD_IP"

    # --- Launch Ray workers on each of the other nodes via SSH ---
    # Each worker sources COMMON_SETUP_SCRIPT (defined in setup_environment) so that
    # the same environment and functions are available on every node.
    for worker in "${worker_nodes_full[@]}"; do
        echo "[$(hostname)] Launching Ray worker on $worker..."
        ssh "$worker" "bash -l -c 'set -x; export RAY_HEAD_IP=${RAY_HEAD_IP}; export COMMON_SETUP_SCRIPT=${COMMON_SETUP_SCRIPT}; source \$COMMON_SETUP_SCRIPT; setup_environment; stop_ray; start_ray_worker'" &
    done

    # Wait for all background SSH jobs to finish.
    wait

    echo "[$(hostname)] Ray cluster is up and running with $num_nodes nodes."
}

main
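Once the Ray cluster spans both nodes, the model can be served from the head node. A minimal sketch is shown below; the --max-model-len value is illustrative (see the note that follows), and the other flags should be adjusted to match your configuration:

# Serve the 405B model across 2 nodes: TP=8 within a node, PP=2 across nodes
vllm serve meta-llama/Llama-3.1-405B-Instruct \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2 \
    --max-model-len 4096 \
    --distributed-executor-backend ray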
Setting --max-model-len is important in order to fit this model on 2 nodes. To use higher --max-model-len values, you will need additional nodes.