.. _deploying_with_dstack:

Deploying with dstack
=====================


vLLM can be run on a cloud-based GPU machine with `dstack <https://dstack.ai/>`__, an open-source framework for running LLMs on any cloud. This tutorial assumes that you have already configured credentials, a gateway, and GPU quotas on your cloud environment.

To install the dstack client and start the dstack server, run:

.. code-block:: console

    $ pip install "dstack[all]"
    $ dstack server

Next, to configure your dstack project, run:

.. code-block:: console

    $ mkdir -p vllm-dstack
    $ cd vllm-dstack
    $ dstack init

Next, to provision a VM instance with the LLM of your choice (``NousResearch/Llama-2-7b-chat-hf`` for this example), create the following ``serve.dstack.yml`` file for the dstack ``Service``:

.. code-block:: yaml

    type: service

    python: "3.11"
    env:
      - MODEL=NousResearch/Llama-2-7b-chat-hf
    port: 8000
    resources:
      gpu: 24GB
    commands:
      - pip install vllm
      - vllm serve $MODEL --port 8000
    model:
      format: openai
      type: chat
      name: NousResearch/Llama-2-7b-chat-hf

Then, run the following CLI for provisioning:

.. code-block:: console

    $ dstack run . -f serve.dstack.yml
    ⠸ Getting run plan...
     Configuration  serve.dstack.yml
     Project        deep-diver-main
     User           deep-diver
     Min resources  2..xCPU, 8GB.., 1xGPU (24GB)
     Max price      -
     Max duration   -
     Spot policy    auto
     Retry policy   no

     #  BACKEND  REGION       INSTANCE       RESOURCES                                SPOT  PRICE
     1  gcp      us-central1  g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
     2  gcp      us-east1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
     3  gcp      us-west1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
        ...
     Shown 3 of 193 offers, $5.876 max

    Continue? [y/n]: y
    ⠙ Submitting run...
    ⠏ Launching spicy-treefrog-1 (pulling)
    spicy-treefrog-1 provisioning completed (running)
    Service is published at ...

After the provisioning, you can interact with the model by using the OpenAI SDK. The ``<gateway-domain>`` and ``<dstack-token>`` values below are placeholders for your own gateway domain and dstack access token:

.. code-block:: python

    from openai import OpenAI

    client = OpenAI(
        base_url="https://gateway.<gateway-domain>",  # the gateway configured for your dstack project
        api_key="<dstack-token>"                      # your dstack access token
    )

    completion = client.chat.completions.create(
        model="NousResearch/Llama-2-7b-chat-hf",
        messages=[
            {
                "role": "user",
                "content": "Compose a poem that explains the concept of recursion in programming.",
            }
        ]
    )

    print(completion.choices[0].message.content)

.. note::

    dstack automatically handles authentication on the gateway using dstack's tokens. If you don't want to configure a gateway, you can provision a dstack ``Task`` instead of a ``Service``. The ``Task`` is for development purposes only; a minimal sketch of a ``Task`` configuration is given below.

If you want more hands-on material on how to serve vLLM using dstack, check out `this repository <https://github.com/dstackai/dstack-examples>`__.
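For reference, here is a minimal sketch of what such a ``Task`` configuration could look like, reusing the model and resources from the ``Service`` example above. The file name ``task.dstack.yml`` is illustrative, and the exact fields should be checked against the dstack documentation for your dstack version:

.. code-block:: yaml

    type: task

    python: "3.11"
    env:
      - MODEL=NousResearch/Llama-2-7b-chat-hf
    # a task forwards ports to your local machine instead of publishing
    # the endpoint through a gateway
    ports:
      - 8000
    resources:
      gpu: 24GB
    commands:
      - pip install vllm
      - vllm serve $MODEL --port 8000

You can provision it with the same CLI as before, for example ``dstack run . -f task.dstack.yml``. While the run is active, the configured port is forwarded to your machine, so the server should be reachable at ``http://localhost:8000`` without a gateway or token.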