vLLM is a high-throughput, memory-efficient inference engine for LLMs that can run in the cloud or on-premise. It seamlessly supports multiple model architectures and popular open-source models. Our teams deploy dockerized vLLM workers on GPU platforms like NVIDIA DGX and Intel HPC, hosting models such as Llama 3.1(8B and 70B), Mistral 7B and Llama-SQL for developer coding assistance, knowledge search and natural language database interactions. vLLM is compatible with the OpenAI SDK standard, facilitating consistent model serving. Azure's AI Model Catalog uses a custom inference container to enhance model serving performance, with vLLM as the default inference engine due to its high throughput and efficient memory management. The vLLM framework is emerging as a default for large-scale model deployments.
vLLM is a high-throughput and memory-efficient inferencing and serving engine for large language models (LLMs) that’s particularly effective thanks to its implementation of continuous batching for incoming requests. It supports several deployment options, including deployment of distributed tensor-parallel inference and serving with Ray run time, deployment in the cloud with SkyPilot and deployment with NVIDIA Triton, Docker and LangChain. Our teams have had good experience running dockerized vLLM workers in an on-prem virtual machine, integrating with OpenAI compatible API server -— which, in turn, is leveraged by a range of applications, including IDE plugins for coding assistance and chatbots. Our teams leverage vLLM for running models such as CodeLlama 70B, CodeLlama 7B and Mixtral. Also notable is the engine’s scaling capability: it only takes a couple of config changes to go from running a 7B to a 70B model. If you’re looking to productionize LLMs, vLLM is worth exploring.
 
  
                        
                    
                    
                 
    
    
  