Published: Oct 23, 2024
Trial

FastChat is an open platform for training, serving and evaluating large language models. Our teams use its model-serving capabilities to host multiple models (Llama 3.1 8B and 70B, Mistral 7B and Llama-SQL) for different purposes, all behind a consistent OpenAI-compatible API.

FastChat operates on a controller-worker architecture, which allows multiple workers to host different models. It supports several worker types, including vLLM, LiteLLM and MLX; we use vLLM model workers for their high throughput. Depending on whether a use case prioritizes latency or throughput, workers can be created and scaled accordingly. For example, the model behind code suggestions in developer IDEs requires low latency and is scaled across multiple FastChat workers to handle concurrent requests efficiently, whereas the model used for text-to-SQL doesn't need multiple workers, given its lower demand and different performance requirements.

Our teams also leverage FastChat's scaling capabilities for A/B testing. We configure workers with the same model but different hyperparameter values, pose identical questions to each and identify the optimal values. When transitioning models in live services, we run A/B tests to ensure a seamless migration. For example, when we recently migrated from CodeLlama 70B to Llama 3.1 70B for code suggestions, we ran both models concurrently and compared their outputs, verifying that the new model met or exceeded the previous model's performance without disrupting the developer experience.
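Because every model sits behind the same OpenAI-compatible API, client code stays identical no matter which worker serves the request. The following is a minimal sketch of what querying such a deployment could look like; the endpoint address, placeholder API key and registered model name are illustrative assumptions, not details from this blip.

```python
# Minimal sketch: querying a model served by FastChat through its
# OpenAI-compatible API server. The base_url, key and model name are
# assumptions for illustration.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed address of the FastChat OpenAI API server
    api_key="EMPTY",                      # FastChat accepts a placeholder key by default
)

response = client.chat.completions.create(
    model="llama-3.1-8b",  # assumed name the worker registered with the controller
    messages=[{"role": "user", "content": "Write a SQL query that counts orders per day."}],
)
print(response.choices[0].message.content)
```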
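The A/B-testing pattern described above can be as simple as sending identical prompts to two registered workers and comparing their outputs side by side. The sketch below assumes the old and new code-suggestion models are both registered with the controller; the worker names and prompt are hypothetical.

```python
# Hedged sketch of the A/B comparison described above: identical prompts go
# to two workers (here, the old and new code-suggestion models) and the
# outputs are printed for comparison. Names and prompts are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

variants = ["codellama-70b", "llama-3.1-70b"]  # assumed registered worker names
prompts = ["Complete this function:\ndef binary_search(items, target):"]

for prompt in prompts:
    for model in variants:
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        print(f"--- {model} ---")
        print(reply.choices[0].message.content)
```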
