Large language models (LLMs) generally require significant GPU infrastructure to operate, but there has been a strong push to get them running on more modest hardware. Quantization of a large model can reduce memory requirements, allowing a high-fidelity model to run on less expensive hardware or even a CPU. Efforts such as llama.cpp make it possible to run LLMs on hardware including Raspberry Pis, laptops and commodity servers.
Many organizations are deploying self-hosted LLMs. This is often due to security or privacy concerns, or, sometimes, a need to run models on edge devices. Open-source examples include GPT-J, GPT-JT and Llama. This approach offers better control of the model in fine-tuning for a specific use case, improved security and privacy as well as offline access. Although we've helped some of our clients self-host open-source LLMs for code completion, we recommend you carefully assess the organizational capabilities and the cost of running such LLMs before making the decision to self-host.
Large language models (LLMs) generally require significant GPU infrastructure to operate. We're now starting to see ports, like llama.cpp, that make it possible to run LLMs on different hardware — including Raspberry Pis, laptops and commodity servers. As such, self-hosted LLMs are now a reality, with open-source examples including GPT-J, GPT-JT and LLaMA. This approach has several benefits, offering better control in fine-tuning for a specific use case, improved security and privacy as well as offline access. However, you should carefully assess the capability within the organization and the cost of running such LLMs before making the decision to self-host.