GGML is a C library for machine learning that enables efficient inference on commodity CPUs. It defines a binary format for distributing large language models (LLMs) and relies on quantization, a technique that stores model weights at reduced precision so LLMs can run on consumer hardware. GGML supports several quantization strategies (e.g., 4-bit, 5-bit and 8-bit quantization), each offering a different trade-off between model size, speed and output quality. A quick way to test, run and build apps with these quantized models is C Transformers, a Python binding on top of GGML that removes the inference boilerplate behind a high-level API. We've leveraged these libraries to build proofs of concept and experiments. If you're considering self-hosted LLMs, carefully assess these community-supported libraries for your organization.
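
As a minimal sketch of what that high-level API looks like, the snippet below loads a GGML-quantized model through C Transformers and generates text on CPU. The Hugging Face repo, model file name and generation parameters are illustrative assumptions, not part of the original text; substitute your own model.

```python
# Minimal sketch: run a GGML-quantized LLM on CPU via C Transformers.
# Assumes `pip install ctransformers`; the repo and file below are
# stand-ins for whichever GGML model you choose.
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GGML",               # Hugging Face repo with GGML files (assumed)
    model_file="llama-2-7b.ggmlv3.q4_0.bin",  # 4-bit quantized weights (assumed file name)
    model_type="llama",                       # model architecture to load
)

# A single call handles tokenization, sampling and decoding,
# which is the boilerplate the wrapper takes away.
print(llm("Self-hosted LLMs are", max_new_tokens=64, temperature=0.7))
```

In practice, exploring the quantization trade-offs mentioned above is just a matter of pointing `model_file` at a differently quantized variant of the same model (e.g., a `q5_0` or `q8_0` file) and comparing size, speed and output quality.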