1-bit LLMs are an important innovation in the area of large language models. Unlike traditional LLMs that use 32-bit or 16-bit floating point numbers to represent weights and activations, 1-bit LLMs quantise these values down to a single bit. This drastically reduces the computational footprint and increases inferencing speed. Recently, Microsoft released bitnet.cpp, a framework for fast and lossless inferencing over 1-bit LLMs. This article covers it in depth and explains its utility in the LLM landscape.
Table of Contents
- Understanding 1-bit LLMs
- Overview of Lossless Inferencing through bitnet.cpp
- Hands-on Implementation of bitnet.cpp
Let’s understand the process of 1-bit LLM inferencing through bitnet.cpp in depth.
Understanding 1-bit LLMs
1-bit LLMs are a unique approach to large language models that drastically reduces the computational resources required for training and inference. Unlike traditional LLMs that use 32-bit or 16-bit floating point numbers, 1-bit LLMs quantise weights and activations to a single bit, significantly compressing the model size and accelerating inference.
Let’s understand the difference between traditional and 1-bit LLMs with a simple comparison:
Traditional LLMs –
- Weight Precision = 32 bits = 4 bytes
- Parameters = 7 Billion
- Inference memory estimation = (7,000,000,000 * 4) / (1024^3) GB ≈ 26.077 GB
1-bit LLMs –
- Weight Precision = 1 bit = 0.125 bytes
- Parameters = 7 Billion
- Inference memory estimation = (7,000,000,000 * 0.125) / (1024^3) GB ≈ 0.815 GB
We can see the difference in the computational and storage resources required by traditional (32-bit) LLMs and 1-bit LLMs.
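These back-of-the-envelope estimates are easy to reproduce. The short Python sketch below recomputes both figures; it counts weight storage only and ignores activations, the KV cache and any runtime overhead.

def weight_memory_gb(num_params, bits_per_weight):
    # bits -> bytes -> GB, dividing by 1024^3 as in the estimates above
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / (1024 ** 3)

params = 7_000_000_000  # 7 billion parameters

print(f"32-bit weights: {weight_memory_gb(params, 32):.3f} GB")  # ~26.077 GB
print(f"1-bit weights:  {weight_memory_gb(params, 1):.3f} GB")   # ~0.815 GB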
A 1-bit LLM variant, BitNet b1.58, uses 1.58 bits per weight and stores weights in a ternary format, meaning each weight can be -1, 0 or +1. With this format, the matrix multiplications of a standard transformer model reduce to simple additions and subtractions, making inference far less computationally intensive.
1-bit LLMs provide a Pareto solution to reduce inference cost (Source)
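To get an intuition for how full-precision weights end up in this ternary format, here is a minimal NumPy sketch of the absmean quantisation function described in the BitNet b1.58 paper. It is illustrative only: in the actual models the ternary weights are learned during training rather than produced by converting a finished full-precision model.

import numpy as np

def absmean_ternary(W, eps=1e-8):
    # Scale the matrix by its mean absolute value, then round and clip every
    # entry to the ternary set {-1, 0, +1} (absmean quantisation)
    gamma = np.abs(W).mean()
    return np.clip(np.round(W / (gamma + eps)), -1, 1)

W = np.random.randn(4, 4).astype(np.float32)  # full-precision weights
print(absmean_ternary(W))                     # every entry is -1.0, 0.0 or 1.0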
BitNet b1.58 is based on the BitNet architecture and retains all the benefits of the original 1-bit BitNet, including a new computation paradigm that requires almost no multiplication operations for matrix multiplication and can be highly optimised. It also has the same energy consumption as the original 1-bit model.
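The sketch below illustrates the no-multiplication point (it is not the actual bitnet.cpp kernel): with weights restricted to -1, 0 and +1, each element of a matrix-vector product becomes a signed sum of activations, so no multiplications are needed.

import numpy as np

def ternary_matvec(W, x):
    # Compute W @ x where W contains only -1, 0 and +1: add the activations
    # where the weight is +1, subtract where it is -1, skip where it is 0
    out = np.zeros(W.shape[0], dtype=x.dtype)
    for i in range(W.shape[0]):
        out[i] = x[W[i] == 1].sum() - x[W[i] == -1].sum()
    return out

W = np.array([[1, 0, -1], [0, 1, 1]])  # ternary weight matrix
x = np.array([0.5, -2.0, 3.0])         # activation vector
print(ternary_matvec(W, x))            # matches the standard product W @ x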
Overview of Lossless Inferencing through bitnet.cpp
The official inference framework for 1-bit LLMs such as BitNet b1.58 is bitnet.cpp, which Microsoft recently open-sourced. It offers a set of optimised kernels that support fast and lossless inference of 1.58-bit models on the CPU.
bitnet.cpp (https://github.com/microsoft/BitNet) achieves significant speedups ranging from 2.37x to 6.17x on x86 CPUs, with energy reductions between 71.9% and 82.8%. On ARM CPUs it achieves speedups ranging from 1.37x to 5.07x across different model sizes, with energy consumption reduced by 55.4% to 70%, further boosting overall efficiency.
Inference speed and energy consumption for different BitNet b1.58 model sizes (Apple M2 Ultra) (Source)
Inference speed and energy consumption for different BitNet b1.58 model sizes (Intel i7-13700H) (Source)
These optimised kernels are designed for fast and lossless inference of 1.58-bit models on both ARM and x86 architectures. bitnet.cpp was evaluated on both inference speed and energy consumption, and it demonstrates a significant improvement over llama.cpp on both architectures, especially as the model size increases.
Hands-on Implementation of bitnet.cpp
Step 1: Cloning the official repository –
!git clone --recursive https://github.com/microsoft/BitNet.git
Step 2: Changing the present working directory –
%cd /content/BitNet/
Step 3: Installing the required dependencies using the requirements.txt –
!pip install -r requirements.txt
Step 4: Downloading the 1-bit model from HuggingFace and converting it into GGUF format –
!huggingface-cli download 1bitLLM/bitnet_b1_58-large --local-dir bitnet_b1_58-large
!python setup_env.py -md bitnet_b1_58-large
Step 5: Running inference with the run_inference.py script –
!python run_inference.py -m bitnet_b1_58-large/ggml-model-i2_s.gguf -p "John went to eat dinner while his friend Janardan was playing on PlayStation. While playing Playstation, Janardan called his friend Johny. Who is Johny to John?\nAnswer:" -n 50 -temp 0
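In this command, -m points to the converted GGUF model, -p supplies the prompt, -n caps the number of tokens to generate and -temp sets the sampling temperature (0 keeps the output deterministic).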
Output –
Answer: Johny is a friend of John.
Final Words
1-bit LLMs are significantly smaller than traditional LLMs, and the reduced precision also leads to faster computation, especially when the hardware is optimised for bitwise operations. bitnet.cpp offers a huge improvement in the form of optimised kernels for inferencing 1-bit LLMs, supporting faster inference and lower energy consumption. It is still in its infancy but is poised to bring a significant change to LLM inferencing and energy reduction.