A Practitioner’s Guide to Inferencing over 1-bit LLMs Using bitnet.cpp

Explore 1-bit LLMs and bitnet.cpp for faster, more efficient inferencing in large language models.

1-bit LLMs are an important innovation in the area of large language models. Unlike traditional LLMs, which use 32-bit or 16-bit floating-point numbers to represent weights and activations, 1-bit LLMs quantise these values to just 1 bit. This drastically reduces the computational footprint and increases inferencing speed. Recently, Microsoft released bitnet.cpp, a framework for faster, lossless inferencing over 1-bit LLMs. This article covers it in depth and explains its utility in the LLM landscape.

Table of Contents

  1. Understanding 1-bit LLMs
  2. Overview of Lossless Inferencing through bitnet.cpp 
  3. Hands-on Implementation of bitnet.cpp

Let’s understand the process of 1-bit LLM inferencing through bitnet.cpp in depth. 

Understanding 1-bit LLMs

1-bit LLMs take a unique approach to large language models, drastically reducing the computational resources required for training and inference. Unlike traditional LLMs that use 32-bit or 16-bit floating-point numbers, 1-bit LLMs quantise weights and activations to only 1 bit, significantly compressing the model size and accelerating inference.

Let’s understand the difference between traditional and 1-bit LLMs with a simple comparison: 

Traditional LLMs –

  • Weight Precision = 32 bits = 4 bytes
  • Parameters = 7 Billion
  • Inference memory estimate = (7,000,000,000 × 4) / 1024³ ≈ 26.077 GB

1-bit LLMs – 

  • Weight Precision = 1 bit = 0.125 bytes
  • Parameters = 7 Billion
  • Inference memory estimate = (7,000,000,000 × 0.125) / 1024³ ≈ 0.815 GB

We can see the difference in the computational and storage resources required between traditional (32-bit) LLMs and 1-bit LLMs. 
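
The same back-of-the-envelope arithmetic can be reproduced with a few lines of Python (a minimal sketch; the parameter count and precisions are the ones used above):

    # Estimate the memory needed to hold model weights at a given precision.
    def weight_memory_gb(num_params, bits_per_weight):
        total_bytes = num_params * bits_per_weight / 8  # bits -> bytes
        return total_bytes / (1024 ** 3)                # bytes -> GB

    params = 7_000_000_000  # 7 billion parameters
    print(f"32-bit weights: {weight_memory_gb(params, 32):.3f} GB")  # ~26.077 GB
    print(f"1-bit weights:  {weight_memory_gb(params, 1):.3f} GB")   # ~0.815 GB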

A 1-bit LLM variant, BitNet b1.58, uses 1.58 bits per weight (log₂3 ≈ 1.58, since each weight takes one of three values) and stores weights in a ternary format {-1, 0, +1}, meaning a weight can be -1, 0 or +1. As a result, the matrix multiplications at the heart of a normal transformer are replaced by simple additions and subtractions, making the model far less computationally intensive, as the sketch below illustrates.
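
To make this concrete, here is a small illustrative NumPy sketch (not BitNet’s actual kernel code): with ternary weights, each output element is just a signed sum of activations, so no multiplications are required.

    import numpy as np

    # A toy ternary weight matrix: every entry is -1, 0, or +1.
    W = np.array([[ 1, 0, -1],
                  [-1, 1,  0]])
    x = np.array([0.5, -2.0, 3.0])  # input activations

    # Standard matrix multiplication, as in a floating-point transformer.
    y_matmul = W @ x

    # Multiplication-free equivalent: add activations where the weight is +1
    # and subtract them where the weight is -1 (zeros are simply skipped).
    y_addsub = np.array([x[row == 1].sum() - x[row == -1].sum() for row in W])

    assert np.allclose(y_matmul, y_addsub)  # identical results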

1-bit LLMs provide a Pareto solution to reduce inference cost (Source)

BitNet b1.58 is based on the BitNet architecture and retains all the benefits of the original 1-bit BitNet, including a new computation paradigm that requires almost no multiplication operations for matrix multiplication and can be highly optimised. It also matches the energy consumption of the original 1-bit model.

Overview of Lossless Inferencing through bitnet.cpp

The official inference framework for 1-bit LLMs such as BitNet b1.58 is bitnet.cpp, which Microsoft recently open-sourced. It offers a set of optimised kernels that support fast and lossless inference of 1.58-bit models on CPUs.

bitnet.cpp (https://github.com/microsoft/BitNet) achieves significant speedups on x86 CPUs, ranging from 2.37x to 6.17x, with energy reductions between 71.9% and 82.8%. On ARM CPUs, it achieves speedups from 1.37x to 5.07x across different model sizes, with energy consumption reduced by 55.4% to 70%, further boosting overall efficiency.

Inference speed and energy consumption for different BitNet b1.58 model sizes (Apple M2 Ultra) (Source)

Inference speed and energy consumption for different BitNet b1.58 model sizes (Intel i7-13700H) (Source)

These optimised kernels are designed for fast, lossless inference of 1.58-bit models on both ARM and x86 architectures. bitnet.cpp was evaluated on both inference speed and energy consumption, and it demonstrates a significant improvement over llama.cpp on both architectures, with the gap widening as model size increases.

Hands-on Implementation of bitnet.cpp

Step 1: Cloning the official repository – 
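
The clone command looks like this (the --recursive flag pulls in the bundled llama.cpp submodule that the build depends on):

    git clone --recursive https://github.com/microsoft/BitNet.git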

Step 2: Changing the present working directory – 
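
Move into the cloned repository:

    cd BitNet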

Step 3: Installing the required dependencies using the requirements.txt – 
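
The repository’s README also suggests creating a fresh conda environment before installing; a sketch of both steps:

    # optional: isolate the dependencies in a dedicated environment
    conda create -n bitnet-cpp python=3.9
    conda activate bitnet-cpp

    pip install -r requirements.txt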

Step 4: Downloading the 1-bit model from Hugging Face and converting it into GGUF format –
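
One way to do this is with the repository’s setup_env.py helper, which downloads the model and prepares a quantised GGUF build. The Hugging Face repo name and the i2_s quantisation type below are taken from the project’s README examples; check the current README for the supported options:

    python setup_env.py --hf-repo HF1BitLLM/Llama3-8B-1.58-100B-tokens -q i2_s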

Step 5: Running inference with the run_inference.py script –
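
A sketch of the inference call (the prompt here is a made-up example; -m points at the GGUF model produced in Step 4, -p is the prompt, -n the number of tokens to generate, and -temp the sampling temperature, following the repository’s README):

    python run_inference.py \
        -m models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf \
        -p "Explain 1-bit LLMs in one sentence." \
        -n 64 -temp 0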

Output – the generated completion is printed to the terminal.

Final Words

1-bit LLMs are significantly smaller than traditional LLMs, and the reduced precision also leads to faster computation, especially when the hardware is optimised for bitwise operations. bitnet.cpp’s optimised kernels offer a substantial improvement for 1-bit LLM inferencing, supporting faster inference and lower energy consumption. It is still in its infancy but is poised to bring significant change to LLM inferencing and its energy footprint.

References

  1. Link to Code
  2. The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
  3. 1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs
