A Practitioner’s Guide to Inferencing over 1-bit LLMs Using bitnet.cpp

Explore 1-bit LLMs and bitnet.cpp for faster, more efficient inferencing in large language models.

1-bit LLMs are an important innovation in the area of large language models. Unlike traditional LLMs, which use 32-bit or 16-bit floating-point numbers to represent weights and activations, 1-bit LLMs quantise these values to just 1 bit. This drastically reduces the computational footprint and increases inferencing speed. Recently, Microsoft released bitnet.cpp, a framework for faster, lossless inferencing over 1-bit LLMs. This article covers it in depth and explains its utility in the LLM landscape.

Table of Contents

  1. Understanding 1-bit LLMs
  2. Overview of Lossless Inferencing through bitnet.cpp 
  3. Hands-on Implementation of bitnet.cpp

Let’s understand the process of 1-bit LLM inferencing through bitnet.cpp in depth. 

Understanding 1-bit LLMs

1-bit LLMs take a unique approach to large language models, drastically reducing the computational resources required for training and inference. Unlike traditional LLMs that use 32-bit or 16-bit floating-point numbers, 1-bit LLMs quantise weights and activations to only 1 bit, significantly compressing the model size and accelerating inference.

Let’s understand the difference between traditional and 1-bit LLMs with a simple comparison: 

Traditional LLMs –

  • Weight Precision = 32 bits = 4 bytes
  • Parameters = 7 Billion
  • Inference memory estimate = (7,000,000,000 × 4) / 1024³ ≈ 26.077 GB

1-bit LLMs – 

  • Weight Precision = 1 bit = 0.125 bytes
  • Parameters = 7 Billion
  • Inference memory estimate = (7,000,000,000 × 0.125) / 1024³ ≈ 0.815 GB

We can see the difference in the computational and storage resources required between traditional (32-bit) LLMs and 1-bit LLMs. 
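
The same back-of-the-envelope arithmetic can be reproduced with a few lines of Python (a minimal sketch; the parameter count and precisions are the ones used above):

    # Estimate the memory needed to hold model weights at a given precision.
    def weight_memory_gb(num_params, bits_per_weight):
        total_bytes = num_params * bits_per_weight / 8  # bits -> bytes
        return total_bytes / (1024 ** 3)                # bytes -> GB

    params = 7_000_000_000  # 7 billion parameters
    print(f"32-bit weights: {weight_memory_gb(params, 32):.3f} GB")  # ~26.077 GB
    print(f"1-bit weights:  {weight_memory_gb(params, 1):.3f} GB")   # ~0.815 GB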

A 1-bit LLM variant, BitNet b1.58, uses 1.58 bits per weight (log₂3 ≈ 1.58, since each weight takes one of three values) and stores weights in a ternary format {-1, 0, +1}, meaning a weight can be -1, 0 or +1. As a result, the matrix multiplications at the heart of a normal transformer are replaced by simple additions and subtractions, making the model far less computationally intensive, as the sketch below illustrates.
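
To make this concrete, here is a small illustrative NumPy sketch (not BitNet’s actual kernel code): with ternary weights, each output element is just a signed sum of activations, so no multiplications are required.

    import numpy as np

    # A toy ternary weight matrix: every entry is -1, 0, or +1.
    W = np.array([[ 1, 0, -1],
                  [-1, 1,  0]])
    x = np.array([0.5, -2.0, 3.0])  # input activations

    # Standard matrix multiplication, as in a floating-point transformer.
    y_matmul = W @ x

    # Multiplication-free equivalent: add activations where the weight is +1
    # and subtract them where the weight is -1 (zeros are simply skipped).
    y_addsub = np.array([x[row == 1].sum() - x[row == -1].sum() for row in W])

    assert np.allclose(y_matmul, y_addsub)  # identical results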

1-bit LLMs provide a Pareto solution to reduce inference cost (Source)

BitNet b1.58 is based on the BitNet architecture and retains all the benefits of the original 1-bit BitNet, including a new computation paradigm that requires almost no multiplication operations for matrix multiplication and can be highly optimised. It also matches the energy consumption of the original 1-bit model.

Overview of Lossless Inferencing through bitnet.cpp

The official inference framework for 1-bit LLMs such as BitNet b1.58 is bitnet.cpp, which Microsoft recently open-sourced. It offers a set of optimised kernels that support fast and lossless inference of 1.58-bit models on CPUs.

bitnet.cpp (https://github.com/microsoft/BitNet) achieves significant speedups on x86 CPUs, ranging from 2.37x to 6.17x, with energy reductions between 71.9% and 82.8%. On ARM CPUs, it achieves speedups from 1.37x to 5.07x across different model sizes, with energy consumption reduced by 55.4% to 70%, further boosting overall efficiency.

Inference speed and energy consumption for different BitNet b1.58 model sizes (Apple M2 Ultra) (Source)

Inference speed and energy consumption for different BitNet b1.58 model sizes (Intel i7-13700H) (Source)

These optimised kernels are designed for fast, lossless inference of 1.58-bit models on both ARM and x86 architectures. bitnet.cpp was evaluated on both inference speed and energy consumption, and it demonstrates a significant improvement over llama.cpp on both architectures, with the gap widening as model size increases.

Hands-on Implementation of bitnet.cpp

Step 1: Cloning the official repository – 
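
The clone command looks like this (the --recursive flag pulls in the bundled llama.cpp submodule that the build depends on):

    git clone --recursive https://github.com/microsoft/BitNet.git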

Step 2: Changing the present working directory – 
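
Move into the cloned repository:

    cd BitNet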

Step 3: Installing the required dependencies using the requirements.txt – 
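
The repository’s README also suggests creating a fresh conda environment before installing; a sketch of both steps:

    # optional: isolate the dependencies in a dedicated environment
    conda create -n bitnet-cpp python=3.9
    conda activate bitnet-cpp

    pip install -r requirements.txt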

Step 4: Downloading the 1-bit model from Hugging Face and converting it into GGUF format –
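
One way to do this is with the repository’s setup_env.py helper, which downloads the model and prepares a quantised GGUF build. The Hugging Face repo name and the i2_s quantisation type below are taken from the project’s README examples; check the current README for the supported options:

    python setup_env.py --hf-repo HF1BitLLM/Llama3-8B-1.58-100B-tokens -q i2_s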

Step 5: Running inference with the run_inference.py script –
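
A sketch of the inference call (the prompt here is a made-up example; -m points at the GGUF model produced in Step 4, -p is the prompt, -n the number of tokens to generate, and -temp the sampling temperature, following the repository’s README):

    python run_inference.py \
        -m models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf \
        -p "Explain 1-bit LLMs in one sentence." \
        -n 64 -temp 0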

Output – the generated completion is printed to the terminal.

Final Words

1-bit LLMs are significantly smaller than traditional LLMs, and the reduced precision also leads to faster computation, especially when the hardware is optimised for bitwise operations. bitnet.cpp’s optimised kernels offer a substantial improvement for 1-bit LLM inferencing, supporting faster inference and lower energy consumption. It is still in its infancy but is poised to bring significant change to LLM inferencing and its energy footprint.

References

  1. Link to Code
  2. The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
  3. 1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs
