Mastering Data Compression with LLMs via LMCompress

LMCompress uses large language models to achieve state-of-the-art lossless compression across text, image, audio, and video by approximating Solomonoff induction.

Traditional data compression methods are reaching their limits as data demands grow. LMCompress, a new framework, uses the predictive power of large language models to achieve lossless compression by “understanding” data such as text, images, videos, and audio. The approach, based on approximating Solomonoff induction, yields markedly better compression than conventional codecs. This article explores LMCompress’s architecture, principles, and applications.

Table of Contents

  • What is LMCompress?
  • Architecture Breakdown
  • Core Capabilities
  • Compression Across Modalities
  • Technical Deep Dive into LMCompress
  • Real-World Performance

Let’s start by understanding what LMCompress is.

What is LMCompress?

LMCompress is a lossless, general-purpose data compression method that uses LLMs to produce predictive token distributions for arithmetic coding. By converting different kinds of data (text, image, audio, and video) into token sequences and feeding them into customized LLMs, LMCompress compresses data even more effectively than conventional techniques such as FLAC, H.264, or JPEG-XL.

The innovation lies in its Kolmogorov-style compression paradigm, in which compression quality is directly tied to how well a model understands the data, replacing computable heuristics with a practical approximation of Solomonoff induction. The result is a doubling, and in some cases a quadrupling, of the compression ratios of conventional codecs.

Architecture Breakdown

LMCompress consists of a three-stage architecture that collectively enables its powerful and general-purpose compression capabilities.

Tokenization

In the first stage, data is transformed into a sequence of tokens that a generative model can consume. This step is tailored to the data type: images are linearized into pixel sequences, audio signals are converted into strings by mapping each byte to an ASCII character, and text is divided into standard token chunks. This ensures compatibility with transformer-based architectures.
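As a concrete sketch, the per-modality tokenizers can be as simple as the following. The function names and details here are illustrative assumptions, not the paper’s actual implementation:

```python
# Illustrative modality-specific tokenizers; names and details are
# assumptions, not LMCompress's actual pipeline code.

def tokenize_image(pixels):
    """Linearize a 2-D grid of 0-255 pixel values in raster order."""
    return [p for row in pixels for p in row]

def tokenize_audio(frame_bytes):
    """Map each audio byte to a character: the 'audio-as-string' view."""
    return "".join(chr(b) for b in frame_bytes)

# A 2x2 grayscale image becomes a flat 4-token sequence.
tokenize_image([[0, 255], [128, 64]])   # [0, 255, 128, 64]
tokenize_audio(bytes([72, 105]))        # 'Hi'
```

Whatever the modality, the output is a one-dimensional sequence the transformer can model autoregressively.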

Predictive Modeling

Once tokenization is done, the sequence is passed to a large generative model trained to predict the probability of the next token. For visual data such as images and videos, LMCompress uses iGPT, an autoregressive transformer optimized for pixel prediction. For audio and text, it uses domain-adapted versions of the LLaMA model, fine-tuned to capture specific signal or language patterns with high fidelity.
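The interface the coder needs from the model is just “context in, next-token distribution out.” Below, a toy bigram table stands in for iGPT or a fine-tuned LLaMA, purely to show that interface; the counts are invented for illustration:

```python
# Toy stand-in for the predictive model. In LMCompress this role is
# played by iGPT (pixels) or a fine-tuned LLaMA (audio/text); the
# bigram counts below are purely illustrative.
BIGRAM_COUNTS = {"a": {"a": 1, "b": 3}, "b": {"a": 2, "b": 2}}

def next_token_probs(context):
    """Return P(next token | last token of context) from the toy table."""
    counts = BIGRAM_COUNTS[context[-1]]
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

next_token_probs("xa")  # {'a': 0.25, 'b': 0.75}
```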

LMCompress architecture

Arithmetic Coding

In the final step, the actual compression is carried out by arithmetic coding over the predicted token distributions. By allocating shorter codes to higher-likelihood tokens, arithmetic coding encodes the data into a compact binary form. Because the token probabilities come from a model with a thorough comprehension of the material, the resulting compression is substantially more efficient than conventional entropy-based coding.
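A minimal interval-narrowing sketch shows why better predictions mean shorter codes: the final interval width equals the product of the token probabilities, so the code length approaches -Σ log₂ p(token). This is a didactic coder under fixed probabilities, not the paper’s implementation:

```python
import math

def encode_interval(tokens, probs):
    """Narrow [0, 1) once per token; the final width is the product
    of the encoded tokens' probabilities."""
    low, high = 0.0, 1.0
    for tok in tokens:
        width = high - low
        cum = 0.0
        for sym in sorted(probs):
            if sym == tok:
                high = low + (cum + probs[sym]) * width  # uses old low
                low = low + cum * width
                break
            cum += probs[sym]
    return low, high

low, high = encode_interval("ab", {"a": 0.75, "b": 0.25})
# width = 0.75 * 0.25 = 0.1875, so the code needs about
# -log2(0.1875) ~= 2.4 bits (3 after rounding up).
bits = math.ceil(-math.log2(high - low))
```

A real coder updates `probs` from the model after every token and emits bits incrementally, but the bit-count intuition is the same.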

Core Capabilities

LMCompress introduces several notable innovations:

Universal Compression Framework

It supports a wide range of data formats, including text, images, audio, and video. This unified architecture eliminates the need for a multitude of format-specific codecs, making the system versatile and extensible.

Model-Aware Adaptability

The system uses domain-specific fine-tuning to adapt large models to particular types of data. This yields more accurate probability estimates, boosting compression ratios on specialized datasets.

Key Features of LMCompress

Modular Tokenization

LMCompress uses distinct tokenization techniques for different media types. These modular layers optimize how data is presented to the model, ensuring efficiency and compatibility across formats.

Lossless & Lossy Compression

Although it excels at lossless compression, it can also be used in lossy settings. When full accuracy is not required, methods such as diffusion-based generation improve perceptual quality.

Outperforms All Baselines

It consistently exceeds traditional compression methods such as JPEG-XL, FLAC, and H.264. In some benchmarks it achieves nearly 4× better compression, setting new standards for data efficiency.

Compression Across Modalities

To achieve the best possible comprehension and compression, LMCompress tailors its generative models and tokenization methods to the specific characteristics of various data types.

Image Compression

To estimate next-pixel probabilities for images, LMCompress uses an image-GPT (iGPT) model, selected for its autoregressive nature and extensive training on visual data. The image’s pixels are concatenated into a one-dimensional sequence, which is then passed into iGPT for compression.

Video Compression

In lossless video compression, LMCompress treats each frame as an independent image and compresses it frame-by-frame using iGPT, bypassing the need for large autoregressive video models that output probabilities. For lossy video compression, LMCompress extends the “generative compression” idea, using DCVC results as a prior and sampling from a diffusion model (DDPM) to generate details for reconstruction.
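The frame-by-frame lossless scheme can be sketched as below, with `zlib` standing in for the per-frame iGPT-plus-arithmetic-coding pipeline so that the example runs end to end:

```python
import zlib

def compress_video(frames):
    """Compress each frame independently, mirroring LMCompress's
    treatment of every frame as a standalone image (zlib is a
    stand-in for the iGPT + arithmetic-coding pipeline)."""
    return [zlib.compress(frame) for frame in frames]

def decompress_video(compressed):
    return [zlib.decompress(chunk) for chunk in compressed]

frames = [bytes([i]) * 64 for i in range(3)]
assert decompress_video(compress_video(frames)) == frames  # lossless round trip
```

Treating frames independently trades away inter-frame redundancy for simplicity: no autoregressive video model is needed, only a strong image model.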

Audio Compression

To achieve lossless audio compression, LMCompress processes audio at the signal level, avoiding discretization. Audio frames, represented as bytes, are mapped into ASCII characters, creating an “audio-as-string” representation. A large language model (LLAMA3-8B) is then fine-tuned with a low-rank adaptation layer on this audio-as-string data to estimate next-token probabilities.

Text Compression

For text compression, LMCompress uses domain-specific fine-tuning of LLMs. Adding an adaptation layer and fine-tuning an LLM (LLaMA3-8B) on domain-specific texts deepens the model’s understanding of the domain’s features, which improves compression ratios.

Technical Deep Dive into LMCompress

The success of LMCompress lies in its adaptive nature and careful handling of domain-specific characteristics. Key optimization strategies include:

Domain-Specific Models

Rather than relying on a single, general LLM, LMCompress uses models specially trained or optimized for each type of data (e.g., iGPT for images and videos, fine-tuned LLaMA3-8B for audio and domain-specific texts). This ensures that the generative model “understands” the subtleties and patterns specific to that kind of input.

Tokenization and Context Management

To fit within the generative models’ context-window constraints, data is carefully tokenized and split. Image pixels are transformed into a one-dimensional sequence; audio amplitude data is converted to ASCII characters.
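Fitting a long token sequence into a fixed context window amounts to simple chunking. The window size below is illustrative; real model windows are far larger:

```python
def split_to_context(tokens, context_len=4):
    """Split a token sequence into consecutive context-sized chunks
    (context_len=4 is illustrative; real windows are much larger)."""
    return [tokens[i:i + context_len] for i in range(0, len(tokens), context_len)]

split_to_context(list(range(10)))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```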

Arithmetic Coding Integration

The generative models’ predicted probabilities are combined seamlessly with arithmetic coding, a lossless compression method. This enables extremely efficient encoding grounded in the model’s comprehension.

Lossy/Lossless Flexibility

Although its primary focus is lossless compression, LMCompress adapts to lossy settings as well, as demonstrated by its lossy video method, which uses generative models to recreate detail rather than merely remove redundancy.

Real-World Performance

LMCompress has proven to perform better than both general-purpose LLMs and conventional lossless compression algorithms on a wide range of data types. The main metric is the compression ratio, defined as the original data size divided by the compressed data size; higher ratios indicate better performance.
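The metric itself is straightforward:

```python
def compression_ratio(original_size, compressed_size):
    """Ratio of original to compressed size; higher is better."""
    return original_size / compressed_size

# e.g., 1000 bytes compressed to 250 bytes:
compression_ratio(1000, 250)  # 4.0
```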

Image Compression

LMCompress more than doubles the compression ratios of state-of-the-art methods like JPEG-XL, PNG, WebP, and JPEG-2000 on datasets like CLIC2019 and ILSVRC2017. For instance, on CLIC2019, LMCompress achieved a ratio of 6.32 compared to JPEG-XL’s 2.931.

Lossless Video Compression

On Xiph.org video datasets (static and dynamic scenes), LMCompress showed over 20% improvement on static scenes and at least 50% improvement on dynamic scenes compared to baselines like FFV1, H.264, and H.265.

Lossy Video Compression

LMCompress more than doubles the compression ratio of DCVC and DCVC-FM on CIPR SIF Sequences, while maintaining or improving PSNR and FID scores.

Audio Compression

LMCompress outperforms FLAC by 25%-94% and other large-model based methods by 28%-55% on LibriSpeech and LJSpeech datasets.

Text Compression

LMCompress nearly triples the compression ratios of traditional methods like zlib, bzip2, and brotli on domain-specific texts like MeDAL (medicine) and Pile of Law (legal). It also significantly outperforms raw LLAMA3-8B, with improvements of 8.5% on MeDAL and 38.4% on Pile of Law.

Final Thoughts

LMCompress ushers in a new era of data compression powered by deep understanding. Its architecture, inspired by Solomonoff induction, not only beats prior benchmarks but redefines compression as an intelligent process rooted in prediction and adaptation. As LLMs continue to scale, LMCompress could reshape domains from 6G communication to storage, streaming, and even secure transmission.


Aniruddha Shrikhande