Traditional data compression faces limitations with increasing data demands. LMCompress, a new framework, uses the predictive power of large language models to achieve lossless compression by “understanding” data like text, images, videos, and audio. This approach, based on approximating Solomonoff induction, offers novel compression methods. This article explores LMCompress’s architecture, principles, and applications.
Table of Contents
- What is LMCompress?
- Architecture Breakdown
- Core Capabilities
- Compression Across Modalities
- Technical Deep Dive into LMCompress
- Real-World Performance
Let’s start by understanding what LMCompress is.
What is LMCompress?
LMCompress is a lossless, general-purpose data compression method that uses large language models (LLMs) to produce predictive token distributions for arithmetic coding. By converting different kinds of data (text, images, audio, and video) into token sequences and feeding them into customized LLMs, LMCompress compresses data even more effectively than conventional techniques such as FLAC, H.264, or JPEG-XL.
The innovation lies in its Kolmogorov-compression paradigm, where compression quality is directly tied to how well a model understands the data: instead of hand-crafted, computable heuristics, it relies on LLM predictions that approximate Solomonoff induction. This results in doubling or even quadrupling the compression ratios of conventional codecs.
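The link between understanding and compression can be made concrete: under arithmetic coding, a token predicted with probability p costs about -log2(p) bits, so a model that assigns higher probability to the true next token spends fewer bits on it. A minimal illustration (numbers are illustrative, not from the paper):

```python
import math

# The cost of encoding one token under arithmetic coding is roughly
# -log2(p) bits, where p is the probability the model assigned to it.
# A model that "understands" the data assigns higher p, so fewer bits.
for p in (0.5, 0.9, 0.99):
    print(f"p = {p:<4} -> {-math.log2(p):.3f} bits")
```

A token predicted at 50% confidence costs a full bit, while one predicted at 99% costs only about 0.014 bits, which is why better prediction translates directly into better compression.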
Architecture Breakdown
LMCompress consists of a three-stage architecture that collectively enables its powerful and general-purpose compression capabilities.
Tokenization
In the first stage, data is transformed into a structured sequence of tokens that a generative model can consume. This step is tailored to the data type: images are linearized into sequences of pixels, audio signals are converted into strings of ASCII characters by mapping each byte, and text is divided directly into standard token chunks. This ensures compatibility with transformer-based architectures.
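The three tokenization paths above can be sketched as follows. The function names and the text chunk size are illustrative assumptions, not the paper's actual code:

```python
# Hypothetical sketch of the three modality-specific tokenization paths.
# Names and chunk sizes are assumptions for illustration only.

def tokenize_image(pixels_2d):
    """Linearize a 2-D grid of pixel values into a 1-D sequence."""
    return [p for row in pixels_2d for p in row]

def tokenize_audio(audio_bytes):
    """Map each byte of the audio signal to a single character."""
    return "".join(chr(b) for b in audio_bytes)

def tokenize_text(text, chunk=4):
    """Stand-in for a subword tokenizer: fixed-size character chunks."""
    return [text[i:i + chunk] for i in range(0, len(text), chunk)]

print(tokenize_image([[10, 20], [30, 40]]))  # [10, 20, 30, 40]
print(tokenize_audio(b"\x41\x42"))           # AB
print(tokenize_text("lossless compression"))
```

Each path produces a flat sequence, which is exactly what an autoregressive transformer expects as input.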
Predictive Modeling
Once tokenization is done, the sequence is passed into a large generative model trained to predict the probability of the next token. For visual data such as images and videos, LMCompress uses iGPT, an autoregressive transformer optimized for pixel prediction, whereas for audio and text it uses domain-adapted versions of the LLaMA model, fine-tuned to capture specific signal or language patterns with high fidelity.
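The predictor's job is simply to emit a probability distribution over the next token given the context. A toy bigram-frequency model makes the interface clear; in LMCompress this role is played by iGPT or a fine-tuned LLaMA, but the contract is the same:

```python
from collections import Counter, defaultdict

# Toy stand-in for the generative model: a bigram frequency model that
# returns a distribution over the next token given the previous one.
class BigramModel:
    def __init__(self, tokens):
        self.counts = defaultdict(Counter)
        for prev, nxt in zip(tokens, tokens[1:]):
            self.counts[prev][nxt] += 1

    def next_token_probs(self, prev):
        c = self.counts[prev]
        total = sum(c.values())
        return {tok: n / total for tok, n in c.items()}

model = BigramModel(list("abababac"))
print(model.next_token_probs("a"))  # {'b': 0.75, 'c': 0.25}
```

Whatever produces these probabilities, the downstream arithmetic coder consumes them unchanged, which is what makes the architecture modular.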
LMCompress architecture
Arithmetic Coding
In the last step, the actual compression is carried out using arithmetic coding driven by the predicted token distributions. By allocating shorter codes to higher-likelihood tokens, this technique encodes the data into a compact binary form. Because the token probabilities come from a model with a thorough comprehension of the material, the resulting compression is substantially more efficient than conventional entropy-based coding techniques.
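Arithmetic coding works by repeatedly narrowing an interval in [0, 1): each token shrinks the interval to the sub-range assigned to it, and wider sub-ranges (high-probability tokens) shrink it less, so the final interval needs fewer bits to identify. A minimal infinite-precision sketch, assuming a fixed distribution for clarity (LMCompress instead refreshes the distribution from the model at every step, and a production coder emits bits incrementally):

```python
from fractions import Fraction

# Fixed toy distribution; LMCompress would query the LLM at each step.
PROBS = {"a": Fraction(8, 10), "b": Fraction(1, 10), "c": Fraction(1, 10)}

def cum_intervals(probs):
    """Assign each token a sub-interval of [0, 1) of width = its probability."""
    lo, out = Fraction(0), {}
    for tok, p in probs.items():
        out[tok] = (lo, lo + p)
        lo += p
    return out

def encode(tokens):
    low, high = Fraction(0), Fraction(1)
    ivals = cum_intervals(PROBS)
    for t in tokens:
        lo, hi = ivals[t]
        low, high = low + (high - low) * lo, low + (high - low) * hi
    return (low + high) / 2  # any number in [low, high) identifies the message

def decode(x, n):
    out, ivals = [], cum_intervals(PROBS)
    for _ in range(n):
        for tok, (lo, hi) in ivals.items():
            if lo <= x < hi:
                out.append(tok)
                x = (x - lo) / (hi - lo)  # rescale back to the unit interval
                break
    return out

msg = list("aabac")
code = encode(msg)
assert decode(code, len(msg)) == msg
```

Likely tokens barely shrink the interval, so runs of well-predicted tokens compress into very few bits, which is precisely where the LLM's predictive accuracy pays off.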
Core Capabilities
LMCompress introduces several groundbreaking innovations, including:
Universal Compression Framework
It supports a wide range of data formats, including text, images, audio, and video. This unified architecture eliminates the need for a multitude of format-specific codecs, making the system versatile and extensible.
Model-Aware Adaptability
The system uses domain-specific fine-tuning to adapt large models to particular types of data. This yields more accurate probability estimates, boosting compression ratios across specialized datasets.
Key Features of LMCompress
Modular Tokenization
LMCompress uses distinct tokenization techniques for different media types. These modular layers optimize how data is presented to the model, ensuring efficiency and compatibility across many formats.
Lossless & Lossy Compression
Although it performs exceptionally well in lossless compression, it can also be used in lossy settings. When full accuracy is not required, methods such as diffusion-based generation improve perceived quality.
Outperforms All Baselines
It consistently outperforms traditional compression methods like JPEG-XL, FLAC, and H.264. In some benchmarks, it achieves nearly 4× better compression, setting new standards for data efficiency.
Compression Across Modalities
To achieve the best possible comprehension and compression, LMCompress tailors its generative models and tokenization methods to the specific characteristics of various data types.
Image Compression
For images, LMCompress uses an image-GPT model (iGPT) to estimate next-pixel probabilities, chosen for its extensive training on visual data and its autoregressive nature. Pixels are concatenated into a one-dimensional sequence and passed into iGPT for compression.
Video Compression
In lossless video compression, LMCompress treats each frame as an independent image and compresses it frame-by-frame using iGPT, bypassing the need for large autoregressive video models that output probabilities. For lossy video compression, LMCompress extends the “generative compression” idea, using DCVC results as a prior and sampling from a diffusion model (DDPM) to generate details for reconstruction.
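The lossless video path reduces to a simple control flow: compress each frame independently and concatenate the results. In the sketch below, `zlib` is a deliberate stand-in for the iGPT + arithmetic-coding pipeline, purely to show the frame-by-frame structure:

```python
import zlib

# Frame-by-frame lossless video compression, as described above.
# zlib stands in for the iGPT + arithmetic-coding pipeline; only the
# control flow (each frame handled independently) mirrors LMCompress.
def compress_video(frames):
    return [zlib.compress(frame) for frame in frames]

def decompress_video(blobs):
    return [zlib.decompress(b) for b in blobs]

frames = [bytes([i] * 64) for i in range(3)]  # three tiny fake frames
blobs = compress_video(frames)
assert decompress_video(blobs) == frames
```

Treating frames independently sacrifices inter-frame redundancy but avoids the need for a large autoregressive video model, which is the trade-off the paper makes for the lossless case.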
Audio Compression
To achieve lossless audio compression, LMCompress processes audio at the signal level, avoiding discretization. Audio frames, represented as bytes, are mapped into ASCII characters, creating an “audio-as-string” representation. A large language model (LLaMA3-8B) is then fine-tuned with a low-rank adaptation layer on this audio-as-string data to estimate next-token probabilities.
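One lossless way to realize the byte-to-ASCII mapping is to render each byte as two hexadecimal characters; this exact scheme is an assumption here, since the paper only states that bytes become ASCII characters:

```python
# "Audio-as-string": represent raw signal bytes as ASCII text so a
# text LLM can model them. The hex-nibble scheme below is an assumed
# illustration; the paper's exact byte-to-character mapping may differ.
def audio_to_string(audio_bytes):
    """Map each byte to two ASCII hex characters (lossless)."""
    return audio_bytes.hex()

def string_to_audio(s):
    return bytes.fromhex(s)

frame = bytes([0, 127, 128, 255])
s = audio_to_string(frame)
print(s)  # 007f80ff
assert string_to_audio(s) == frame
```

The key property is invertibility: because no information is lost in the mapping, the compression pipeline built on top of it remains lossless end to end.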
Text Compression
For text compression, LMCompress uses domain-specific fine-tuning of LLMs: an adaptation layer is added and an LLM (LLaMA3-8B) is fine-tuned on domain-specific texts. The model’s deeper understanding of the domain’s features yields improved compression ratios.
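The low-rank adaptation setup could look like the following sketch using the Hugging Face `peft` library. The model name, rank, and target modules are illustrative assumptions; the paper does not publish this exact configuration:

```python
# Assumed LoRA fine-tuning configuration, not the paper's published setup.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
lora = LoraConfig(
    r=8,                                  # low-rank bottleneck (assumed)
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # adapt attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)        # then fine-tune on domain texts
```

Because LoRA trains only small adapter matrices, a separate lightweight adapter can be kept per domain (medical, legal, etc.) while sharing one frozen base model.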
Technical Deep Dive into LMCompress
The success of LMCompress lies in its adaptive nature and careful handling of domain-specific characteristics. Key optimization strategies include:
Domain-Specific Models
LMCompress uses models that have been specially trained or optimized for each type of data rather than relying on a single general LLM (e.g., iGPT for images and videos, fine-tuned LLaMA3-8B for audio and domain-specific texts). This ensures that the generative model “understands” the subtleties and patterns specific to that kind of input.
Tokenization and Context Management
To fit within the generative models’ context-window constraints, data is carefully tokenized and split: image pixels are flattened into a one-dimensional sequence, and audio amplitude data is converted to ASCII characters.
Arithmetic Coding Integration
The generative models’ predicted probabilities are combined seamlessly with arithmetic coding, a lossless compression method. This enables extremely effective encoding grounded in the model’s comprehension of the data.
Lossy/Lossless Flexibility
Despite its primary focus on lossless compression, LMCompress adapts to lossy settings, as demonstrated by its video compression method, which uses generative models to recreate detail rather than merely removing redundancy.
Real-World Performance
LMCompress has proven to perform better than both general-purpose LLMs and conventional lossless compression algorithms on a wide range of data types. The main metric is the compression ratio, the ratio of the original data size to the compressed data size; higher ratios indicate better performance.
Image Compression
LMCompress more than doubles the compression ratios of state-of-the-art methods like JPEG-XL, PNG, WebP, and JPEG-2000 on datasets like CLIC2019 and ILSVRC2017. For instance, on CLIC2019, LMCompress achieved a ratio of 6.32 compared to JPEG-XL’s 2.931.
Lossless Video Compression
On Xiph.org video datasets (static and dynamic scenes), LMCompress showed over 20% improvement on static scenes and at least 50% improvement on dynamic scenes compared to baselines like FFV1, H.264, and H.265.
Lossy Video Compression
LMCompress more than doubles the compression ratio of DCVC and DCVC-FM on CIPR SIF Sequences, while maintaining or improving PSNR and FID scores.
Audio Compression
LMCompress outperforms FLAC by 25%-94% and other large-model based methods by 28%-55% on LibriSpeech and LJSpeech datasets.
Text Compression
LMCompress nearly triples the compression ratios of traditional methods like zlib, bzip2, and brotli on domain-specific texts like MeDAL (medicine) and Pile of Law (legal). It also significantly outperforms raw LLAMA3-8B, with improvements of 8.5% on MeDAL and 38.4% on Pile of Law.
Final Thoughts
LMCompress ushers in a new era of data compression powered by deep understanding. Its architecture, inspired by Solomonoff induction, not only beats prior benchmarks but redefines compression as an intelligent process rooted in prediction and adaptation. As LLMs continue to scale, LMCompress could reshape domains from 6G communication to storage, streaming, and even secure transmission.