Microsoft has unveiled a method to significantly speed up Large Language Model (LLM) inference by compressing prompts by up to 20x with minimal performance loss. This is made possible by their library, LLMLingua, which can be integrated in just a couple of minutes. Here’s a comprehensive guide to get you started.
Introduction to LLMLingua
LLMLingua is a powerful library designed to compress prompts for LLMs, effectively reducing the token count and enhancing the speed of inference. This method uses a compact, well-trained language model (e.g., GPT-2 small, LLaMA-7B) to identify and remove non-essential tokens from prompts, making LLMs more efficient.
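To build intuition for the idea, here is a toy sketch of score-based token pruning. This is not LLMLingua's actual algorithm (which scores tokens with a small causal language model's perplexity); it fakes an "information score" with a stopword heuristic purely for illustration.

```python
# Toy illustration of prompt pruning (NOT LLMLingua's real algorithm,
# which scores tokens with a small language model).
# Here we fake an "information score": stopwords are low-information.
COMMON = {"the", "a", "an", "of", "to", "and", "is", "are", "in", "for"}

def toy_compress(prompt: str, keep_ratio: float = 0.5) -> str:
    tokens = prompt.split()
    # Score each token; common stopwords get the lowest score.
    scored = [(0 if t.lower().strip(".,") in COMMON else len(t), i, t)
              for i, t in enumerate(tokens)]
    keep = sorted(scored, reverse=True)[: max(1, int(len(tokens) * keep_ratio))]
    # Restore original order so the compressed prompt stays readable.
    return " ".join(t for _, i, t in sorted(keep, key=lambda x: x[1]))

print(toy_compress("Summarize the main points of the report for the team"))
# Summarize main points report team
```

The real library replaces this crude heuristic with learned token-level importance, which is why it can compress aggressively without destroying meaning.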
Benefits
- Cost Reduction: Achieves up to 20x compression, significantly lowering computational costs.
- Performance Boost: Enhances LLM inference speed without a substantial loss in performance.
- Ease of Use: Can be integrated with minimal code changes.
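To put the cost claim in perspective, here is a back-of-the-envelope calculation. The per-token price below is illustrative, not a current API rate:

```python
# Back-of-the-envelope saving from 20x prompt compression.
# The price below is an assumption for illustration, not a real API rate.
PRICE_PER_1K_INPUT_TOKENS = 0.01  # dollars (assumed)

def input_cost(tokens: int) -> float:
    return tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS

original, compressed = 20_000, 1_000  # 20x compression
saving = input_cost(original) - input_cost(compressed)
print(f"${saving:.2f} saved per call")  # $0.19 saved per call
```

At scale, shaving 95% off every prompt's input tokens compounds quickly across millions of calls.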
Quick Start Guide
Step 1: Install LLMLingua
To begin, install the LLMLingua library using pip:
```bash
pip install llmlingua
```
Step 2: Import and Initialize LLMLingua
Import the necessary modules and initialize LLMLingua in your Python environment:
```python
from llmlingua import PromptCompressor

llm_lingua = PromptCompressor()
```
Step 3: Compress Prompts
You can now use LLMLingua to compress your prompts. Here’s an example of how to compress a sample prompt:
```python
prompt = "Sam bought a dozen boxes, each with 30 highlighter pens inside, for $10 each box. He rearranged five of the boxes into packages of six highlighters each and sold them for $3 per package. He sold the rest of the highlighters separately at the rate of three pens for $2. How much did he make in total?"

compressed_prompt = llm_lingua.compress_prompt(prompt, instruction="", question="", target_token=200)
print(compressed_prompt)
```
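Note that compress_prompt returns a dictionary rather than a bare string. Here is a minimal sketch of unpacking it; the key names follow the project's README, but the dict below is a stub so the snippet runs without downloading a model:

```python
# Illustrative shape of the result (key names assumed from LLMLingua's
# README); a real call would be:
#   result = llm_lingua.compress_prompt(prompt, target_token=200)
result = {
    "compressed_prompt": "Sam bought dozen boxes 30 highlighter pens...",
    "origin_tokens": 98,
    "compressed_tokens": 52,
    "ratio": "1.9x",
}

def summarize_compression(result):
    """Report how many tokens the compression removed."""
    removed = result["origin_tokens"] - result["compressed_tokens"]
    return f"Removed {removed} tokens ({result['ratio']} compression)"

print(summarize_compression(result))
```

In practice you would send `result["compressed_prompt"]` to your LLM in place of the original prompt.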
Step 4: Advanced Usage with Specific Models
LLMLingua supports various models and configurations. For instance, you can use the LLMLingua-2 model for more efficient prompt compression:
```python
llm_lingua = PromptCompressor("microsoft/llmlingua-2-xlm-roberta-large-meetingbank", use_llmlingua2=True)
compressed_prompt = llm_lingua.compress_prompt(prompt, rate=0.33, force_tokens=['\n', '?'])
print(compressed_prompt)
```
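The force_tokens argument pins tokens such as newlines and question marks so they always survive compression. A toy illustration of that concept (not the library's implementation):

```python
# Toy illustration of force_tokens: pinned tokens survive even when a
# score-based filter would drop them (NOT LLMLingua's implementation).
def filter_tokens(scored_tokens, threshold, force=frozenset()):
    """scored_tokens: list of (token, score); keep high scores or forced."""
    return [t for t, s in scored_tokens if s >= threshold or t in force]

scored = [("Item", 0.9), ("11", 0.8), ("?", 0.1), ("low", 0.2)]
print(filter_tokens(scored, threshold=0.5, force={"?"}))
# ['Item', '11', '?']
```

Forcing punctuation like `?` and `\n` preserves question boundaries and line structure, which downstream models often rely on.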
Step 5: Structured Prompt Compression
For more advanced scenarios, you can split your text into sections and apply different compression rates:
```python
structured_prompt = """<llmlingua, compress=False>Speaker 4:</llmlingua><llmlingua, rate=0.4> Thank you. And can we do the functions for content? Items I believe are 11, three, 14, 16 and 28, I believe.</llmlingua><llmlingua, compress=False>
Speaker 0:</llmlingua><llmlingua, rate=0.4> Item 11 is a communication from Council on Price recommendation to increase appropriation in the general fund group in the City Manager Department by $200 to provide a contribution to the Friends of the Long Beach Public Library. Item 12 is communication from Councilman Super Now. Recommendation to increase appropriation in the special advertising and promotion fund group and the city manager's department by $10,000 to provide support for the end of summer celebration. Item 13 is a communication from Councilman Austin. Recommendation to increase appropriation in the general fund group in the city manager department by $500 to provide a donation to the Jazz Angels. Item 14 is a communication from Councilman Austin. Recommendation to increase appropriation in the general fund group in the City Manager department by $300 to provide a donation to the Little Lion Foundation. Item 16 is a communication from Councilman Allen recommendation to increase appropriation in the general fund group in the city manager department by $1,020 to provide contribution to Casa Korero, Sew Feria Business Association, Friends of Long Beach Public Library and Dave Van Patten. Item 28 is a communication. Communication from Vice Mayor Richardson and Council Member Muranga. Recommendation to increase appropriation in the general fund group in the City Manager Department by $1,000 to provide a donation to Ron Palmer Summit. Basketball and Academic Camp.</llmlingua><llmlingua, compress=False>
Speaker 4:</llmlingua><llmlingua, rate=0.6> We have a promotion and a second time as councilman served Councilman Ringa and customers and they have any comments.</llmlingua>"""

compressed_prompt = llm_lingua.structured_compress_prompt(structured_prompt, instruction="", question="", rate=0.5)
print(compressed_prompt['compressed_prompt'])
```
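To see how per-section rates shape the overall token budget, here is a sketch of the arithmetic. The section token counts below are hypothetical; the rates mirror the example above:

```python
# Sketch: token budget implied by per-section compression rates.
# Section token counts are hypothetical; rates mirror the example above.
sections = [
    ("Speaker 4 label", 4, None),   # compress=False -> kept verbatim
    ("Speaker 4 turn", 40, 0.4),
    ("Speaker 0 items", 300, 0.4),
    ("Speaker 4 close", 30, 0.6),
]

def budget(sections):
    total = kept = 0
    for _, n_tokens, rate in sections:
        total += n_tokens
        kept += n_tokens if rate is None else round(n_tokens * rate)
    return total, kept

total, kept = budget(sections)
print(f"{kept}/{total} tokens kept ({kept / total:.0%})")
# 158/374 tokens kept (42%)
```

This is why speaker labels are marked compress=False: they cost only a handful of tokens but anchor the structure of the transcript.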
Conclusion
Microsoft’s LLMLingua provides a remarkable improvement in LLM inference efficiency, enabling up to 20x compression with minimal performance loss. This library is a game-changer for those looking to reduce costs and enhance performance in their AI applications.
For more details and advanced usage scenarios, visit the LLMLingua GitHub repository.
By integrating LLMLingua, you can seamlessly accelerate your language model applications and achieve substantial cost savings.