Groq is an AI infrastructure company that unveiled its groundbreaking LPU Inference Engine, which delivers exceptional quality, compute speed and energy efficiency. Groq reports up to 18x faster output token throughput running Meta’s Llama 2 70B on its LPU Inference Engine, outperforming all the cloud-based inference providers it measured against. More recently, it demonstrated a step-change in speed on its LPU chips, reaching 284 tokens per second on Llama 3 70B and 876 tokens per second on Llama 3 8B.
This article showcases rapid LLM inference using Meta’s Llama 3 model and Groq’s API.
Table of Contents
- Understanding Groq’s LPU Inference Engine
- Benchmarks
- Using Groq API for Implementing Rapid LLM Inference Systems
Understanding Groq’s LPU Inference Engine
Groq invented a new processor called the LPU, which stands for Language Processing Unit. Unlike traditional processors (CPUs or GPUs) that excel at processing many tasks in parallel, the LPU adopts a different approach. It utilises a single-core architecture designed for sequential processing, which is ideal for tasks such as running LLMs that process information sequentially.
This focus on sequential processing allows the LPU to achieve:
Exceptional speed – By streamlining the processing flow, LPUs can run LLMs much faster than traditional systems (CPUs or GPUs).
Reduced energy consumption – LPUs operate with lower power consumption since they don’t carry the overhead of managing and executing multiple tasks concurrently.
High accuracy – The streamlined processing also helps maintain accuracy even at lower precision levels.
Synchronous networking – Groq’s LPU maintains synchronous networking even in large-scale deployments with multiple LPUs, which ensures smooth communication and data flow across the system.
Instant memory accessibility – The LPU boasts instant access to memory, eliminating the delays associated with traditional memory access patterns.
Groq’s LPU technology represents a significant advancement in LLM infrastructure. The combination of hardware and software allows for faster, more energy efficient and accurate LLM processing.
Benchmarks
Artificial Analysis provides in-depth benchmarks across the categories listed below:
Throughput vs. Price – Groq offers $0.64 per 1M tokens and achieves a throughput of 331 tokens per second for Llama 3 Instruct (70B) LLM.
Pricing (Input and Output) – Groq charges $0.59 per 1M input tokens and $0.79 per 1M output tokens, the lowest among providers such as Deepinfra, OctoAI, Replicate, AWS, Fireworks, Perplexity and Azure (see the sketch after this list for how these combine into the blended price above).
Throughput Over Time – Groq delivers 284 tokens per second for Llama 3 Instruct (70B), which is 3-11x faster than other providers in this category.
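As a quick sanity check, the two per-token prices combine into the blended figure quoted above. The sketch below assumes a 3:1 input-to-output token blend; that weighting is an assumption on our part, not a figure stated in the benchmarks:
input_price = 0.59   # $ per 1M input tokens (from the pricing bullet above)
output_price = 0.79  # $ per 1M output tokens
blended = (3 * input_price + output_price) / 4  # assumed 3:1 input:output blend
print(f"Blended price: ${blended:.2f} per 1M tokens")  # prints $0.64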
Groq shows significant performance gains for LLM inference, and the benchmarks show its LPU achieving up to 18x faster output throughput than cloud-based providers running on traditional hardware.
Using Groq API for Implementing Rapid LLM Inference Systems
Step 1 – Log in to Groq’s Cloud Playground (https://console.groq.com/playground) and create an API key.
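Once the key is created, it has to be made available to the code. The later steps read it from Colab’s secret store via userdata; outside Colab, a plain environment variable works just as well. A minimal sketch, assuming the key has been exported as GROQ_API_KEY:
import os

# Assumes the key was exported beforehand, e.g. export GROQ_API_KEY="gsk_..."
groq_api_key = os.environ.get("GROQ_API_KEY")
if groq_api_key is None:
    raise RuntimeError("GROQ_API_KEY environment variable is not set")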
Step 2 – Installing and loading the relevant libraries
We need to install llama-index and llama-index-llms-groq to work with LLMs through Groq and llama-index. We will also install the groq Python library to use Groq directly and perform chat completions.
# Install the orchestration library, its Groq integration and the Groq client
!pip install llama-index llama-index-llms-groq groq

# Groq LLM wrapper from llama-index; Colab secret store for the API key
from llama_index.llms.groq import Groq
from google.colab import userdata
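The groq package installed above can also be used on its own for chat completions, without llama-index. A minimal sketch with the official client, aliased here to avoid clashing with the llama-index Groq class (the prompt is only an illustration):
from groq import Groq as GroqClient  # aliased so it does not shadow llama_index's Groq

client = GroqClient(api_key=userdata.get("GROQ_API_KEY"))

# One chat completion against Llama 3 70B served by Groq
chat_completion = client.chat.completions.create(
    model="llama3-70b-8192",
    messages=[{"role": "user", "content": "Explain what an LPU is in one sentence."}],
)
print(chat_completion.choices[0].message.content)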
Step 3 – Defining the LLM we will use for inference with Groq and passing in Groq’s API key so we can query it and print the response. Here, we are using the Llama 3 70B model. Groq currently supports the following models:
Llama 3 8B (Model ID – llama3-8b-8192)
Llama 3 70B (Model ID – llama3-70b-8192)
Mixtral 8x7B (Model ID – mixtral-8x7b-32768)
Gemma 7B (Model ID – gemma-7b-it)
Whisper – Private beta only (Model ID – whisper-large-v3)
# Instantiate Llama 3 70B served by Groq, reading the API key from Colab secrets
llm_groq = Groq(model="llama3-70b-8192", api_key=userdata.get("GROQ_API_KEY"))
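For interactive applications, the same wrapper can stream tokens as they are generated rather than waiting for the full completion. A brief sketch using llama-index’s stream_complete (the prompt is illustrative):
# Print tokens as they arrive; chunk.delta holds the newly generated text
for chunk in llm_groq.stream_complete("List three facts about the Llama 3 models."):
    print(chunk.delta, end="", flush=True)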
Step 4 – Running a query and timing its execution.
%%time
# Single completion call; %%time reports the wall-clock latency of this cell
response_groq = llm_groq.complete("Write the summary of The Lord of the Rings in less than 200 words")
print(response_groq)
Output
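Outside a notebook, where cell magics are not available, the latency can be measured directly with time.perf_counter. The throughput figure below is only a rough word-count proxy, not Groq’s reported benchmark metric:
import time

start = time.perf_counter()
response_groq = llm_groq.complete("Write the summary of The Lord of the Rings in less than 200 words")
elapsed = time.perf_counter() - start

# Whitespace-separated words as a crude stand-in for tokens
approx_words = len(str(response_groq).split())
print(f"Elapsed: {elapsed:.2f}s (~{approx_words / elapsed:.0f} words/s)")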
Step 5 – Defining chat messages with system and user roles. The system message defines what we want the model to act as (its role), whereas the user message carries the query the model responds to.
from llama_index.core.llms import ChatMessage

# The system message sets the assistant's role; the user message carries the query
messages = [
    ChatMessage(
        role="system", content="You are a research assistant who helps in researching"
    ),
    ChatMessage(role="user", content="What is meant by the Lord of the Rings?"),
]
resp = llm_groq.chat(messages)
print(resp)
Output
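Because the chat interface keeps the conversation as a list of ChatMessage objects, a follow-up turn only needs the previous reply and a new user message appended to that list. A short sketch (the follow-up question is illustrative):
# Append the assistant's reply, then ask a follow-up in the same conversation
messages.append(resp.message)
messages.append(ChatMessage(role="user", content="Who wrote it, and when was it first published?"))
followup = llm_groq.chat(messages)
print(followup)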
Groq’s API, combined with llama-index orchestration, generates responses rapidly without compromising quality or efficiency. Here, we were able to query the Llama 3 70B model in less than a second without any additional Hugging Face integration or model downloading. Groq can also be integrated with the Vercel AI SDK and LangChain, giving users more flexibility in their generative AI application development and inferencing.
Final Words
While established players offer powerful hardware, Groq’s LPU is designed specifically for the unique needs of LLMs. This focus translates into potential benefits in speed, efficiency and accuracy when running LLMs. It is worth noting that Groq’s LPU technology is still relatively new, and it should continue to improve as it matures.
References
- Link to Code
- Groq’s LPU Inference Engine Documentation
- Groq’s LPU Whitepaper
- Groq Benchmarks
- Independent Analysis of AI-Language Models and API Providers