Implementing Rapid LLM Inferencing using Groq

Discover and implement Groq's API for faster LLM inferencing with exceptional speed and efficiency.

Groq is an AI infrastructure company that unveiled its groundbreaking LPU Inference Engine, which delivers exceptional quality, compute speed and energy efficiency. Groq reports output token throughput up to 18x faster for Meta’s Llama 2 70B running on its LPU Inference Engine, outperforming all other cloud-based inference providers benchmarked. More recently, it demonstrated a step-change in speed with its LPU chips, reaching 284 tokens per second on Llama 3 70B and 876 tokens per second on Llama 3 8B.

This article showcases rapid LLM inferencing using Meta’s Llama 3 model and Groq’s API.

Table of Contents

  1. Understanding Groq’s LPU Inference Engine
  2. Benchmarks
  3. Using Groq API for Implementing Rapid LLM Inference Systems

Understanding Groq’s LPU Inference Engine

Groq invented a new processing system called the LPU, which stands for Language Processing Unit. Unlike traditional processors (CPUs or GPUs), which excel at processing multiple tasks in parallel, the LPU adopts a different approach: it uses a single-core architecture designed for sequential processing, which is ideal for tasks such as running LLMs, where tokens are generated one after another.

(Image source: Groq’s LPU Whitepaper)

This focus on sequential processing allows the LPU to achieve: 

Exceptional speed – By streamlining the processing flow, LPUs can run LLMs much faster than traditional systems (CPUs or GPUs).

Reduced energy consumption – LPUs operate at lower power since they do not carry the overhead of managing and executing multiple tasks concurrently.

High accuracy – The streamlined processing also helps in maintaining accuracy even at lower precision levels. 

Synchronous networking – Groq’s LPU maintains synchronous networking even in large-scale deployments with multiple LPUs, which ensures smooth communication and data flow across the system.

Instant memory accessibility – The LPU boasts instant access to memory, eliminating the delays associated with traditional memory access patterns.

Groq’s LPU technology represents a significant advancement in LLM infrastructure. The combination of hardware and software allows for faster, more energy-efficient and more accurate LLM processing.

Benchmarks

Artificial Analysis provides in-depth benchmarks across the categories listed below:

Throughput vs. Price – Groq charges $0.64 per 1M tokens and achieves a throughput of 331 tokens per second for the Llama 3 Instruct (70B) model.

Pricing (Input and Output) – Groq charges $0.59 for input and $0.79 for output (per 1M tokens), the lowest when compared with providers such as Deepinfra, OctoAI, Replicate, AWS, Fireworks, Perplexity and Azure.

Throughput Over Time – Groq delivers 284 tokens per second for Llama 3 Instruct (70B), which is 3-11x faster than other providers in this category.

(Image source: Artificialanalysis.ai)

Key Comparison Metrics

Groq shows significant performance gains for LLM inference, with benchmarks putting its LPU at up to 18x faster output throughput than cloud-based providers running on traditional hardware.

Using Groq API for Implementing Rapid LLM Inference Systems

Step 1 –  Log in to Groq’s Cloud Playground (https://console.groq.com/playground) and create an API key.  
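The later steps assume the key is stored as a Colab secret named GROQ_API_KEY. If you are working outside Colab, a minimal alternative sketch (not part of the original walkthrough) is to export the key as an environment variable and read it with os.environ:

import os

# Assumes you ran `export GROQ_API_KEY=gsk_...` in your shell beforehand
groq_api_key = os.environ["GROQ_API_KEY"]  # raises KeyError if the variable is not set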

Step 2 – Installing and loading the relevant libraries 

We need to install llama-index and llama-index-llms-groq so we can run LLMs on Groq through LlamaIndex. Alongside these, we also install the groq Python library for working with Groq directly and performing chat completions (a standalone sketch of this follows the imports below).

!pip install llama-index llama-index-llms-groq groq

from llama_index.llms.groq import Groq  # LlamaIndex wrapper around Groq's API
from google.colab import userdata       # reads the GROQ_API_KEY stored in Colab secrets
import os                                # optional: read the key from an environment variable outside Colab
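
As noted above, the groq client can also be used on its own, without LlamaIndex, to run a chat completion. The snippet below is a minimal sketch using the official groq Python SDK; the model ID and prompt are illustrative:

from groq import Groq as GroqClient  # official SDK, aliased to avoid clashing with the LlamaIndex wrapper above

client = GroqClient(api_key=userdata.get("GROQ_API_KEY"))
chat_completion = client.chat.completions.create(
    model="llama3-8b-8192",  # any supported Groq model ID works here
    messages=[{"role": "user", "content": "Explain what an LPU is in one sentence."}],
)
print(chat_completion.choices[0].message.content)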

Step 3 – Defining the LLM we will use for inference with Groq and passing Groq’s API key so we can query the model and print the response. Here, we are using the Llama 3 70B model. Groq currently supports the following models: 

Llama 3 8B (Model ID – llama3-8b-8192)

Llama 3 70B (Model ID – llama3-70b-8192)

Mixtral 8x7B (Model ID – mixtral-8x7b-32768)

Gemma 7B (Model ID – gemma-7b-it)

Whisper – Private beta only (Model ID – whisper-large-v3)

llm_groq = Groq(model="llama3-70b-8192", api_key=userdata.get("GROQ_API_KEY"))
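
Only the model ID changes if you want to target a different model; for example, a hypothetical second instance pointing at the smaller Llama 3 8B model from the list above:

llm_groq_8b = Groq(model="llama3-8b-8192", api_key=userdata.get("GROQ_API_KEY"))  # same wrapper, smaller model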

Step 4 – Running a query and timing the execution. 

%%timeit

response_groq = llm_groq.complete("Write the summary of The Lord of the Rings in less than 200 words")

print(response_groq)

Output
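
Note that %%timeit repeats the cell several times to average the measurement, which issues multiple API calls. If you prefer to time a single call, a simple sketch using time.perf_counter (not part of the original notebook) is:

import time

start = time.perf_counter()
response_groq = llm_groq.complete("Write the summary of The Lord of the Rings in less than 200 words")
elapsed = time.perf_counter() - start

print(response_groq)
print(f"Completed in {elapsed:.2f} seconds")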

Step 5 – Defining chat messages with system and user roles. The system message defines the role we want the model to take on, whereas the user message carries the query the model responds to.  

from llama_index.core.llms import ChatMessage

messages = [
   ChatMessage(
       role="system", content="You are a research assistant who helps in researching"
   ),
   ChatMessage(role="user", content="What is meant by the Lord of the Rings?"),
]

resp = llm_groq.chat(messages)
print(resp)  # display the assistant's reply

Output 
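
Since Groq’s main advantage is raw token throughput, streaming makes the speed visible as tokens arrive. A minimal sketch using LlamaIndex’s streaming chat interface, reusing the llm_groq and messages objects defined above:

# Stream the reply token by token instead of waiting for the full response
for chunk in llm_groq.stream_chat(messages):
    print(chunk.delta, end="", flush=True)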

Groq’s API combined with llama-index orchestration generates responses rapidly with good quality and efficiency. Here, we were able to query the Llama 3 70B model in less than a second without any additional Hugging Face integration or model downloads. Groq can also be integrated with the Vercel AI SDK and LangChain, giving users more flexibility in their generative AI application development and inferencing. 
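
For example, the LangChain integration is available through the langchain-groq package; the following is a minimal sketch (assuming the package is installed) rather than code from the original article:

# pip install langchain-groq
from langchain_groq import ChatGroq

chat = ChatGroq(model="llama3-70b-8192", api_key=userdata.get("GROQ_API_KEY"))
print(chat.invoke("What is meant by the Lord of the Rings?").content)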

Final Words

While established players offer powerful hardware, Groq’s LPU seems to be designed specifically for the unique needs of LLMs. This focus translates to potential benefits in speed, efficiency and accuracy when running LLMs. It is worth noting that Groq’s LPU technology is still relatively new; as it matures, its performance and ecosystem support should continue to improve.

References

  1. Link to Code
  2. Groq’s LPU Inference Engine Documentation
  3. Groq’s LPU Whitepaper
  4. Groq Benchmarks
  5. Independent Analysis of AI-Language Models and API Providers


Sachin Tripathi

Sachin Tripathi is the Manager of AI Research at AIM, with over a decade of experience in AI and Machine Learning. An expert in generative AI and large language models (LLMs), Sachin excels in education, delivering effective training programs. His expertise also includes programming, big data analytics, and cybersecurity. Known for simplifying complex concepts, Sachin is a leading figure in AI education and professional development.
