DeepEval is an open-source framework for testing and evaluating LLM performance. With a Pytest-like architecture, it provides developers with comprehensive tools for unit testing model outputs and offers 14+ research-backed metrics for different use cases. The framework supports advanced synthetic dataset generation and real-time production monitoring, helping teams optimize their LLM applications across various implementations. In this guide, we’ll create a practical test case to demonstrate DeepEval’s testing capabilities.
Table of Contents
- What is DeepEval?
- Practical Implementation Steps
- Other DeepEval Metrics
What is DeepEval?
DeepEval is a comprehensive open-source framework that performs LLM evaluation by bringing Pytest-like simplicity to testing language model outputs. Running entirely on local infrastructure, it offers an extensive suite of evaluation metrics, including G-Eval, hallucination detection, and RAGAS, powered by any LLM of choice.
The framework excels at both basic testing and advanced use cases, from optimizing RAG pipelines to red-teaming for safety vulnerabilities across 40+ scenarios. Whether you’re benchmarking models against popular standards like MMLU and HumanEval, transitioning between LLM providers, or running parallel evaluations in CI/CD environments, DeepEval streamlines the entire process into just a few lines of code.
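To give a sense of that workflow before diving in, here is a minimal sketch of a Pytest-style test. The test name and file are hypothetical, and the default metric judge assumes an OpenAI API key is configured; the test would live in a test file and be executed with deepeval test run.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_refund_answer():
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost."
    )
    # Fails the test if the metric score falls below its threshold
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])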
Practical Implementation Steps
Step 1: Install Required Packages
Begin by installing the necessary libraries for evaluation and LLM integration: deepeval, huggingface_hub, lm-format-enforcer, and transformers.
pip install -U deepeval
pip install huggingface_hub
pip install lm-format-enforcer
pip install transformers
Step 2: Authenticate with Hugging Face Hub
Authenticate your session to access Hugging Face models securely.
from huggingface_hub import login
login()
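If you prefer a non-interactive setup (for example, in CI), huggingface_hub also accepts a token argument; storing the token in an HF_TOKEN environment variable, as below, is an assumed convention rather than a requirement.
import os
from huggingface_hub import login

# Non-interactive alternative: read a Hugging Face access token from the environment
login(token=os.environ["HF_TOKEN"])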
Step 3: Define a Custom LLM Class
We implement a custom LLM class (CustomLlama) by subclassing DeepEval's DeepEvalBaseLLM and wrapping a pre-trained Llama model from Meta. The class uses lm-format-enforcer to enforce schema conformity, so generated outputs can be parsed into a given Pydantic JSON schema.
import json
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
from pydantic import BaseModel
from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.transformers import build_transformers_prefix_allowed_tokens_fn
from deepeval.models import DeepEvalBaseLLM
class CustomLlama(DeepEvalBaseLLM):
    def __init__(self):
        # Load the pre-trained Llama model and tokenizer from the Hugging Face Hub
        model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B", device_map="auto")
        tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
        self.model = model
        self.tokenizer = tokenizer

    def load_model(self):
        return self.model

    def generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        pipeline = transformers.pipeline(
            "text-generation",
            model=self.model,
            tokenizer=self.tokenizer,
            max_length=2500,
            do_sample=True,
            top_k=5,
        )
        # lm-format-enforcer restricts token sampling so the output conforms to the JSON schema
        parser = JsonSchemaParser(schema.schema())
        prefix_function = build_transformers_prefix_allowed_tokens_fn(pipeline.tokenizer, parser)
        output_dict = pipeline(prompt, prefix_allowed_tokens_fn=prefix_function)
        # Strip the prompt from the generated text, then parse the remaining JSON
        output = output_dict[0]["generated_text"][len(prompt):]
        json_result = json.loads(output)
        return schema(**json_result)

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        return self.generate(prompt, schema)

    def get_model_name(self):
        return "Llama"
Step 4: Initialize the Custom Model
Instantiate your custom LLM class.
custom_llm = CustomLlama()
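As an optional sanity check (not part of the original walkthrough), you can call generate directly with a small Pydantic schema of your own; the Joke model and prompt below are made-up examples.
from pydantic import BaseModel

class Joke(BaseModel):
    setup: str
    punchline: str

# The prefix function forces the model to emit JSON matching the Joke schema
result = custom_llm.generate(
    "Tell a short joke about shoes as JSON with 'setup' and 'punchline' fields.",
    schema=Joke
)
print(result.setup)
print(result.punchline)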
Step 5: Log in to DeepEval
Set up your DeepEval credentials. This step connects to the Confident AI platform for hosted reports; local evaluation works without it. The leading exclamation mark is notebook shell syntax, so in a terminal you would run the command directly.
!deepeval login
Step 6: Evaluate Using Custom Metrics
Use the AnswerRelevancyMetric to assess how relevant the model's response is to the user's input. Define your test case with the LLMTestCase class.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
actual_output = "We offer a 30-day full refund at no extra cost."
metric = AnswerRelevancyMetric(
    threshold=0.7,
    model=custom_llm,
    include_reason=True
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    expected_output="You're eligible for a 30 day refund at no extra cost.",
    actual_output=actual_output,
    context=["All customers are eligible for a 30 day full refund at no extra cost."],
    retrieval_context=["Only shoes can be refunded."]
)
metric.measure(test_case)
print(metric.score)
print(metric.reason)
Output:
1.0
The score is 1.00 because the shoes don't fit in the actual output
Step 7: Batch Evaluation
Evaluate multiple test cases together using EvaluationDataset.
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset
first_test_case = LLMTestCase(input="...", actual_output="...")
second_test_case = LLMTestCase(input="...", actual_output="...")
test_cases = [first_test_case, second_test_case]
dataset = EvaluationDataset(test_cases=test_cases)
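With the dataset assembled, the same metric from Step 6 can be applied to every test case at once. The call below is a sketch that assumes the placeholder inputs and outputs have been replaced with real values.
from deepeval import evaluate

# Runs every test case in the dataset against the supplied metrics
evaluate(test_cases=dataset.test_cases, metrics=[metric])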
Other DeepEval’s Metrics
DeepEval offers a variety of metrics to evaluate LLM outputs based on specific criteria, such as G-Eval, Summarization, Faithfulness, Answer Relevancy, Contextual Relevancy, and more. It includes both non-conversational metrics, which are used to assess LLMTestCases, and conversational metrics, like Conversation Completeness and Knowledge Retention, for evaluating dialogues.
Users can also create custom evaluation metrics tailored to their needs. These metrics are versatile, easy to implement, and output a score between 0 and 1, with a configurable threshold that determines whether a test case passes.
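For instance, a custom correctness criterion can be expressed with G-Eval and scored by the same CustomLlama judge, reusing the test_case from Step 6. The criteria wording below is an illustrative sketch, not an official DeepEval preset.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually consistent with the expected output.",
    evaluation_params=[LLMTestCaseParams.EXPECTED_OUTPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model=custom_llm,
    threshold=0.5
)
correctness_metric.measure(test_case)
print(correctness_metric.score)
print(correctness_metric.reason)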
Final Words
In conclusion, DeepEval simplifies and enhances the LLM testing process by providing a wide range of robust, reliable metrics, including both default and customizable options. With its seamless integration and versatile evaluation capabilities, it ensures consistent, high-quality results across various LLM tasks. Whether you’re evaluating individual outputs or conversational interactions, DeepEval offers the tools to measure and improve performance efficiently.