In the evolving landscape of large language models (LLMs), optimizing prompts and model behavior is often crucial but labor-intensive. Traditional approaches require breaking down problems manually, tuning prompts step by step, and iteratively refining synthetic data for fine-tuning, all of which can become chaotic when changes are introduced. Enter DSPy, a framework designed to make this process systematic and powerful by separating the program’s structure from its LLM parameters.
DSPy introduces LM-driven optimizers that automatically adjust prompts and weights based on defined metrics, creating reliable and adaptable LLM pipelines. Similar to how frameworks like PyTorch manage neural network parameters, DSPy offers modules and optimizers that eliminate manual prompt-tweaking, allowing developers to focus on building high-quality systems without wrestling with repetitive prompt engineering. In this blog, we’ll explore how DSPy transforms prompt and parameter optimization for LLMs, making it less cumbersome and more impactful.
Table of Contents:
- Understanding DSPy for Optimizing Language Model Workflows
- Overview of DSPy Workflow
- Hands-on Implementation of DSPy
Let’s start with understanding DSPy in depth.
Understanding DSPy for Optimizing Language Model Workflows
The concept behind DSPy addresses a core issue in developing robust language model (LM) pipelines: optimizing prompts and LM parameters separately from the programming logic. By introducing a “signature” system that encapsulates prompt best practices, DSPy aims to make prompt engineering both modular and systematic. Imagine a Retrieval-Augmented Generation (RAG) workflow, where prompt adjustments are typically made manually to improve accuracy. DSPy removes the burden of managing prompt engineering within code, letting developers focus on system logic while DSPy handles prompt refinements and adjustments automatically.
In essence, DSPy allows you to set high-level assertions and configurations, which it then optimizes automatically. For instance, in a binary question-answering task, rather than manually adjusting prompts to ensure binary responses, DSPy lets you assert that the answer should only be “yes” or “no.” If the LM deviates, DSPy backtracks and re-optimizes the prompt automatically to guide the model toward the desired output. Similarly, DSPy facilitates complex multi-step retrieval processes without the need for intricate prompt engineering, making it a powerful tool for building flexible LM pipelines.
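To make this concrete, here is a minimal sketch of assertion-based backtracking for the binary question-answering case. It assumes the `dspy.Suggest` API shipped with dspy-ai 2.4 (the version installed later in this post); the `BinaryQA` module name and the inline signature are illustrative, not from the original.

import dspy

class BinaryQA(dspy.Module):
    """Illustrative module that constrains answers to 'yes' or 'no'."""
    def __init__(self):
        super().__init__()
        self.answer = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        pred = self.answer(question=question)
        # If this check fails, DSPy backtracks: it retries the LM call with
        # the failure message injected into the prompt as feedback.
        dspy.Suggest(
            pred.answer.strip().lower() in {"yes", "no"},
            "Answer with exactly 'yes' or 'no'.",
        )
        return pred

In this version of DSPy, the suggestion takes effect once assertions are activated on the module, e.g. via `BinaryQA().activate_assertions()`.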
However, DSPy as a framework is still evolving. Although the concept is promising, the current implementation has notable limitations: it lacks production readiness, has a steep learning curve due to its heavy reliance on meta-programming, and suffers from inadequate documentation. While DSPy simplifies prompt optimization theoretically, the code complexity can be a significant hurdle for users.
Overview of DSPy Workflow
DSPy employs a logical, five-step workflow tailored for language tasks, streamlining the process from data preparation to evaluation.
Workflow of DSPy
DSPy’s workflow begins with the Dataset stage, where training data, such as blog posts, Q&A pairs, or other text, is prepared and structured. The next step, Signature, establishes an input-output contract, clearly defining the task’s expected inputs and outputs. The Module (Pipeline) stage follows, where DSPy combines operators to execute specific tasks, such as content generation or text analysis. In the Optimization phase, DSPy automatically fine-tunes prompts and parameters to improve pipeline performance. Finally, Evaluation assesses pipeline effectiveness using metrics like accuracy and quality. This structured approach suits a wide range of tasks, from content generation to automated content enhancement.
Hands-on Implementation of DSPy
Step 1: Setting up the environment
First, we’ll set up our development environment by installing necessary packages, configuring paths, and importing required libraries. This setup is specifically designed to work in Google Colab.
# Automatically reload modules in Colab
%load_ext autoreload
%autoreload 2

import sys
import os

# Clone the DSPy repository if not already cloned (specific to Google Colab)
repo_path = 'dspy'
if "google.colab" in sys.modules:
    !git -C $repo_path pull origin || git clone https://github.com/stanfordnlp/dspy $repo_path

# Add repo_path to the system path
if repo_path not in sys.path:
    sys.path.append(repo_path)

# Set the DSPy cache directory in Colab
os.environ["DSP_NOTEBOOK_CACHEDIR"] = os.path.join(repo_path, 'cache')

# Install the DSPy and OpenAI packages if not already installed
import pkg_resources
required_packages = {"dspy-ai", "openai"}
installed_packages = {pkg.key for pkg in pkg_resources.working_set}
if not required_packages.issubset(installed_packages):
    !pip install -U pip
    !pip install dspy-ai==2.4.17 openai==0.28.1

# Import DSPy
import dspy
Step 2: Configuring LM and RM
Now we’ll configure our Language Model (GPT-3.5-turbo) and Retrieval Model (ColBERTv2). These will form the backbone of our RAG system.
# Language model: GPT-3.5-turbo via the OpenAI API (requires an OpenAI API key)
turbo = dspy.OpenAI(model='gpt-3.5-turbo')

# Retrieval model: a hosted ColBERTv2 index over Wikipedia 2017 abstracts
colbertv2_wiki17_abstracts = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')

# Register these as the default LM and RM for all DSPy modules
dspy.settings.configure(lm=turbo, rm=colbertv2_wiki17_abstracts)
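Before moving on, it can help to confirm the remote retriever responds. This optional check is a sketch, assuming the hosted ColBERTv2 endpoint above is reachable and that `dspy.Retrieve` picks up the configured RM:

# Optional sanity check: the retriever should return Wikipedia abstracts.
retrieve = dspy.Retrieve(k=1)
print(retrieve("When was the first FIFA World Cup held?").passages[0][:200])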
Step 3: Loading the dataset
We’ll use the HotpotQA dataset for training and evaluation, loading a small subset: 20 examples for training and 50 for development. HotpotQA is a multi-hop question-answering dataset built from Wikipedia, where each example pairs a question with a short factoid answer.
from dspy.datasets import HotPotQA
# Load the dataset.
dataset = HotPotQA(train_seed=1, train_size=20, eval_seed=2023, dev_size=50, test_size=0)
# Tell DSPy that the 'question' field is the input. Any other fields are labels and/or metadata.
trainset = [x.with_inputs('question') for x in dataset.train]
devset = [x.with_inputs('question') for x in dataset.dev]
len(trainset), len(devset)
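To see what the loader produces, you can inspect a single training example; the `question` and `answer` field names come from the HotPotQA loader:

# Peek at one training example and its fields.
train_example = trainset[0]
print(f"Question: {train_example.question}")
print(f"Answer: {train_example.answer}")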
Step 4: Building Signatures
Signatures in DSPy define the interface for our LM calls. Here we’ll create a signature for generating answers that specifies input/output fields and their descriptions.
class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")
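A signature can be exercised on its own before it is wired into a pipeline. This small sketch, with a made-up context string, runs it through `dspy.Predict`:

# Use the signature with a bare predictor (no retrieval involved yet).
generate_answer = dspy.Predict(GenerateAnswer)
pred = generate_answer(context="Paris has been the capital of France since 987.",
                       question="What is the capital of France?")
print(pred.answer)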
Step 5: Building the Pipeline
Let’s create our RAG pipeline by combining retrieval and answer generation into a single module. This pipeline will retrieve relevant passages and generate answers.
class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)
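Even before any optimization, this module runs zero-shot; the sample question below is illustrative:

# The uncompiled pipeline can be run directly (no bootstrapped demos yet).
uncompiled_rag = RAG()
print(uncompiled_rag("Which country is the Eiffel Tower located in?").answer)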
Step 6: Optimizing the Pipeline
Using DSPy’s Teleprompter, we’ll optimize our pipeline by automatically learning effective prompts for its modules through few-shot learning.
from dspy.teleprompt import BootstrapFewShot

# Validation logic: check that the predicted answer is correct,
# and that the retrieved context actually contains that answer.
def validate_context_and_answer(example, pred, trace=None):
    answer_EM = dspy.evaluate.answer_exact_match(example, pred)
    answer_PM = dspy.evaluate.answer_passage_match(example, pred)
    return answer_EM and answer_PM
# Set up a basic teleprompter, which will compile our RAG program.
teleprompter = BootstrapFewShot(metric=validate_context_and_answer)
# Compile!
compiled_rag = teleprompter.compile(RAG(), trainset=trainset)
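Once compiled, you can peek at what was actually learned. This sketch uses DSPy's `named_predictors()` accessor to count the demonstrations bootstrapped for each predictor in the program:

# Inspect the few-shot demonstrations the teleprompter bootstrapped.
for name, parameter in compiled_rag.named_predictors():
    print(name, "->", len(parameter.demos), "demos")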
Step 7: Executing the Pipeline
Now we can test our optimized RAG system with a sample question to see how it performs in practice.
# Ask any question you like about this simple RAG program.
my_question = "What castle did David Gregory inherit?"
# Get the prediction. This contains `pred.context` and `pred.answer`.
pred = compiled_rag(my_question)
# Print the contexts and the answer.
print(f"Question: {my_question}")
print(f"Predicted Answer: {pred.answer}")
print(f"Retrieved Contexts (truncated): {[c[:200] + '...' for c in pred.context]}")
Step 8: Evaluating the Pipeline
Let’s evaluate our pipeline’s overall performance using exact match metrics on our development set.
from dspy.evaluate.evaluate import Evaluate

# Set up the `evaluate_on_hotpotqa` function. We'll use this many times below.
evaluate_on_hotpotqa = Evaluate(devset=devset, num_threads=1, display_progress=True, display_table=5)

# Evaluate the `compiled_rag` program with the `answer_exact_match` metric.
metric = dspy.evaluate.answer_exact_match
evaluate_on_hotpotqa(compiled_rag, metric=metric)
Step 9: Evaluating the Retrieval
Finally, we’ll specifically evaluate the retrieval component by checking if our system finds the gold (correct) passages for each question.
def gold_passages_retrieved(example, pred, trace=None):
    gold_titles = set(map(dspy.evaluate.normalize_text, example['gold_titles']))
    found_titles = set(map(dspy.evaluate.normalize_text, [c.split(' | ')[0] for c in pred.context]))
    return gold_titles.issubset(found_titles)
compiled_rag_retrieval_score = evaluate_on_hotpotqa(compiled_rag, metric=gold_passages_retrieved)
Output –
Follow the following format.
Context: may contain relevant facts
Question: ${question}
Reasoning: Let's think step by step in order to ${produce the answer}. We ...
Answer: often between 1 and 5 words
Context:
[1] «Rosario Dawson | Rosario Isabel Dawson (born May 9, 1979) is an American actress, producer, singer, comic book writer, and political activist. She made her film debut in the 1995 teen drama "Kids". Her subsequent film roles include "He Got Game", "Men in Black II", "25th Hour", "Rent", "Sin City", "Death Proof", "Seven Pounds", "", and "Top Five". Dawson has also provided voice-over work for Disney and DC.»
[2] «Sarai Gonzalez | Sarai Isaura Gonzalez (born 2005) is an American Latina child actress who made her professional debut at the age of 11 on the Spanish-language ""Soy Yo"" ("That's Me") music video by Bomba Estéreo. Cast as a "nerdy" tween with a "sassy" and "confident" attitude, her performance turned her into a "Latina icon" for "female empowerment, identity and self-worth". She subsequently appeared in two get out the vote videos for Latinos in advance of the 2016 United States elections.»
[3] «Gabriela (2001 film) | Gabriela is a 2001 American romance film, starring Seidy Lopez in the title role alongside Jaime Gomez as her admirer Mike. The film has been cited as an inspiration behind the Premiere Weekend Club, which supports Latino film-making.»
Question: Which American actress who made their film debut in the 1995 teen drama "Kids" was the co-founder of Voto Latino?
Reasoning: Let's think step by step in order to produce the answer. We know that the actress made her film debut in 1995 and co-founded Voto Latino.
Answer: Rosario Dawson
Final Words
In summary, DSPy presents a promising approach to optimizing complex language model workflows by separating prompt engineering from programming logic, bringing structure and automation to what is traditionally a manual, iterative process. While the framework is still evolving and has some limitations, its foundational ideas—like automatic prompt adjustments, module settings, and assertion-based backtracking—are innovative steps towards a more robust and scalable way to develop LM-based applications. DSPy simplifies the integration of LLMs into pipelines, making it easier to experiment, refine, and deploy, especially as the tool matures and becomes more production-ready.