DSPy

Auto-prompt-engineering with LLM

Xin Cheng
7 min read · Jun 26, 2024

When working with LLMs, a prompt engineer usually has a love-hate relationship with prompts. A perfect prompt can generate exactly what you want, but for complex tasks a perfect prompt rarely exists. You have to iterate and constantly refine your prompts, and the result is still fragile. As a developer, you want to automate things, not hand-craft natural language. If you can gather good samples, you can let DSPy write the prompt for you (at least initially). Below are articles to get you started.

Name origin: “Demonstrate-Search-Predict” (originally). Purpose: build robust applications that leverage the power of LLMs without getting bogged down in the complexities of prompt engineering and model fine-tuning.

  • Prompt Wrappers: very thin wrappers for prompt templating.
  • Application Development Libraries: LangChain, LlamaIndex
  • Generation Control Libraries: Guidance, LMQL, RELM, Outlines
  • Prompt Generation & Automation: DSPy

The article uses zero-shot as the simplest DSPy sample.

class ZeroShot(dspy.Module):
    """
    Provide answer to question
    """
    def __init__(self):
        super().__init__()
        self.prog = dspy.Predict("question -> answer")

    def forward(self, question):
        return self.prog(question="In the game of bridge, " + question)

This is a subclass of dspy.Module. The __init__ method sets up an LM call with a single dspy.Predict, using one input (question) and one output (answer) as the signature; the forward() method runs inference on the passed-in question.
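The string signature "question -> answer" is shorthand; the same module can be written with an explicit Signature class when you want to attach field descriptions. A minimal sketch (not from the article; the class name, docstring, and descriptions are illustrative):

class BasicQA(dspy.Signature):
    """Answer the question concisely."""
    question = dspy.InputField()
    answer = dspy.OutputField(desc="a short answer")

class ZeroShotTyped(dspy.Module):
    def __init__(self):
        super().__init__()
        # dspy.Predict accepts a Signature class as well as a signature string.
        self.prog = dspy.Predict(BasicQA)

    def forward(self, question):
        return self.prog(question=question)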

Set up the global LLM as Google Gemini

gemini = dspy.Google("models/gemini-1.0-pro",
                     api_key=api_key,
                     temperature=temperature)
dspy.settings.configure(lm=gemini, max_tokens=1024)

Inference

module = ZeroShot()
response = module("What is Stayman?")
print(response)
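The returned object is a dspy.Prediction whose fields match the signature's outputs, so the answer itself is available as an attribute:

# Access the output field declared in the "question -> answer" signature.
print(response.answer)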

RAG (retrievers support)

from chromadb.utils import embedding_functions
from dspy.retrieve.chromadb_rm import ChromadbRM

default_ef = embedding_functions.DefaultEmbeddingFunction()
bidding_rag = ChromadbRM(CHROMA_COLLECTION_NAME, CHROMADB_DIR, default_ef, k=3)
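A hedged sketch of how the retriever is typically wired in: register it as the global rm so that dspy.Retrieve (used inside modules) pulls passages from the ChromaDB collection. The query string here is illustrative.

# Register the retriever globally alongside the LM.
dspy.settings.configure(lm=gemini, rm=bidding_rag)

# dspy.Retrieve now queries the ChromaDB collection.
retrieve = dspy.Retrieve(k=3)
top_passages = retrieve("In the game of bridge, what is Stayman?").passages
for passage in top_passages:
    print(passage)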

A multi-module program just specifies the order in which modules are used inside the forward method. (Is there a way to define nonlinear orchestration? Since forward is ordinary Python, branching and loops work too; see the sketch below.)
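A sketch (not from the article) of a two-module program; because forward() is ordinary Python, the orchestration can branch rather than run strictly in sequence:

class TwoStep(dspy.Module):
    def __init__(self):
        super().__init__()
        self.draft = dspy.Predict("question -> answer")
        self.refine = dspy.ChainOfThought("question, draft_answer -> answer")

    def forward(self, question):
        draft = self.draft(question=question)
        # Nonlinear orchestration: only call the second module for very short drafts.
        if len(draft.answer.split()) < 5:
            return self.refine(question=question, draft_answer=draft.answer)
        return draft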

The article then shows how you can provide examples (question/answer pairs) and let DSPy tune the prompt.
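A hedged sketch of what such examples look like: each one is a dspy.Example whose input field is marked with .with_inputs() (the bridge Q/A pairs are illustrative only).

trainset = [
    dspy.Example(
        question="What is Stayman?",
        answer="A 2 clubs response to 1NT asking partner for a four-card major.",
    ).with_inputs("question"),
    dspy.Example(
        question="What is a takeout double?",
        answer="A double asking partner to bid their best unbid suit.",
    ).with_inputs("question"),
]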

DSPy is a framework developed by Stanford University that can automatically optimize LLM prompts and weights.

Modules: ReAct, ChainOfThought, ProgramOfThought, etc. are supported.
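These modules share the same signature-driven interface, so swapping strategies is usually a one-line change. A small sketch (the question is illustrative; ChainOfThought also exposes the intermediate reasoning it generated):

plain = dspy.Predict("question -> answer")
cot = dspy.ChainOfThought("question -> answer")

pred = cot(question="In the game of bridge, what is a weak two bid?")
print(pred.rationale)  # ChainOfThought's intermediate reasoning
print(pred.answer)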

As shown below, we define a RAG module with a retriever and a generator (all components are declared in the __init__ method, and inference is defined in the forward method).

class RagModule(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)

Optimizers: The optimizer is a component that automatically evaluates generated responses and retrieved context, and optimizes prompts and weights accordingly. Examples include BootstrapFewShot, BootstrapFewShotWithRandomSearch, and BayesianSignatureOptimizer; which one to use depends on how many task examples you have.

Below, we define an evaluation metric and a BootstrapFewShot optimizer, then compile the module.

from dspy.teleprompt import BootstrapFewShot

def validate_context_and_answer(example, pred, trace=None):
    answer_EM = dspy.evaluate.answer_exact_match(example, pred)
    answer_PM = dspy.evaluate.answer_passage_match(example, pred)
    return answer_EM and answer_PM

teleprompter = BootstrapFewShot(metric=validate_context_and_answer)

compiled_rag = teleprompter.compile(RagModule(), trainset=trainset)
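A hedged follow-up: the compiled program carries its bootstrapped demonstrations, and DSPy modules can be saved and reloaded so the optimization cost is paid once (the file name is illustrative).

# Persist the optimized program (prompts + selected demonstrations).
compiled_rag.save("compiled_rag.json")

# Later, rebuild the module and load the optimized state.
reloaded_rag = RagModule()
reloaded_rag.load("compiled_rag.json")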

Signature: Signatures are components for defining the structure of inputs and outputs in RAG applications.

class GenerateAnswer(dspy.Signature):
    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

vs. LangChain or LlamaIndex: DSPy shifts the construction of LM-based pipelines from manipulating prompts to something closer to programming.

Pipeline execution

my_question = "What castle did David Gregory inherit?"
pred = compiled_rag(my_question)

print(f"Question: {my_question}")
print(f"Predicted Answer: {pred.answer}")
print(f"Retrieved Contexts (truncated): {[c[:200] + '...' for c in pred.context]}")dfdfdf

Pipeline evaluation

from dspy.evaluate.evaluate import Evaluate

# Set up the evaluator on the dev set; it can be reused with different metrics.
evaluate_on_hotpotqa = Evaluate(devset=devset, num_threads=1, display_progress=False, display_table=5)


# Score the compiled RAG program with the exact-match answer metric.
metric = dspy.evaluate.answer_exact_match
evaluate_on_hotpotqa(compiled_rag, metric=metric)

Retriever evaluation

def gold_passages_retrieved(example, pred, trace=None):
    gold_titles = set(map(dspy.evaluate.normalize_text, example['gold_titles']))
    found_titles = set(map(dspy.evaluate.normalize_text, [c.split(' | ')[0] for c in pred.context]))

    return gold_titles.issubset(found_titles)

compiled_rag_retrieval_score = evaluate_on_hotpotqa(compiled_rag, metric=gold_passages_retrieved)

Disadvantages are:

  1. Supports English only. A selling point of DSPy is that you don’t have to write prompts, but the instructions it generates behind the scenes are in English. Sentiment analysis worked in Spanish and Chinese, but it is unclear whether it can handle other complex tasks in those languages.
  2. Does not support complex tasks well. Normally, when using GPT, you would write a more detailed prompt, but in DSPy you cannot edit the prompt directly. So if you don’t have good input/output examples, you cannot use DSPy, since it cannot guess what you want.

Prompt engineering automation

  1. Bootstrapping: Starting with an initial seed prompt, DSPy iteratively refines it based on the LM’s outputs and user-provided examples/assertions
  2. Prompt Chaining: Breaking down complex tasks into a sequence of simpler sub-prompts
  3. Prompt Ensembling: Combining multiple prompt variations to improve performance (a rough sketch follows this list)
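A rough illustration of the ensembling idea in plain Python (not DSPy's built-in Ensemble teleprompter): run several program variants and majority-vote their answers.

from collections import Counter

def ensemble_answer(programs, question):
    # Collect one answer per compiled program variant and keep the most common.
    answers = [program(question=question).answer for program in programs]
    return Counter(answers).most_common(1)[0][0]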

It is suitable for applications where you can easily gather sample inputs and outputs.

It also suits cases where you share a very small amount of data and have DSPy generate the initial prompts, prompt templates, and prompting strategies.

Auto-prompting is analogous to providing data and letting machine learning write the program for you.

A few more details on optimizers

BootstrapFewShot optimizer: Uses a teacher LM to select the best demonstrations to include in the prompt from a larger set of demonstrations provided by the user.

COPRO optimizer: Finds the best-performing instruction for the model. Starts with a set of initial instructions, generates variations of those instructions, evaluates each variation and finally returns the best performing instruction.

MIPRO optimizer: Finds the best-performing combination of instruction and demonstrations. Working similarly to the COPRO optimizer, it returns the best-performing combination of instructions and examples.
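A rough, hedged sketch of using COPRO; the constructor and compile arguments shown here (breadth, depth, eval_kwargs) follow DSPy's teleprompter conventions but may differ between versions, so treat them as assumptions.

from dspy.teleprompt import COPRO

# COPRO searches over candidate instructions for the module's signature.
copro = COPRO(metric=validate_context_and_answer, breadth=5, depth=3)

compiled_with_instructions = copro.compile(
    RagModule(),
    trainset=trainset,
    eval_kwargs=dict(num_threads=1, display_progress=False),
)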

DSPy without CoT achieved a +12% improvement over the manual prompt, at a total cost of less than $0.50.

Prompt engineering challenges

  1. Manual prompt engineering does not generalize well: LLMs are highly sensitive to how they are prompted for each task
  2. Lack of a framework for conducting testing

Inspect generated prompt

# lm is dspy.OpenAI, dspy.Databricks, etc.
lm.inspect_history(n=1)

Integration with other products

Set the retriever model to the Qdrant vector search engine

from dspy.retrieve.qdrant_rm import QdrantRM
qdrant_retriever_model = QdrantRM("customer_service", client, k=10)
dspy.settings.configure(lm=llm, rm=qdrant_retriever_model)

Ollama is supported through dspy.OllamaLocal.
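A minimal sketch of the Ollama path (the model name is an assumption; use whatever model your local Ollama server hosts):

ollama_llm = dspy.OllamaLocal(model="llama3")
dspy.settings.configure(lm=ollama_llm, rm=qdrant_retriever_model)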

The article shows a way to define a custom retriever inside a DSPy module.

Define the Haystack retriever

from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.generators import OpenAIGenerator
from haystack.components.builders import PromptBuilder
from haystack import Pipeline


retriever = InMemoryBM25Retriever(document_store, top_k=3)

A retrieve method wraps the Haystack retriever (the effect is the same as assigning self.retrieve in the __init__ method, but since it takes more lines, it is packaged as a method).

class RAG(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    # this makes it possible to use the Haystack retriever
    def retrieve(self, question):
        results = retriever.run(query=question)
        passages = [res.content for res in results['documents']]
        return dspy.Prediction(passages=passages)

    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)

Integration with LlamaIndex

Quite a loose integration: use LlamaIndex's VectorStoreIndex (an abstraction over vector databases) as the retriever in a DSPy module, as sketched below.
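A hedged sketch of that pattern (not the article's exact code; the data directory and top-k are illustrative): build a LlamaIndex retriever and call it inside the DSPy module's forward.

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
li_retriever = index.as_retriever(similarity_top_k=3)

class LlamaIndexRAG(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        # Use the LlamaIndex retriever instead of dspy.Retrieve.
        nodes = li_retriever.retrieve(question)
        context = [node.get_content() for node in nodes]
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)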

Integration with Langchain

  1. Use DSPy as a retriever in LangChain
  2. Use a LangChain chain inside a DSPy module, as in the snippet below

# LangChain LCEL building blocks (retrieve, prompt, and llm are assumed to be defined as usual in LCEL).
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# From DSPy, import the modules that know how to interact with LangChain LCEL.
from dspy.predict.langchain import LangChainPredict, LangChainModule

# This is how to wrap it so it behaves like a DSPy program.
# Just replace every pattern like `prompt | llm` with `LangChainPredict(prompt, llm)`.
zeroshot_chain = RunnablePassthrough.assign(context=retrieve) | LangChainPredict(prompt, llm) | StrOutputParser()
zeroshot_chain = LangChainModule(zeroshot_chain)  # then wrap the chain in a DSPy module.

Appendix

DSPy could have better namespace organization, e.g., placing all LLMs under a specific Python namespace.

Xin Cheng

Multi/Hybrid-cloud, Kubernetes, cloud-native, big data, machine learning, IoT developer/architect, 3x Azure-certified, 3x AWS-certified, 2x GCP-certified