A Practical Guide to Tracing and Evaluating LLMs Using LangSmith

Explore LangSmith, a platform for enhancing LLM transparency and performance through comprehensive tracing and evaluation tools.

Large Language Models have demonstrated exceptional abilities in natural language processing, but their internal operations remain opaque, creating problems of transparency and model explainability. This opacity can make it difficult for users to understand the decision-making process or ensure optimal performance when building applications with LLMs. This article explores LangSmith, an all-in-one platform from LangChain's ecosystem for tracing and evaluating LLMs. LangSmith provides a comprehensive toolbox for assessing LLMs' workings and overall performance against user-specific datasets, evaluation methods, and tasks.

Table of Contents

  1. Understanding Tracing and Evaluation in LLMs
  2. Logging LLM activities using LangSmith
  3. Types of Trace Data
  4. Significance of LLM Evaluation
  5. Evaluation Chains in LangSmith
  6. LangSmith for Practical LLM Tracing and Evaluation

Understanding Tracing and Evaluation in LLMs

Tracing is a method for understanding an LLM application's behaviour by capturing technical parameters such as the number of requests, response time, token usage and costs, error rates, and more. Tracing plays a vital role in LLM explainability and observability by tracking and presenting these parameters for user understanding.

LLM Tracing using LangSmith

Evaluation in LLMs, on the other hand, assesses performance against specific criteria, tasks, or datasets. This process uses key benchmarks such as TruthfulQA, MMLU, and the GLUE benchmark, along with metrics based on performance, user feedback, response quality, costs, and more.

Image credit – LLMs and Explainable AI

Logging LLM activities using LangSmith

LangSmith offers a comprehensive platform for logging LLM activities, which is referred to as Tracing. Tracing helps users understand and observe the LLM functionality. 

LangSmith uses the LANGCHAIN_TRACING_V2 environment variable to integrate with LLM applications built on LangChain. The @traceable decorator and the traceable function (Python) from LangSmith can be used to wrap LLM functions and activate tracing, as shown in the sketch below.
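Here is a minimal sketch of the decorator approach, assuming the langsmith package is installed and LANGCHAIN_API_KEY is already set; the function names and the placeholder LLM call are illustrative, not part of the article's later example:

from langsmith import traceable
import os

# Enabling tracing sends every decorated call to LangSmith as a run
os.environ["LANGCHAIN_TRACING_V2"] = "true"

@traceable(name="format_prompt")  # each call is logged with its inputs and outputs
def format_prompt(question: str, context: str) -> str:
    return f"Question: {question}\nContext: {context}"

@traceable(name="answer_question")  # nested traceable calls appear as child runs
def answer_question(question: str, context: str) -> str:
    prompt = format_prompt(question, context)
    # ... call your LLM here; LangChain calls inside are traced automatically ...
    return prompt

answer_question("What is LangSmith?", "LangSmith is a tracing and evaluation platform.")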

Types of Trace Data

LangSmith can capture different types of trace data, which are listed below: 

  1. Inputs and Outputs 

IO data consists of prompts, queries, and instructions passed into the LLM and the output responses generated, making it easy for users to analyse the LLM’s interpretation. 

  2. Model Configuration Details

LangSmith can capture information about LLM models being used along with their configuration settings, enabling the user to understand how the model and its settings influence the LLM’s behaviour. 

  3. Latency

Latency measures the time the LLM takes to process information and generate a response. This trace data can help users monitor LLM performance and identify bottlenecks. 

  4. Token Usage and Costs

Trace data also includes token usage and cost analysis, which are essential for monitoring LLM costs and budgeting efficiently.

  5. Execution Flow Data and Nested Traces

LangSmith logs function call sequences, sub-tasks, and chained operations, providing a detailed hierarchical view of the LLM's decision-making process.

  6. Timestamps

Each trace entry in LangSmith is assigned a timestamp to enable users to analyse the LLM based on the temporal sequence of events. 

  7. Error Messages

LangSmith logs error messages encountered during LLM execution to assist users in identifying causes and solutions within the LLM application. 
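To illustrate how the trace fields listed above can be inspected programmatically, here is a minimal sketch using the LangSmith SDK's Client. The project name "default" is an assumption, and the exact Run field names may vary slightly across SDK versions:

from itertools import islice
from langsmith import Client

client = Client()  # reads LANGCHAIN_API_KEY from the environment

# Print a few trace fields for recent runs in a project
for run in islice(client.list_runs(project_name="default"), 5):
    latency = (run.end_time - run.start_time).total_seconds() if run.end_time else None
    print(run.name, run.run_type, run.start_time, latency, run.total_tokens, run.error)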

Significance of LLM Evaluation

LLM evaluation is one of the most important steps in ensuring smooth LLM functionality and reliability. Without a proper evaluation strategy, an LLM application might produce biased outputs, misinterpretations, or hallucinations, which decrease the overall reliability of LLMs.

Evaluation in LLMs can be broadly classified into different tiers based on performance, fairness and bias, explainability and interpretability, robustness, and scalability and efficiency.

  1. Performance metrics include accuracy, relevance, and specificity, which measure how often the LLM's output is correct, aligned with the prompt, and focused. 
  2. Fairness and bias metrics assess whether the LLM exhibits favoured responses based on factors such as race, gender, or social background. 
  3. Explainability and interpretability assess whether the LLM's output is trustworthy and whether users can understand how it was generated. 
  4. Robustness and scalability correspond to how efficiently the LLM handles large datasets and whether it can perform well as the data load or number of user requests increases. 

Image Credit – RAG Evaluations

Evaluation Chains in LangSmith

This section discusses the different evaluation chains available in LangSmith for LLM evaluation: 

Evaluation Chain – Description
Correctness – Measures the factual accuracy of the LLM's outputs.
Conciseness – Evaluates how brief and to the point the LLM's response is.
Relevance – Assesses how well the LLM's response aligns with the prompt or question.
Coherence – Evaluates the logical flow and internal consistency of the LLM's generated text.
Harmfulness – Identifies the potential for the LLM's outputs to cause harm (physical or emotional).
Maliciousness – Detects if the LLM's outputs intend to cause harm or mislead.
Controversiality – Evaluates the potential for the LLM's outputs to spark disagreement or offence.
Misogyny – Identifies outputs containing gender bias against women.
Criminality – Detects if the LLM's outputs promote criminal activity.
Insensitivity – Evaluates the LLM's outputs for lack of sensitivity to specific topics or groups.
Depth – Assesses the comprehensiveness and richness of the LLM's response.
Creativity – Measures the LLM's ability to generate original and unexpected ideas.
Detail – Evaluates the level of detail and elaboration provided in the LLM's response.
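Most of these criteria map onto LangChain's built-in criteria evaluators, which LangSmith can use as LLM-as-judge evaluators. Below is a minimal sketch using load_evaluator from langchain.evaluation, assuming langchain and langchain_openai are installed and OPENAI_API_KEY is set; the sample strings are illustrative:

from langchain.evaluation import load_evaluator
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Reference-free criterion (e.g. conciseness): judges the prediction against the input alone
conciseness = load_evaluator("criteria", criteria="conciseness", llm=llm)
print(conciseness.evaluate_strings(
    prediction="A Game of Thrones is the first novel in A Song of Ice and Fire.",
    input="What is the Game of Thrones?",
))

# Reference-based criterion (e.g. correctness): also needs a ground-truth answer
correctness = load_evaluator("labeled_criteria", criteria="correctness", llm=llm)
print(correctness.evaluate_strings(
    prediction="A Game of Thrones is the first novel in A Song of Ice and Fire.",
    input="What is the Game of Thrones?",
    reference="A Game of Thrones is the first novel in A Song of Ice and Fire by George R. R. Martin.",
))

Each result is a dictionary with the judge's reasoning, a Y/N value, and a 0/1 score.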

LangSmith for Practical LLM Tracing and Evaluation

Requirements – An OpenAI API key and a LangSmith API key are needed for this practical guide. Visit https://smith.langchain.com/ and create an API key. 

The home page of LangSmith shows an overview of projects, datasets, and prompts. 

Visit the settings page to create an API key. 

Step 1: Install the necessary libraries for working with LangSmith

!pip install langsmith langchain langchain_openai langchain_core python-dotenv

Step 2: Import ChatOpenAI for chat completion, ChatPromptTemplate for crafting prompt templates, StrOutputParser for parsing the generated response, and the LangSmith Client for connecting with the web UI and checking the results. 

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langsmith import Client
from google.colab import userdata
import os

Step 3: Set the API keys for connecting with OpenAI and LangChain. 

os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_APIKEY")
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = userdata.get("LANGCHAIN_API_KEY")

Step 4: Craft a prompt template for understanding user questions and responding based on the provided context, configure OpenAI's GPT-3.5 Turbo model, and define the string output parser. 

prompt = ChatPromptTemplate.from_messages(
   [
       ("system", "You are an assistant who responds to user requests using context"),
       ("user","Question:{question}\nContext:{context}")
   ]
)

model = ChatOpenAI(model="gpt-3.5-turbo")
output_parser = StrOutputParser()

Step 5: Create a chain by connecting the 3 main components – prompt, model and output_parser. 

chain = prompt | model | output_parser

Step 6: Provide the list of questions and context for response generation. 

question = ["What is the Game of Thrones", "Who are the main characters in the Game of Thrones", "Who dies in the Game of Thrones"]

context = """

Step 7: Execute the chain for creating a trace on LangSmith’s web UI. 

for q in question:
    print(chain.invoke({"question": q, "context": context}))
    print("\n")

Output:

Step 8: Create a custom dataset for LLM Evaluation based on RAG parameters. 

correct_answers = [
   "A Game of Thrones is the first novel in A Song of Ice and Fire, a series of fantasy novels by American author George R. R. Martin.",
   "Here are the main character names mentioned as per the context - Ned Stark, Robert Baratheon, Jon Arryn,  \
   Cersei Lannister, Jaime Lannister, Bran Stark, Catelyn Stark, Jon Snow, Tyrion Lannister, Daenerys Targaryen, \
   Viserys Targaryen, Khal Drogo, Ser Jorah Mormont, Joffrey Baratheon, Sansa Stark, Arya Stark, Petyr Baelish, \
   Lysa Arryn, Stannis Baratheon, Tywin Lannister, Robb Stark",
   "Based on the context, these characters die - Jon Arryn, Viserys Targaryen, Robert Baratheon, Ned Stark",
]

client = Client()

inputs = [
   (question[0], correct_answers[0]),
   (question[1], correct_answers[1]),
   (question[2], correct_answers[2])
]
dataset_name = "LS_EVAL"

dataset = client.create_dataset(
   dataset_name=dataset_name, description="Questions and answers for evaluating",
)

for input_prompt, output_answer in inputs:
   client.create_example(
       inputs={"question": input_prompt},
       outputs={"answer": output_answer},
       dataset_id=dataset.id,
   )

Step 9: Check LangSmith’s web UI (Datasets section) 

You’ll find the dataset created based on the input and output examples as given in the Python code. 

Click on the example to view it in detail. 
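You can also verify the dataset programmatically with the LangSmith Client. A quick sketch reusing the LS_EVAL dataset created in Step 8:

# Verify the dataset and its examples without leaving the notebook
dataset = client.read_dataset(dataset_name="LS_EVAL")
print(dataset.id, dataset.name)

for example in client.list_examples(dataset_name="LS_EVAL"):
    print(example.inputs["question"], "->", example.outputs["answer"][:60])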

Step 10: Go to the Projects section to check the trace. 

You’ll find different trace data and their values in the trace runs. 

Check one individual trace data to list the execution steps along with their traced elements. 

You can also check your input, prompt and output in the ChatOpenAI trace. 

Step 11: Visit the Datasets section and create an Evaluator. This evaluator will be used in our prompt for evaluation purposes.

Select ChatOpenAI as the provider and GPT-3.5 Turbo as the model. Configure the temperature value as desired and select the Create a prompt from scratch option. 

You will be able to see the prompt template that the evaluator will operate on. Here, we are using Input, Output and Reference context for evaluations. 

Step 12: Once the evaluator is created, visit the Prompts section and create a new prompt execution environment.

Click on Prompt to start the playground.

Select your dataset (LS_EVAL) in the prompt playground, configure the OpenAI model as desired, and click Start.

The execution begins and you can see the generated outputs. 

Step 13: Visit the experiments page to check your evaluator run and its output in detail. 

Step 14: You can see the correctness score (1 being correct and 0 being incorrect), which signifies whether the generated output matches the correct output in our evaluation dataset. 
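The same kind of experiment can also be launched from code rather than the web UI. Below is a minimal sketch using evaluate from the LangSmith SDK with a simple custom evaluator; the word-overlap heuristic and the experiment prefix are illustrative assumptions, not the UI-based LLM evaluator configured above, and the sketch reuses the chain, context, and LS_EVAL dataset from the earlier steps:

from langsmith.evaluation import evaluate

def correctness(run, example):
    # Illustrative heuristic: fraction of reference words that appear in the prediction
    predicted = run.outputs["output"].lower()
    reference_words = example.outputs["answer"].lower().split()
    overlap = sum(word in predicted for word in reference_words) / len(reference_words)
    return {"key": "correctness", "score": overlap}

def target(inputs: dict) -> dict:
    # Reuses the chain and context defined in Steps 4-6
    return {"output": chain.invoke({"question": inputs["question"], "context": context})}

results = evaluate(
    target,
    data="LS_EVAL",                      # the dataset created in Step 8
    evaluators=[correctness],
    experiment_prefix="ls-eval-sketch",  # illustrative experiment name
)

The resulting experiment appears on the same Experiments page, alongside runs launched from the playground.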

This concludes this hands-on guide to using LangSmith for tracing and evaluating LLMs. LangSmith's platform provides all the necessary tools for observing and assessing our LLM application. Through the traces, we were able to understand the technical parameters and execution flows, and the evaluator helped us determine whether our LLM was generating correct results. 

Final Words

LangSmith is an excellent choice for implementing observability and explainability in large language model applications. Using its tracing and evaluation tools, datasets, and prompt playground, users can understand, assess, and improve their LLM operations easily and efficiently. Apart from LangSmith, there are other exceptional tools for LLM tracing and evaluation, such as Arize's Phoenix, Microsoft's Prompt Flow, OpenTelemetry, and Langfuse, which are worth exploring. 

Reference

  1. Link to the above code
  2. Usable XAI: 10 Strategies Towards Exploiting Explainability in the LLM Era
  3. Explainability for Large Language Models: A Survey
  4. LangSmith Documentation
  5. Defining and Understanding LLM Evaluation Metrics (Microsoft)


Sachin Tripathi

Sachin Tripathi is the Manager of AI Research at AIM, with over a decade of experience in AI and Machine Learning. An expert in generative AI and large language models (LLMs), Sachin excels in education, delivering effective training programs. His expertise also includes programming, big data analytics, and cybersecurity. Known for simplifying complex concepts, Sachin is a leading figure in AI education and professional development.
