
A Practical Guide to Tracing and Evaluating LLMs Using LangSmith

Explore LangSmith, a platform for enhancing LLM transparency and performance through comprehensive tracing and evaluation tools.

Large Language Models have demonstrated exceptional abilities in natural language processing, but their internal operations remain unclear, which leads to problems due to a lack of transparency and model explainability. This lack of transparency can make it difficult for users to understand the decision-making process or ensure optimal performance when developing applications through LLMs. This article explores LangSmith, an all-in-one platform based on LangChain’s ecosystem for Tracing and Evaluating LLMs. LangSmith provides a comprehensive toolbox for assessing and evaluating LLMs’ workings and overall performance based on user-specific datasets, evaluation methods, and tasks. 

Table of Contents

  1. Understanding Tracing and Evaluation in LLMs
  2. Logging LLM activities using LangSmith
  3. Types of Trace Data
  4. Significance of LLM Evaluation
  5. Evaluation Chains in LangSmith
  6. LangSmith for Practical LLM Tracing and Evaluation

Understanding Tracing and Evaluation in LLMs

Tracing is a method for understanding an LLM application’s behaviour by capturing technical parameters such as the number of requests, response time, token usage and costs, error rates, etc. Tracing plays a vital role in LLM explainability and observability by tracking and presenting these parameters for user understanding. 

LLM Tracing using LangSmith

Evaluation in LLMs, on the other hand, assesses performance against specific criteria, tasks, or datasets. This process uses key benchmarks such as TruthfulQA, MMLU, and the GLUE benchmark, along with metrics based on performance, user feedback, response quality, costs, etc. 

Image credit – LLMs and Explainable AI

Logging LLM activities using LangSmith

LangSmith offers a comprehensive platform for logging LLM activities, which is referred to as Tracing. Tracing helps users understand and observe the LLM functionality. 

LangSmith uses LANGCHAIN_TRACING_V2 environment variables for integrating with LLM applications built on LangChain. @traceable decorator and traceable function (python) from LangSmith can be used to wrap LLM functions and activate tracing. 
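As a minimal sketch of this, a plain Python function can be wrapped with the @traceable decorator; the function name and prompt format here are illustrative only, and a no-op fallback is included so the sketch runs even where the langsmith package is not installed. With LANGCHAIN_TRACING_V2=true and a LANGCHAIN_API_KEY set, each call appears as a run in the LangSmith UI.

```python
try:
    from langsmith import traceable
except ImportError:
    # No-op stand-in so the example still runs without the langsmith package
    def traceable(func=None, **kwargs):
        def wrap(f):
            return f
        return wrap(func) if callable(func) else wrap

@traceable(name="format_prompt")  # hypothetical run name shown in the trace tree
def format_prompt(question: str) -> str:
    return f"Answer concisely: {question}"

print(format_prompt("What is tracing?"))  # → Answer concisely: What is tracing?
```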

Types of Trace Data

LangSmith can capture different types of trace data, which are listed below: 

  1. Inputs and Outputs 

IO data consists of prompts, queries, and instructions passed into the LLM and the output responses generated, making it easy for users to analyse the LLM’s interpretation. 

  2. Model Configuration Details

LangSmith can capture information about LLM models being used along with their configuration settings, enabling the user to understand how the model and its settings influence the LLM’s behaviour. 

  3. Latency

Latency measures the time the LLM takes to process information and generate a response. This trace data can help users monitor LLM performance and identify bottlenecks. 

  4. Token Usage and Costs

Trace data also includes token usage and cost information, which is essential for monitoring LLM spend and keeping budgets efficient. 

  5. Execution Flow Data and Nested Traces 

LangSmith logs the function call sequences, sub-tasks and chained operations, providing a detailed hierarchical view of the LLM’s decision-making process. 

  6. Timestamps

Each trace entry in LangSmith is assigned a timestamp to enable users to analyse the LLM based on the temporal sequence of events. 

  7. Error Messages 

LangSmith logs error messages encountered during LLM execution to assist users in identifying causes and solutions within the LLM application. 
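As a concrete illustration of how the token-usage trace data above supports budgeting, a back-of-the-envelope cost estimate can be derived from the token counts a trace records. The per-1K-token rates below are hypothetical placeholders, not actual provider pricing:

```python
# Rough cost estimate from traced token counts. The default rates are
# hypothetical placeholders; substitute your provider's actual pricing.
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  input_per_1k: float = 0.0005,
                  output_per_1k: float = 0.0015) -> float:
    """Return the estimated request cost in dollars."""
    return (prompt_tokens / 1000) * input_per_1k + (completion_tokens / 1000) * output_per_1k

# 1200 input tokens + 400 output tokens at the placeholder rates
print(estimate_cost(1200, 400))  # → 0.0012
```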

Significance of LLM Evaluation

LLM evaluation is one of the most important steps in ensuring smooth LLM functionality and reliability. Without a proper evaluation strategy, the LLM application might respond with biased outputs or exhibit misinterpretations and hallucinations, which decrease the overall reliability of LLMs. 

Evaluation in LLMs can be broadly classified into different tiers based on performance, fairness and bias, explainability, interpretability, robustness, scalability and efficiency. 

  1. Performance metrics include accuracy, relevance, and specificity, which measure how often the LLM’s output is correct, aligned with the prompt, and focused. 
  2. Fairness and bias metrics assess whether the LLM exhibits favoured responses based on factors such as race, gender or social background. 
  3. Explainability and interpretability assess whether the LLM’s output is trustworthy and whether users can understand how it was generated. 
  4. Robustness and scalability correspond to how well the LLM handles large datasets and whether it continues to perform well as data load or user requests increase. 

Image Credit – RAG Evaluations

Evaluation Chains in LangSmith

This section discusses the different evaluation chains under LangSmith utilised in LLM evaluation: 

Evaluation Chain – Description

Correctness – Measures the factual accuracy of the LLM’s outputs.
Conciseness – Evaluates how brief and to the point the LLM’s response is.
Relevance – Assesses how well the LLM’s response aligns with the prompt or question.
Coherence – Evaluates the logical flow and internal consistency of the LLM’s generated text.
Harmfulness – Identifies potential for the LLM’s outputs to cause harm (physical or emotional).
Maliciousness – Detects if the LLM’s outputs intend to cause harm or mislead.
Controversiality – Evaluates the potential for the LLM’s outputs to spark disagreement or offence.
Misogyny – Identifies outputs containing gender bias against women.
Criminality – Detects if the LLM’s outputs promote criminal activity.
Insensitivity – Evaluates the LLM’s outputs for lack of sensitivity to specific topics or groups.
Depth – Assesses the comprehensiveness and richness of the LLM’s response.
Creativity – Measures the LLM’s ability to generate original and unexpected ideas.
Detail – Evaluates the level of detail and elaboration provided in the LLM’s response.
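These chains grade responses with an LLM judge under the hood. As a rough intuition for what a criterion such as Conciseness measures, here is a hypothetical length-based stand-in, not LangSmith's actual implementation:

```python
# Hypothetical stand-in for a conciseness criterion; LangSmith's real
# evaluation chain asks an LLM judge to grade the response instead.
def conciseness_score(response: str, word_budget: int = 30) -> float:
    """Score 1.0 for responses within the word budget, decaying as they grow longer."""
    words = len(response.split())
    return 1.0 if words <= word_budget else word_budget / words

print(conciseness_score("Ned Stark is executed."))      # → 1.0
print(conciseness_score("word " * 60, word_budget=30))  # 60 words → 0.5
```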

LangSmith for Practical LLM Tracing and Evaluation

Requirements – An OpenAI API key and a LangSmith API key are needed for this practical guide. Visit the link and create an API key. 

The home page of LangSmith shows an overview of projects, datasets, and prompts. 

Visit the settings page to create an API key. 

Step 1: Install the necessary libraries for working with LangSmith

!pip install langsmith langchain langchain_openai langchain_core python-dotenv

Step 2: Import ChatOpenAI for chat completion, ChatPromptTemplate for crafting prompt templates, StrOutputParser for parsing the generated response, and LangSmith for connecting with web UI and checking the results. 

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langsmith import Client
from google.colab import userdata
import os

Step 3: Pass the access API Keys for connecting with OpenAI and LangChain. 

os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_APIKEY")
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = userdata.get("LANGCHAIN_API_KEY")

Step 4: Craft a prompt template for understanding user questions and responding based on the provided context, configure OpenAI’s GPT-3.5-Turbo model, and define the string output parser. 

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an assistant who responds to user requests using context"),
    # assumes {context} and {question} placeholders filled in at invoke time
    ("human", "Context: {context}\n\nQuestion: {question}"),
])

model = ChatOpenAI(model="gpt-3.5-turbo")
output_parser = StrOutputParser()

Step 5: Create a chain by connecting the 3 main components – prompt, model and output_parser. 

chain = prompt | model | output_parser

Step 6: Provide the list of questions and context for response generation. 

question = ["What is the Game of Thrones", "Who are the main characters in the Game of Thrones", "Who dies in the Game of Thrones"]

context = """
[The context passage about A Game of Thrones is omitted here.]
"""

Step 7: Execute the chain for creating a trace on LangSmith’s web UI. 

for q in question:
   response = chain.invoke({"context": context, "question": q})
   print(response)

Step 8: Create a custom dataset for LLM Evaluation based on RAG parameters. 

correct_answers = [
   "A Game of Thrones is the first novel in A Song of Ice and Fire, a series of fantasy novels by American author George R. R. Martin.",
   "Here are the main character names mentioned as per the context - Ned Stark, Robert Baratheon, Jon Arryn, \
   Cersei Lannister, Jaime Lannister, Bran Stark, Catelyn Stark, Jon Snow, Tyrion Lannister, Daenerys Targaryen, \
   Viserys Targaryen, Khal Drogo, Ser Jorah Mormont, Joffrey Baratheon, Sansa Stark, Arya Stark, Petyr Baelish, \
   Lysa Arryn, Stannis Baratheon, Tywin Lannister, Robb Stark",
   "Based on the context, these characters die - Jon Arryn, Viserys Targaryen, Robert Baratheon, Ned Stark",
]

client = Client()

inputs = [
   (question[0], correct_answers[0]),
   (question[1], correct_answers[1]),
   (question[2], correct_answers[2]),
]

dataset_name = "LS_EVAL"

dataset = client.create_dataset(
   dataset_name=dataset_name,
   description="Questions and answers for evaluating",
)

for input_prompt, output_answer in inputs:
   client.create_example(
       inputs={"question": input_prompt},
       outputs={"answer": output_answer},
       dataset_id=dataset.id,
   )

Step 9: Check LangSmith’s web UI (Datasets section) 

You’ll find the dataset created based on the input and output examples as given in the Python code. 

Click on the example to view it in detail. 

Step 10: Go to the Projects section to check the trace. 

You’ll find different trace data and their values in the trace runs. 

Check one individual trace data to list the execution steps along with their traced elements. 

You can also check your input, prompt and output in the ChatOpenAI trace. 

Step 11: Visit the Datasets section and create an Evaluator. This evaluator will be used in our prompt for evaluation purposes.

Select ChatOpenAI as the provider and GPT-3.5-Turbo as the model. Configure the temperature value as desired and select the Create a prompt from scratch option. 

You will be able to see the prompt template that the evaluator will operate on. Here, we are using Input, Output and Reference context for evaluations. 

Step 12: Once the evaluator is created, visit the Prompts section and create a new prompt execution environment.

Click on Prompt to start the playground. 

Select your dataset (LS_EVAL) in the prompt playground, configure the OpenAI model as desired, and click on Start.

The execution begins and you can see the generated outputs. 

Step 13: Visit the experiments page to check your evaluator run and its output in detail. 

Step 14: You can see the correctness score (1 being correct and 0 being incorrect), which signifies whether the generated output matches the correct output in our evaluation dataset. 
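For intuition, the comparison behind that score can be sketched as a simple normalized match. Note that LangSmith's correctness evaluator actually grades answers with an LLM judge rather than exact matching, so this is only a hypothetical stand-in:

```python
# Hypothetical exact-match stand-in for a correctness evaluator; LangSmith
# in practice uses an LLM judge to grade semantic equivalence.
def correctness_score(generated: str, reference: str) -> int:
    """Return 1 when the normalized answers match, else 0."""
    normalize = lambda s: " ".join(s.lower().split())
    return int(normalize(generated) == normalize(reference))

print(correctness_score("Ned Stark dies.", "ned  stark dies."))  # → 1
print(correctness_score("Nobody dies.", "Ned Stark dies."))      # → 0
```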

This concludes this hands-on guide to using LangSmith for tracing and evaluating LLMs. LangSmith provides all the necessary tools for observing and assessing an LLM application. Through the traces, we were able to understand the technical parameters and execution flows, and the evaluator helped us determine whether our LLM generated correct results. 

Final Words

LangSmith is an excellent choice for implementing observability and explainability in large language models. Using the tracing and evaluation tools, datasets and prompt playground, users can understand, assess and improve their LLM operations easily and efficiently. Apart from LangSmith, there are other exceptional tools for LLM tracing and evaluation, such as Arize’s Phoenix, Microsoft’s Prompt Flow, OpenTelemetry and Langfuse, which are worth exploring. 


References

  1. Link to the above code
  2. Usable XAI: 10 Strategies Towards Exploiting Explainability in the LLM Era
  3. Explainability for Large Language Models: A Survey
  4. LangSmith Documentation
  5. Defining and Understanding LLM Evaluation Metrics (Microsoft)


Sachin Tripathi

Sachin Tripathi is the Manager of AI Research at AIM, with over a decade of experience in AI and Machine Learning. An expert in generative AI and large language models (LLMs), Sachin excels in education, delivering effective training programs. His expertise also includes programming, big data analytics, and cybersecurity. Known for simplifying complex concepts, Sachin is a leading figure in AI education and professional development.
