What is LangSmith? Tracing and debugging for LLMs

Use LangSmith to debug, test, evaluate, and monitor chains and intelligent agents in LangChain and other LLM applications.

In my recent introduction to LangChain, I touched briefly on LangSmith. Here, we'll take a closer look at the platform, which works in tandem with LangChain and can also be used with other LLM frameworks.

My quick take on LangSmith is that you can use it to trace and evaluate LLM applications and intelligent agents and move them from prototype to production.

As of this writing, there are three implementations of LangChain in different programming languages: Python, JavaScript, and Go. We'll use the Python implementation for our examples.

LangSmith with LangChain

So, basics. After I set up my LangSmith account, created my API key, updated my LangChain installation with pip, and set up my shell environment variables, I tried to run the Python quickstart application:

from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI()
llm.predict("Hello, world!")

I took the hint from the timeouts, and went to my OpenAI account and upgraded my ChatGPT plan to ChatGPT Plus ($20 per month). That gave me access to GPT-4 and the ChatGPT plugins, but my program still didn’t run. I left it turned on: I suspect I’ll need the additional capabilities.

Next, I remembered that the OpenAI API plan is separate from the ChatGPT plan, so I upgraded that as well, adding $10 to the account and setting it up to replenish itself as needed. Now the Python program ran to completion, and I was able to see the successful results in LangSmith.
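For reference, the environment setup mentioned earlier can also be done from Python, which is handy in a notebook. The following is a minimal sketch, assuming the variable names LangChain documented for LangSmith tracing at the time of writing (LANGCHAIN_TRACING_V2, LANGCHAIN_ENDPOINT, LANGCHAIN_API_KEY, LANGCHAIN_PROJECT); the helper name configure_langsmith and the key value are hypothetical.

```python
import os


def configure_langsmith(api_key: str, project: str = "default") -> dict:
    """Set the environment variables that switch on LangSmith tracing.

    The variable names are assumptions taken from the LangSmith docs;
    check the current documentation before relying on them.
    """
    settings = {
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_API_KEY": api_key,
        "LANGCHAIN_PROJECT": project,
    }
    os.environ.update(settings)
    return settings


# Placeholder key for illustration only; use your real LangSmith API key.
configure_langsmith("ls-placeholder-key", project="quickstart")
```

The OpenAI key (OPENAI_API_KEY) is a separate setting, just as the OpenAI API plan is billed separately from ChatGPT.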

Looking at the metadata tab for this run told me that it ran the “Hello, world!” prompt against the gpt-3.5-turbo model at a sampling temperature of 0.7. The OpenAI chat API accepts temperatures from 0 to 2: higher values make the output more random, and values near 0 make it more focused and deterministic.
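To get a feel for what sampling temperature does, here is a rough sketch of temperature-scaled softmax over a few made-up token scores. It illustrates the general technique, not OpenAI's actual implementation, and treating a temperature of 0 as "always pick the top token" is an assumption of this sketch.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Assumption for this sketch: zero temperature means greedy selection.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring token; higher temperatures spread it across the alternatives.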

Overview of LangSmith

LangSmith logs all calls to LLMs, chains, agents, tools, and retrievers in a LangChain or other LLM program. It can help you debug an unexpected end result, determine why an agent is looping, figure out why a chain is slower than expected, and tell you how many tokens an agent used.
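As a rough illustration of the token and latency questions above, the sketch below summarizes a batch of run records. It assumes each record exposes total_tokens, start_time, and end_time attributes, which is how I understand the run objects returned by the LangSmith client; the helper summarize_runs and the sample records are hypothetical stand-ins so the code runs offline.

```python
from datetime import datetime, timedelta
from types import SimpleNamespace


def summarize_runs(runs):
    """Return the run count, total tokens, and slowest latency in seconds."""
    count = 0
    total_tokens = 0
    slowest = 0.0
    for run in runs:
        count += 1
        # Token counts may be missing (None) on runs that made no LLM call.
        total_tokens += getattr(run, "total_tokens", None) or 0
        start = getattr(run, "start_time", None)
        end = getattr(run, "end_time", None)
        if start is not None and end is not None:
            slowest = max(slowest, (end - start).total_seconds())
    return {"runs": count, "total_tokens": total_tokens, "slowest_seconds": slowest}


# Stand-in records; real ones would come from the LangSmith client.
t0 = datetime(2023, 9, 1, 12, 0, 0)
sample_runs = [
    SimpleNamespace(total_tokens=120, start_time=t0, end_time=t0 + timedelta(seconds=1.5)),
    SimpleNamespace(total_tokens=None, start_time=t0, end_time=t0 + timedelta(seconds=4)),
]
print(summarize_runs(sample_runs))
```

With a live account, the same function could be fed the iterator returned by the client's run listing call for a given project.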

LangSmith provides a straightforward visualization of the exact inputs and outputs to all LLM calls. You might think that the input side would be simple, but you’d be wrong: In addition to the input variables (prompt), an LLM call uses a template and often auxiliary inputs; for example, information retrieved from the web or from uploaded files, and system prompts that set the context for the LLM.

In general, you should keep LangSmith turned on for all work with LangChain—you only have to look at the logs when they matter. One of the useful things you can try, if an input prompt doesn’t give you the results you need, is to take the prompt to the Playground. Use the button at the top right of the LangSmith trace page to navigate to the Playground.

Don’t forget to add your API keys to the website using the Secrets API Keys button. Note that playground runs are stored in a separate LangSmith project.

LangSmith LLMChain example

In my introduction to LangChain, I gave the example of an LLMChain that combines a ChatOpenAI call with a simple comma-separated list parser. Looking at the LangSmith log for this Python code helps us understand what's happening in the program.

The parser is a subclass of the BaseOutputParser class. The system message template for the ChatOpenAI call is fairly standard prompt engineering.

from langchain.chat_models import ChatOpenAI
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.chains import LLMChain
from langchain.schema import BaseOutputParser

class CommaSeparatedListOutputParser(BaseOutputParser):
    """Parse the output of an LLM call to a comma-separated list."""
    def parse(self, text: str):
        """Split the model's text output on ", " into a Python list."""
        return text.strip().split(", ")

# System message: tells the model to answer only with a comma-separated list.
template = """You are a helpful assistant who generates comma separated lists.
A user will pass in a category, and you should generate 5 objects in that category in a comma separated list.
ONLY return a comma separated list, and nothing more."""
system_message_prompt = SystemMessagePromptTemplate.from_template(template)
# Human message: the category supplied by the user at run time.
human_template = "{text}"
human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)
chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt, human_message_prompt])
# Combine the model, the prompt, and the parser into one chain.
chain = LLMChain(
    llm=ChatOpenAI(),
    prompt=chat_prompt,
    output_parser=CommaSeparatedListOutputParser(),
)
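To see what the parser does in isolation, here is a small sketch that needs neither LangChain nor an API key. The first function mirrors the parse method's logic; the second, tolerant variant is my own addition for illustration and is not part of the original example.

```python
def parse_comma_separated(text: str) -> list:
    """Same logic as the parser above: strip, then split on comma-space."""
    return text.strip().split(", ")


def parse_comma_separated_tolerant(text: str) -> list:
    """Hypothetical variant that copes with missing or extra spaces."""
    return [item.strip() for item in text.split(",") if item.strip()]


print(parse_comma_separated("red, blue, green, yellow, purple\n"))
print(parse_comma_separated("red,blue"))           # no space after the comma
print(parse_comma_separated_tolerant("red,blue , green,"))
```

The strict version depends on the model following the system prompt exactly, which is one reason it helps to inspect the raw LLM output in the LangSmith trace when the parsed list looks wrong.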

LangSmith evaluation quickstart

This walkthrough evaluates a chain using a dataset of examples. First, it creates a dataset of example inputs, then defines an LLM, chain, or agent for evaluation. After configuring and running the evaluation, it reviews the traces and feedback within LangSmith. Let’s start with the code. Note that the dataset creation step can only be run once, as it lacks the ability to detect an existing dataset by the same name.

from langsmith import Client

example_inputs = [
  "a rap battle between Atticus Finch and Cicero",
  "a rap battle between Barbie and Oppenheimer",
  "a Pythonic rap battle between two swallows: one European and one African",
  "a rap battle between Aubrey Plaza and Stephen Colbert",
]
client = Client()
dataset_name = "Rap Battle Dataset"
# Storing inputs in a dataset lets us
# run chains and LLMs over a shared set of examples.
dataset = client.create_dataset(
    dataset_name=dataset_name, description="Rap battle prompts.",
)
for input_prompt in example_inputs:
    # Each example must be unique and have inputs defined.
    # Outputs are optional
    client.create_example(
        inputs={"question": input_prompt},
        outputs=None,
        dataset_id=dataset.id,
    )

from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
# Since chains and agents can be stateful (they can have memory),
# create a constructor to pass in to the run_on_dataset method.
def create_chain():
    llm = ChatOpenAI(temperature=0)
    return LLMChain.from_string(llm, "Spit some bars about {input}.")

from langchain.smith import RunEvalConfig, run_on_dataset
eval_config = RunEvalConfig(
    evaluators=[
        # You can specify an evaluator by name/enum.
        # In this case, the default criterion is "helpfulness"
        "criteria",
        # Or you can configure the evaluator
        RunEvalConfig.Criteria("harmfulness"),
        RunEvalConfig.Criteria(
            {"cliche": "Are the lyrics cliche?"
            " Respond Y if they are, N if they're entirely unique."}
        ),
    ]
)
# Run the chain over every example in the dataset and apply the evaluators.
run_on_dataset(
    client=client,
    dataset_name=dataset_name,
    llm_or_chain_factory=create_chain,
    evaluation=eval_config,
    verbose=True,
)

We have a lot more to look at for this example than the last one. The above code uses a dataset, runs the model against four prompts from the dataset, and runs multiple evaluations against each generated rap battle result.
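The once-only dataset creation step noted above can be worked around with a get-or-create helper. The sketch below assumes the client offers a list_datasets method that accepts a dataset_name filter, alongside the create_dataset call used in the walkthrough; the helper name get_or_create_dataset and the FakeClient stub are hypothetical, and the stub exists only so the example runs without a LangSmith account.

```python
def get_or_create_dataset(client, name, description=""):
    """Return (dataset, created) and reuse any dataset with this exact name."""
    for dataset in client.list_datasets(dataset_name=name):
        return dataset, False  # found an existing dataset; do not recreate
    return client.create_dataset(dataset_name=name, description=description), True


class FakeClient:
    """Offline stand-in for the LangSmith client (illustration only)."""

    def __init__(self):
        self._datasets = {}

    def list_datasets(self, dataset_name=None):
        return iter([d for n, d in self._datasets.items() if n == dataset_name])

    def create_dataset(self, dataset_name, description=""):
        dataset = {"name": dataset_name, "description": description}
        self._datasets[dataset_name] = dataset
        return dataset


fake = FakeClient()
first, created_first = get_or_create_dataset(fake, "Rap Battle Dataset", "Rap battle prompts.")
second, created_second = get_or_create_dataset(fake, "Rap Battle Dataset")
print(created_first, created_second)
```

When the helper reports that the dataset already existed, skip the loop that adds examples so the prompts are not duplicated.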

Here are the evaluation statistics, which were printed in the terminal during the run:

Eval quantiles:
             0.25  0.5  0.75  mean  mode
harmfulness  0.00  0.0   0.0  0.00   0.0
helpfulness  0.75  1.0   1.0  0.75   1.0
cliche       1.00  1.0   1.0  1.00   1.0
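To read a row of this table, it helps to recompute one. Each criterion is scored 1 (Y) or 0 (N) per example, so with four prompts a helpfulness mean of 0.75 is consistent with three 1s and one 0. The scores below are a hypothetical reconstruction rather than the actual run data, and the sketch assumes quantiles are linearly interpolated between sorted scores.

```python
import statistics

# Hypothetical per-example helpfulness scores (1 = Y, 0 = N), chosen to be
# consistent with the helpfulness row of the table; not the real run data.
scores = [1, 1, 1, 0]

# "inclusive" interpolates linearly between the sorted data points.
q25, q50, q75 = statistics.quantiles(scores, n=4, method="inclusive")
summary = {
    "0.25": q25,
    "0.5": q50,
    "0.75": q75,
    "mean": statistics.mean(scores),
    "mode": statistics.mode(scores),
}
print(summary)
```

Under these assumptions the sketch reproduces the helpfulness row: 0.75, 1.0, 1.0, with a mean of 0.75 and a mode of 1.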

The LangSmith Cookbook

While the standard LangSmith documentation covers the basics, the LangSmith Cookbook repository delves into common patterns and real-world use cases. You should clone or fork the repo to run the code. The cookbook covers tracing your code without LangChain (using the @traceable decorator); using the LangChain Hub to discover, share, and version control prompts; testing and benchmarking your LLM systems in Python and TypeScript or JavaScript; using user feedback to improve, monitor, and personalize your applications; exporting data for fine-tuning; and exporting your run data for exploratory data analysis.
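As a taste of the first cookbook topic, here is a minimal sketch of tracing a plain Python function with the @traceable decorator, assuming it can be imported from langsmith.run_helpers. The function format_greeting is hypothetical, and the no-op fallback is my own addition so the sketch still runs where the langsmith package is not installed.

```python
try:
    # Assumed import path for the decorator in the langsmith package.
    from langsmith.run_helpers import traceable
except ImportError:
    # Fallback for illustration only: a decorator factory that does nothing.
    def traceable(**_kwargs):
        def decorator(func):
            return func
        return decorator


@traceable(run_type="chain", name="format_greeting")
def format_greeting(name: str) -> str:
    """An ordinary function; with tracing enabled, each call is logged as a run."""
    return f"Hello, {name}!"


print(format_greeting("world"))
```

With the tracing environment variables set, calls to the decorated function should appear in the LangSmith project alongside any LangChain runs; without them, the function simply runs as usual.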


LangSmith is a platform that works in tandem with LangChain or by itself. In this article, you've seen how to use LangSmith to debug, test, evaluate, and monitor chains and intelligent agents in a production-grade LLM application.
