Monitoring the Magic: Operating LLM-Driven Products
A brief exploration of what makes LLMs hard to deploy in production for builders and customers, a framework for understanding the problem, and some strategies for making it better.
So, you want to add a chatbot
I’ve been working with LLMs for the past few months across every part of the technology stack: from low-level (PyTorch and ggml) to abstraction and orchestration layers (LangChain); from OpenAI to Claude to Llama and back; from prompt engineering to agent planning; from development to deployment. Now I’m working on monitoring and evaluation.
I’ve also been talking with leaders across every part of the ecosystem: from the LangChain core team to futuristic indie projects at the AGI House to old-school businesses that don’t even have a tech team but are desperate to use the technology.
One thing stands out across the landscape: LLMs are a new kind of product with new kinds of user behavior, business expectations, and engineering requirements. I want to talk about a specific comment I hear from a lot of people, and how it creates a symphony of confusion and opportunity for the designers, operators and engineers trying to build great products.
LLMs are magic…great.
The amazing thing about LLMs is that you don’t know what they’re going to spit out, and the quality so far has been mind-blowing (hence the hype). But if we’re going to build products that people understand and use reliably, we need to start moving towards a framework for engineering things that are unpredictable. Let’s first address some misconceptions about the nature of LLM “magic”.
“Random” does not equal random
For most LLMs today, you can simply set temperature=0 and, in practice, get the same output for the same input nearly every time. So it’s not that the process that generates text is always random; it’s that we don’t really understand how LLMs pick one word over another. People throw around the word “random” when what most of them really mean is “incomprehensible.”
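To make the point concrete, here is a minimal sketch of temperature-scaled sampling over a toy set of next-word scores. The scores and the `sample_token` helper are made up for illustration and are not any provider's API. Hosted models can also still show small variations at temperature 0 for reasons outside the sampling step.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick a token from a dict of token -> score (toy stand-in for model output)."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(logits, key=logits.get)
    # Divide scores by the temperature, then apply a softmax.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    top = max(scaled.values())
    weights = {tok: math.exp(s - top) for tok, s in scaled.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]

# Made-up scores for the word after "The weather today is".
toy_logits = {"sunny": 2.0, "rainy": 1.5, "purple": -1.0}
rng = random.Random(0)

greedy = {sample_token(toy_logits, 0, rng) for _ in range(20)}
sampled = {sample_token(toy_logits, 1.0, rng) for _ in range(200)}

print(greedy)   # {'sunny'}
print(sampled)  # usually more than one distinct token
```

At temperature 0 nothing is random: the same scores always produce the same word. Why the model assigned those scores in the first place is the part nobody can read off.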
As product designers and engineers, we actually don’t care about how the model itself manages to navigate the enormous possible combination of words to write a paragraph. That was the job of the PhDs and research engineers who figured out how to train the model successfully. We care about whether the paragraph generated does what we want it to. Thinking at the right level of abstraction is key to understanding where we should target our efforts.
When we say LLMs are complicated, the hard part for us is not how to generate speech, but how to evaluate and monitor the speech that an LLM has generated. To add to the challenge, the way we’ve evaluated and monitored production services over the last 30 years has been with numbers, math and metrics, and language doesn’t reduce to those easily. In the future we’ll develop better systems that use language to monitor language, but right now we need a framework for building and managing LLM products with existing tools.
Every operator’s problem with LLMs is sensitivity and dimensionality
It’s easier to understand sensitivity by first understanding its opposite: smoothness. Smooth functions are easy to understand: small changes in inputs lead to small changes in output. They make sense to both designers and engineers, because most of the functions we see in the world are smooth: weather doesn’t just go crazy from one minute to the next; the changes are gradual and, as a result, understandable.
In contrast, LLMs are very sensitive, and it’s simple to see why: add just a single word (“not”) and you change the meaning of a sentence entirely. Add an adjective, and the nuance of everything afterwards can shift dramatically.
Now for dimensionality: a conversation can go in an infinite number of directions, and as many have experienced, LLMs get derailed and lose focus easily. What’s even harder is that there are as many good conversations as there are bad ones. It’s easy to monitor something that has only one good stable state; it’s much harder to monitor something that can go right in many different ways.
Now, the challenge for operators (product people and engineers) is to test and monitor all the nuance of language at scale. How can we start the work of taming a temperamental genie in a bottle?
Building smooth, sensitive products
We can start by understanding the specific types of sensitivity that show up in production, and then look at product designs that create smoothness and mitigate sensitivity without compromising the robustness and flexibility of LLMs.
1. Sensitivity compounds over time
For every action or response an LLM needs to generate, the chance it gets a single step wrong may be low, but over time this compounds. On the other side of the coin, users of conversational chatbots in production uniformly complain about agents losing track of conversation.
It’s easy to work out the math: assuming a generously low 5% error rate at any step, for a 10 step sequence, or chain, of agent responses, the probability it never takes a wrong step is <60%. Similarly, assuming that an agent is able to maintain a thread between steps even 90% of the time, then the chance it’s able to maintain the thread over the whole sequence is <35%.
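The arithmetic above can be checked in a couple of lines. This is a small sketch that assumes each step succeeds or fails independently of the others, which is a simplification; the function name is mine.

```python
def chance_of_clean_run(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_success ** steps

# 5% error rate per step, 10 steps.
print(f"{chance_of_clean_run(0.95, 10):.3f}")  # 0.599
# Thread maintained 90% of the time per step, 10 steps.
print(f"{chance_of_clean_run(0.90, 10):.3f}")  # 0.349

# How the odds fall as the chain grows, at a 5% per-step error rate.
for steps in (1, 5, 10, 20):
    print(steps, f"{chance_of_clean_run(0.95, steps):.2f}")
```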
Limit chain lengths
The easiest way to address exploding sensitivity is to limit how long LLM chains can run. Chains where the model feeds inputs from previous outputs are common in autonomous agent paradigms; the shorter you can make them, the less likely they are to get off track. In chat chains (from a person to a model and back), it’s a poor customer experience to limit conversation length, but you can create affordances through user and model guiding that make chat experiences not only less likely to derail, but more intuitive for customers.
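Here is a rough sketch of what a hard cap on chain length can look like. The `step` callable is a hypothetical stand-in for a single model call, and the default of five steps is an arbitrary number rather than a recommendation.

```python
from typing import Callable, List, Optional

def run_chain(
    step: Callable[[str], Optional[str]],
    first_input: str,
    max_steps: int = 5,
) -> List[str]:
    """Feed each output back in as the next input, stopping after max_steps.

    `step` stands in for one model call; it returns None when the chain
    decides on its own that it is finished.
    """
    outputs: List[str] = []
    current = first_input
    for _ in range(max_steps):
        result = step(current)
        if result is None:  # the chain finished by itself
            break
        outputs.append(result)
        current = result
    return outputs

# Stub "model" that never stops by itself, to show the cap taking effect.
def echo_step(text: str) -> Optional[str]:
    return text + "!"

print(run_chain(echo_step, "hi", max_steps=3))  # ['hi!', 'hi!!', 'hi!!!']
```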
Guiding and steering
When designing your LLM apps, prompt the model towards specific goals or ends, and include in the prompt examples of actions that the chat agent should guide the conversation towards. For example, if you’re building an airfare ticket agent, then the prompt should include something like:
Your goal as an airfare agent should be to determine if the user wants to
1. buy a new flight
2. reschedule or cancel an existing flight
3. upgrade their seats
This helps keep the agent on track. When the agent detects a user intent, your app can give the customer clearer affordances on how to proceed, then terminate this specific conversation chain and start a new one with more specific prompting.
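The hand-off from a general triage chain to a narrower one can be sketched as a simple lookup. The intent labels and follow-up prompts here are hypothetical, and I'm assuming the app has already turned the model's reply into one of those labels.

```python
from typing import Optional

# Hypothetical follow-up prompts, one per goal in the airfare prompt above.
FOLLOW_UP_PROMPTS = {
    "buy": "Help the user buy a new flight. Ask for origin, destination and dates.",
    "change": "Help the user reschedule or cancel an existing flight. Ask for the booking reference.",
    "upgrade": "Help the user upgrade their seats. Ask for the booking reference.",
}

def next_prompt(detected_intent: Optional[str]) -> Optional[str]:
    """Return the prompt for a new, narrower chain, or None to keep triaging."""
    if detected_intent is None:
        return None
    # Unknown labels fall through to None rather than guessing a flow.
    return FOLLOW_UP_PROMPTS.get(detected_intent)

print(next_prompt("upgrade"))
print(next_prompt("weather"))  # None
```

Returning None for anything unrecognized keeps the agent in the triage chain instead of pushing the customer into the wrong flow.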
In the same vein, you can prompt agents to avoid certain points, outcomes, topics and tones, or steer them away from undesired behavior.
Setting user expectations and creating user affordances
The first thing I hear when I show ChatGPT to a person who hasn’t seen it before is “well, what do I do with this?” Recalling Chris Dixon: “the next big thing will start out looking like a toy.” Toys are about playing and discovering what an app is capable of. How does an app grow from a toy to a product? Clearly communicate what the app is capable of, and provide designs that reference mental models customers already have. Tell them exactly how your app solves their problem and exactly how to use it.
2. Dimensionality makes evaluation hard
How do you decide whether your new chatbot is working? Of course there are the existing business KPIs, but now you have a lot more nuance to track: does your agent stay on topic, hallucinate, have the right tone? These are much harder to measure in practice, and the best practices for measuring them are still evolving. On the non-chat side, how do you make sure your prompts are still working when you make modifications? Are you able to identify key metrics that you can measure and monitor over time?
Proxies
One way to make evaluation possible at scale and monitor production performance is to use language models to evaluate responses, reduce their judgments to a metric, and then report on that sampled measurement over time. This lowers the dimensionality. I'm building automated testing and monitoring frameworks around this paradigm with michina. A simple example of this is tone checking.
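def test_respond_to_customer():
    # respond_to_customer is the application function under test.
    customer_message = "I want to buy a campaign poster for Obama."
    response_message = respond_to_customer(customer_message)
    # Ask the tone check to judge how polite the response is.
    tone_check = ToneCheck.check(response_message, "polite")
    # Treat a judgment above 0.5 as a pass.
    assert tone_check.judgment > 0.5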
You can see the full example on GitHub. Engineers can tie these checks to existing monitoring and alarms to start tracking behavior in production.
One major danger here is the McNamara fallacy: picking metrics because they are easy to measure, even when they don’t correlate with the outcomes we really care about, and ignoring whatever is difficult to measure. That’s especially dangerous for customer experience. Monitoring tone or topicality isn’t enough; you also need to measure outcomes.
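To show the "reduce to a metric" step on its own, here is a self-contained sketch that turns per-response judgments into a single pass rate you could chart or alarm on. The `judge` callable and the keyword-based stub are placeholders of mine, not michina's API; in practice the judge would be a model-based check.

```python
from typing import Callable, Iterable

def pass_rate(
    responses: Iterable[str],
    judge: Callable[[str], float],
    threshold: float = 0.5,
) -> float:
    """Fraction of responses whose judged score is above the threshold."""
    scores = [judge(r) for r in responses]
    if not scores:
        return 0.0
    return sum(s > threshold for s in scores) / len(scores)

# Stub judge for illustration only: a crude keyword check, not a real tone model.
def stub_politeness_judge(response: str) -> float:
    text = response.lower()
    return 0.9 if "please" in text or "thank" in text else 0.2

sample = ["Thank you for your order!", "No.", "Please hold on while I check."]
print(round(pass_rate(sample, stub_politeness_judge), 2))  # 0.67
```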
Negative and positive signals
If user interactions start to run too long, or our prompts start going awry in production, we want to trigger alarms. At the same time, we need signals for when our prompts are working correctly and leading to positive outcomes, not just negative ones. We want to monitor KPIs like conversation length as well as continually collect feedback from users to improve prompts and agent planning systems.
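A minimal sketch of tracking one negative and one positive signal side by side might look like this. The class name, the thresholds, and the idea of a boolean "resolved" flag are all assumptions for illustration; a real setup would push these numbers into whatever monitoring system you already run.

```python
from collections import deque

class ConversationMonitor:
    """Rolling window over recent conversations (all thresholds are illustrative)."""

    def __init__(self, max_turns=12, window=100, alarm_rate=0.2):
        self.max_turns = max_turns          # turns beyond this count as "too long"
        self.alarm_rate = alarm_rate        # alarm if this share of the window runs long
        self.recent = deque(maxlen=window)  # (turns, resolved) pairs

    def record(self, turns, resolved):
        self.recent.append((turns, resolved))

    def long_rate(self):
        """Negative signal: share of recent conversations that ran long."""
        if not self.recent:
            return 0.0
        return sum(turns > self.max_turns for turns, _ in self.recent) / len(self.recent)

    def resolved_rate(self):
        """Positive signal: share of recent conversations marked resolved."""
        if not self.recent:
            return 0.0
        return sum(resolved for _, resolved in self.recent) / len(self.recent)

    def should_alarm(self):
        return self.long_rate() > self.alarm_rate

monitor = ConversationMonitor(max_turns=12, window=4, alarm_rate=0.25)
for turns, resolved in [(5, True), (20, False), (7, True), (30, False)]:
    monitor.record(turns, resolved)
print(monitor.long_rate(), monitor.resolved_rate(), monitor.should_alarm())  # 0.5 0.5 True
```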
3. Sensitivity and dimensionality are discovered, not determined
How do you know what words lead to major changes in output that you want to avoid? Similarly, what words or phrases regularly lead to good behavior? Words and phrases in prompts that have outsized influence on the outcome are triggers. They’re difficult to predict in advance because of how sensitive LLMs are and how large the space of sequences can be, but we can discover them by monitoring in production. A/B tests may be a way to determine which patterns or phrasings lead to which kinds of outcomes in practice.
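If you do run an A/B test between two prompt phrasings, one common way to compare them is a two-proportion z-test on a success metric such as resolved conversations. The sketch below uses made-up counts and is one possible analysis, not the only reasonable one.

```python
import math

def two_proportion_z(success_a: int, total_a: int, success_b: int, total_b: int) -> float:
    """z statistic for the difference in success rate between variants B and A."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    # Pooled rate under the assumption that both variants perform the same.
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        return 0.0
    return (p_b - p_a) / std_err

# Made-up counts: prompt A resolved 120 of 400 chats, prompt B resolved 150 of 400.
z = two_proportion_z(120, 400, 150, 400)
print(f"z = {z:.2f}")
```

As a rule of thumb, an absolute z value above roughly 1.96 corresponds to the usual 5% two-sided significance level, though with many phrasings in play you would want to be careful about testing too many variants at once.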
Bonus: understand practical implications of theory and research
New research comes out every day on how LLMs behave in the margins. Being able to keep up to date with what researchers have discovered about the emergent behavior of transformers, their limitations and how to mitigate them is key to fast product iteration. On the frontier, staying abreast of tech is a combination of talking to users and watching arXiv. A simple example is how influential chain-of-thought prompting has been on prompt engineering. A deeper example: this paper determined a certain class of problems that transformers can’t solve, and they followed up with more practical implications for identifying model hallucinations.
The future is blended
As with all hype trains, some apps are trying to smush existing square products into the round hole of LLMs where they don’t really fit, due to poor affordances, missing mental models and opaque definitions of success. Not everything needs a chatbot. The future will be apps that know when to use which tools the right way, blending LLMs and existing design patterns.
Similarly, there’s a mismatch between how we operate square products and how we need to operate round LLMs. To move fast today, we need to make the round peg of LLMs fit in the square hole of metrics, monitoring and existing operational practices. By finding ways to tackle sensitivity and reduce dimensionality, we can build products that are easier for customers to use and easier for teams to operate. As we evolve, the right round pegs will emerge: new paradigms for evaluating and monitoring LLMs will become clearer as more companies experiment with their apps. I think probabilistic testing and fuzzing 2.0 are on the right track.
If you're working with LLMs in production, I'd love to chat. Find me on Twitter.