Eval’s where it’s at
Somewhat drowned out by all the chatter and hype about vibe coding, a much more important conversation is picking up steam, and I'm glad to see it.
Five months ago already, Eugene Yan, a Principal Applied Scientist at Amazon, shared this insight on LinkedIn:
Evaluating LLM output is hard. For many teams, it's the bottleneck to scaling AI-powered product.
The topic didn't get much attention in the months that followed, but lately I've seen engineers and product managers discussing it much more.
Evals tie in nicely to yesterday's post about tight feedback loops. If you cannot quickly and automatically evaluate your AI models, you're in long-runout territory where feedback comes too late for comfort.
In a way, evals are to AI what unit tests are to traditional software. They're the written-down assumptions and constraints we want to put on our system, and they allow us to check whether a change we're contemplating moves us toward or away from our goals.
Of course, they are very different in another way. Unit tests are deterministic and check a deterministic path through the code: "Given this input, verify that the output is such and such." The evaluation of an LLM cannot be expressed like that, and it gets even more complex when we want to evaluate the effectiveness of an AI agent.
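For contrast, here is what that determinism looks like in practice; the function and test below are a made-up illustration, not code from any particular project.

```python
# A deterministic unit test: the same input yields the same output on every run,
# so the assertion needs no judgment at all.
def slugify(title: str) -> str:
    return title.strip().lower().replace(" ", "-")

def test_slugify():
    assert slugify("Tight Feedback Loops") == "tight-feedback-loops"
```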
LLMs to their rescue
So what are we to do? Imagine building a special AI tool to turn the abstract of a scientific paper into an eye-catching blurb for LinkedIn to help your research department's social media team. The output will be different each time, and you can't hard-code an evaluation that replies with "good" or "bad."
But you can use another LLM as the judge, or, possibly, two separate judges:
One judge will check that the social media blurb does not hallucinate.
Another judge will check that the blurb is catchy.
It's easiest to issue a Pass/Fail here, but of course, you can ask the LLM for more nuance.
For more complex tasks, imagine having several judges, each focusing on one aspect. The advantage of this is that each one will have a less complex prompt.
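To make that concrete, here is a minimal sketch of the two judges, assuming the OpenAI Python client; the prompts, the call_llm helper, and the model choice are illustrative placeholders, and any provider with a chat API would do just as well.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def call_llm(prompt: str) -> str:
    # One judge call per verdict; temperature 0 keeps the verdicts reasonably stable.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: use whichever model you trust as a judge
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content or ""

HALLUCINATION_PROMPT = """You are a strict fact-checker.

Abstract:
{abstract}

Blurb:
{blurb}

Does the blurb make any claim that is not supported by the abstract?
Answer with exactly one word: PASS if every claim is supported, FAIL otherwise."""

CATCHINESS_PROMPT = """You are a social media editor.

Blurb:
{blurb}

Would this blurb make a LinkedIn reader stop scrolling?
Answer with exactly one word: PASS or FAIL."""

def judge_hallucination(abstract: str, blurb: str) -> bool:
    verdict = call_llm(HALLUCINATION_PROMPT.format(abstract=abstract, blurb=blurb))
    return verdict.strip().upper().startswith("PASS")

def judge_catchiness(blurb: str) -> bool:
    verdict = call_llm(CATCHINESS_PROMPT.format(blurb=blurb))
    return verdict.strip().upper().startswith("PASS")
```

Keeping each judge's prompt this narrow is what makes the verdicts easier to trust and easier to debug.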
Once the evals are in place, you can start iterating on the AI system you're building by grabbing a set of test abstracts and experimenting (a sketch of such a harness follows the questions below).
Do the eval results improve if you change the underlying model from GPT-4.5 to Claude 3.7 Sonnet?
What about tweaking the prompt?
Have we hit diminishing returns, and is it time to fine-tune the base model?
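To answer questions like these with numbers rather than gut feel, a small harness can score each candidate configuration against the same fixed set of abstracts. This is again only a sketch: it reuses the hypothetical judges above, generate_blurb stands in for whatever system you're testing, and TEST_ABSTRACTS is the curated set of test abstracts you assemble once and reuse.

```python
# Score a candidate configuration over a fixed set of test abstracts.
# A blurb counts as a pass only if it clears both judges.
def evaluate(generate_blurb, abstracts: list[str]) -> float:
    passes = 0
    for abstract in abstracts:
        blurb = generate_blurb(abstract)
        if judge_hallucination(abstract, blurb) and judge_catchiness(blurb):
            passes += 1
    return passes / len(abstracts)

# Compare two candidate setups, e.g. a prompt tweak or a different base model:
#   score_v1 = evaluate(blurb_with_prompt_v1, TEST_ABSTRACTS)
#   score_v2 = evaluate(blurb_with_prompt_v2, TEST_ABSTRACTS)
#   print(f"prompt v1: {score_v1:.0%}   prompt v2: {score_v2:.0%}")
```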
Tying these decisions and the experimentation with models and prompts to concrete evals is the essential piece that turns the black art of AI whispering back into an engineering discipline.