Brief summary
Generative AI's popularity has led to a renewed interest in quality assurance — perhaps unsurprising given the inherent unpredictability of the technology. This is why, over the last year, the field has seen a number of techniques and approaches emerge, including evals, benchmarking and guardrails. While these terms all refer to different things, grouped together they all aim to improve the reliability and accuracy of generative AI.
To discuss these techniques and the renewed enthusiasm for testing across the industry, host Lilly Ryan is joined by Shayan Mohanty, Head of AI Research at Thoughtworks, and John Singleton, Program Manager for Thoughtworks' AI Lab. They discuss the differences between evals, benchmarking and testing and explore both what they mean for businesses venturing into generative AI and how they can be implemented effectively.
Learn more about evals, benchmarks and testing in this blog post by Shayan and John (written with Parag Mahajani).
Episode transcript
Lilly Ryan: Welcome to the Thoughtworks Technology Podcast. I'm your host, Lilly Ryan, and I'm speaking to you from Wurundjeri Country in Australia. Today, we'll discuss benchmarks, evals, tests, and what it really comes down to: the renewed interest and investment in quality assurance and testing that's been sparked by businesses' attempts to put generative AI-backed solutions into production. To guide us in that discussion, we are talking to Shayan Mohanty and John Singleton. Shayan is Head of AI Research and John is Program Manager at the Thoughtworks AI Lab. Both are former co-founders of Watchful. Welcome.
John Singleton: Hey.
Shayan Mohanty: Thank you so much for having us.
John: Looking forward to it.
Lilly: Could you tell our listeners a bit about who you folks are, your background in the industry, and what you're working on at Thoughtworks?
John: Yes, I'll kick it off. Like Shayan, I came into Thoughtworks through the acquisition of Watchful, the company we co-founded, which originally started off helping automate the process of labeling data. We operated that for about five and a half, six years, and are now part of Thoughtworks as of, almost to the day, actually to the day, eight months. Huzzah, this is our eight-month anniversary. It's been super exciting.
Prior to Watchful, I did a number of startups in sales, marketing, and operations roles, and even worked for a period of time with the inventor of selective laser sintering 3D printing, Dr. Carl Deckard, making ink for 3D printers. Now I am principal program manager here at Thoughtworks, helping proliferate and manage all the amazing work that Shayan and the research team are heading up.
Shayan: Yes. I am Shayan. I'm the previous CEO and co-founder of Watchful, along with John. Did all the stuff there that you'd expect: tried to build a company, did the thing, built a product, got the company sold. Whoo, and now we work on really cool stuff at Thoughtworks, obviously. Before all of that, I used to work at Facebook. I led the stream processing team that ended up building the ads network infrastructure for all the ads calculation work that needs to happen for the advertising partners, on the order of 5 petabytes of data per day.
Then after that, I built what's called the MAD team, which is machine learning, AI, data. That was sort of to sit between FAIR, the Facebook AI Research lab, and Applied Machine Learning. All that to say, a long history in distributed systems land meets machine learning and AI. Now super focused on AI research and moving the AI industry forward. Just to maybe tee up the conversation here, our work is very centered around interpretability, evaluation, and just assessing the reliability of GenAI-based systems.
Lilly: I like that way of putting it because it cuts through a lot of the buzzwords that we've got kicking around, benchmarks, evals, tests. There's quite a lot of hype around all of these kinds of things. One thing I like about the show is that we can cut through it and talk about how we might apply something in practice. From your point of view, what's the difference between benchmarks, evals, and testing, and how do people go about putting them to practice?
Shayan: I want to first start with testing, because I think that's the most obvious foundation that everyone knows and loves. Obviously, I'm going to paint with a super broad brush here. When you talk about testing, you're usually talking about something that is, in some way, validating or invalidating some behavior. There is a known good type of behavior that you're literally testing for, and there are pass/fail criteria, in some capacity, on these tests. That's the assumption you build around things like CI/CD, and so on.
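To ground that distinction, here's a minimal sketch of the kind of deterministic, pass/fail test being described, written in Python with pytest; the `apply_discount` function is purely hypothetical and used only for illustration:

```python
# test_discounts.py -- deterministic pass/fail tests of the kind that run in CI/CD.
# `apply_discount` is a hypothetical function used purely for illustration.

import pytest

def apply_discount(price: float, percent: float) -> float:
    """Return the price after applying a percentage discount."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

def test_known_good_behaviour():
    # There is a single correct answer, so the test either passes or fails.
    assert apply_discount(100.0, 25.0) == 75.0

def test_invalid_input_is_rejected():
    with pytest.raises(ValueError):
        apply_discount(100.0, 150.0)
```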
This is just normal software testing that everyone knows and loves. Again, notwithstanding a whole bunch of other types of testing, which are all useful in their own rights. Again, just in the spirit of talking about shapes, you can think of testing as for validating broadly. Then I want to talk about benchmarks, because that is sort of the thing that everyone jumps to when you think about "testing" for GenAI.
Lilly: Can this model pass the bar?
Shayan: Yes, exactly. It's like, can this model pass the bar? It could also just be like, you have some AI salesperson knocking on your inbox and they're like, "Hey, I've got the new hotness in models and it beat GLUE, it beat this, it beat that, it beat such and such, it outperforms Llama 405B on such and such benchmark." There's a whole industry now that has been manufactured around the idea of gaming benchmarks.
We should be very clear about what a benchmark actually is. It is a data set, ostensibly. It's a data set with inputs and some criteria around scoring the outputs. That's the whole thing. The thing about benchmarks is that, by definition, they are, in some way, open, meaning people have access to that data. Meaning, in theory, if you wanted to outperform such and such benchmark (this is not to say you can do this with every benchmark, but for most benchmarks that are out there), what you can do is just make sure that data is included in your training set in some way. That, in some ways, guarantees high performance, in some capacity, on that benchmark.
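As a rough illustration of "a data set with inputs and scoring criteria," here is a minimal benchmark harness in Python; `ask_model` is a hypothetical stand-in for whichever model is being evaluated:

```python
# A benchmark, reduced to its essentials: labelled examples plus a scoring rule.

BENCHMARK = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "2 + 2 = ?", "expected": "4"},
]

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to the model under evaluation."""
    raise NotImplementedError

def run_benchmark() -> float:
    """Score = fraction of examples whose output contains the expected answer."""
    hits = 0
    for example in BENCHMARK:
        output = ask_model(example["input"])
        if example["expected"].lower() in output.lower():
            hits += 1
    return hits / len(BENCHMARK)

# Because the examples and the scoring rule are public, a model trained on this
# exact data can score highly without being any better on your use case.
```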
We should think about what benchmarks are actually good for. That's not to just hand-wave and be like, "Oh, benchmarks are completely useless." They are often useful tools. Just to be very clear, there are two core limitations to benchmarks. One is that you can't over-rely on a model's ability to perform on a benchmark just because, again, it's so easy to game. The second is that just because a model performed well on a benchmark does not necessarily mean that that same model will perform well on your use case, right?
I'm being very specific about my choice in words here when I say model versus an application, right, because when we talk about performance, we have to also consider the backdrop of the application itself. All the ways that you might manipulate a context window, the retrieval part of RAG, for instance. We have to start thinking about all of that in a broader context when we start talking about reliability, because it's not just the model that matters.
When you think about benchmarks, again, with the broad brush here, you should generally think about model to model comparison. If testing is for validating, benchmarks are for comparing, let's say, then what are evals, right? What are some examples? Once again, I keep caveating with this, but very, very broad brush.
There's so much nuance and so much gray area here, but just in the spirit of keeping things simple: in the evals world, when we talk about evals, we generally talk about metrics, or some sort of score, or something like that, some numbers that describe the properties of the thing you're trying to build. These could be directly against a model; they could also be directly against an application.
You could be measuring properties of an application where a model is part of it, or you could be testing the model itself; it just depends on what you're trying to do. There are intrinsic and extrinsic metrics, okay? Intrinsic metrics are things like ROUGE and BLEU: scores that you can calculate independent of knowing anything about the use case, anything about really anything, right? Perplexity is a metric. You don't have to know anything about, is this model trying to classify? Is it not? What's the application trying to do? You don't even know any of that stuff. These are things that you can always very easily calculate.
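To make the intrinsic case concrete, perplexity can be computed from nothing but the model's own token probabilities. A minimal sketch using the Hugging Face transformers library with GPT-2 (any causal language model would do):

```python
# Perplexity as an intrinsic metric: exp(average negative log-likelihood).
# No knowledge of the application or task is required.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy loss.
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss).item()
print(f"Perplexity: {perplexity:.2f}")
```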
Extrinsic are things like user feedback as part of an application structure. Maybe you have a thumbs up, thumbs down type of user interaction. You produce some output, and then the user says, "No, that was a bad output." That's extrinsic evaluation that you can then include as part of an internal evaluation data set, for instance. I want to be clear here. Testing is broadly for validating. Pass-fail criteria. That assumes that you know something about what good looks like.
One of the hard parts about building with LLMs is that you don't always know what good looks like, or at least you know it when you see it, and you can't really articulate how exactly it's good or bad or something like that without, again, seeing it. Testing is for validating. Benchmarks are for comparing. Evals are for understanding.
Before you're able to assert anything, you probably want to just understand what is the normal sort of set of parameters, what metrics are normal for you to see. What's abnormal? What does that look like from the perspective of almost these features that you've tooled in? You can always build tests on top of evals, right? You can have a metrics calculation step in CI/CD. Then you can test against that. You can be like, "Okay, normal operating range for perplexity needs to be between here and here." You only know that because you spent some time basically profiling your application and typical use cases.
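That "tests on top of evals" pattern might look something like the following sketch, where the operating range comes from profiling your own application; the function name and thresholds are hypothetical:

```python
# test_eval_gates.py -- a test built on top of an eval metric.
# compute_mean_perplexity is a hypothetical metrics step run against a fixed
# set of representative prompts; the bounds come from profiling the application.

PERPLEXITY_LOWER_BOUND = 5.0    # learned from profiling, not universal constants
PERPLEXITY_UPPER_BOUND = 40.0

def compute_mean_perplexity(prompts: list[str]) -> float:
    """Hypothetical eval step: run the application and average perplexity."""
    raise NotImplementedError

def test_perplexity_within_normal_operating_range():
    prompts = ["Summarise this invoice...", "Draft a reply to the customer..."]
    score = compute_mean_perplexity(prompts)
    assert PERPLEXITY_LOWER_BOUND <= score <= PERPLEXITY_UPPER_BOUND
```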
Lilly: That's something that I think is really worth looking at, because when people talk about benchmarks and evals and testing, the terminology gets confused, so I really appreciate the clear delineation between the three terms that you've spelled out for us. When that comes up in a business context, folks don't always think about how it applies to their context specifically. I think there are a lot of folks looking to generative AI as a way of helping them understand their own context, when they haven't spent that much time on, or haven't really understood the value of, looking at their own context in a way that makes it easy to measure.
When it comes to how you might apply this, not just at the level of the model itself, and you've spoken about a range of different applications for this, but when you're looking at it in the context of a whole business, and John, this is a question I have for you in particular, what do you think people do need to be thinking about for their own business in order to make these kinds of metrics effective?
John: I would say that in the world of GenAI, absolutely nothing has changed. You need to approach this just as you would any other application, GenAI or otherwise, that you're looking to bring into your business. You need to establish a way to understand what ROI looks like, what success looks like, and adhere to those metrics and understand them to their core before you start looking for a problem for this new tool.
I think a lot of people are looking at GenAI right now as a panacea, a thing they can instantly drop in for a wide variety of problems. While on the surface that is fundamentally true, when we start talking about bringing applications to production, the bar and the expectations for doing so are high; it's expensive in the enterprise. We're not talking about you or I doing some brainstorming or producing some boilerplate code. We're talking about potentially thousands, hundreds of thousands, or even millions of users, and a corresponding level of exposure and risk.
I like using this phrase, and for the non-Americans, I've learned this may be an Americanism: you have to determine, is the juice worth the squeeze? It's worth setting the stage a little here, because we went pretty deep, pretty quickly into definitions: why do we care so much about this? Why is this such a fundamentally hard problem? The fundamental thing about evaluations and LLM evals is that businesses have lost the toolkit they had in the realm of classic machine learning models, a toolkit to measure and understand model behavior: when things were working and when they weren't.
When you had a constrained output space, a classifier, for instance, classifying something as good or bad, or deciding whether this should go to accounting versus repairs, you could use tools like accuracy, recall, precision, and F1 scores to get a good understanding of model performance at any given point in time. Now, when we have a virtually infinite output space with this new tool, our friend the transformer model in large language models, we lose a lot of those same guarantees.
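As a reminder of how direct that toolkit is when the output space is constrained, here is a small scikit-learn example on toy labels:

```python
# With a constrained output space (e.g. "accounting" vs "repairs"),
# standard classification metrics give a clear picture of performance.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = ["accounting", "repairs", "repairs", "accounting", "repairs"]
y_pred = ["accounting", "repairs", "accounting", "accounting", "repairs"]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, pos_label="repairs"))
print("Recall   :", recall_score(y_true, y_pred, pos_label="repairs"))
print("F1       :", f1_score(y_true, y_pred, pos_label="repairs"))
```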
Not only that, but from the way we measure the performance of any one particular task an LLM may be performing, all the way up to the entire application's output, we have to make sure that all of these things are brutally adhering to the KPIs we have defined and care about, the ones that actually make the "juice worth the squeeze." We have to have a robust framework, one that is transparent and that we all agree upon, to understand how we're measuring performance within that context.
That's a lot of words to say: really focus on what the actual key problem is, use the right tools at the right time, and understand that if you choose this tool, which is a great fit for a set of problems in the enterprise today, it comes with a new set of muscles and a new set of actions you need to learn to take it into production.
Lilly: I'm always interested in the ways that people get this wrong because with my role in security, I'm always looking at those edge cases and the failure cases and those kinds of things. We're talking about moving into production, and I've seen quite a lot of apps that have reached the proof of concept stage that don't go beyond that point. There's, I think, a lot we can learn from the failures of those proofs of concept that can really inform what does make it to production. What should people be looking at when it comes to what goes wrong, and where do you actually see it going wrong?
Shayan: I'm going to jump in with a couple of things. When you think about evals at the moment, about what the industry is pointing to and what is currently being done, there are some intrinsic metrics, things like perplexity, that are being used. However, I think people's intuitive understanding of how to use those metrics is not always in line with reality.
For instance, perplexity is often used as a measure of uncertainty. In practice, if you're predicting one token or very small token outputs, perplexity is really a measure of things like vocabulary size and a couple of other things. Practically speaking, you might end up with a metric that's not really that interpretable, or that you're interpreting the wrong way.
To John's point, nothing's changed. We've had a very long and storied history as an industry of people misinterpreting metrics like accuracy, precision, recall, F-score, and so on. There's nothing new there, but it's just yet another thing to consider: do you know what these metrics actually do, what they're actually measuring? That's point one, and that's an obvious thing.
A second thing, which is still obvious, but perhaps less obvious, is thinking about where trust starts and ends in a system. I think LLM as a judge has become a common new way of evaluating LLM outputs and things like that. I'm not saying LLM as a judge is foundationally wrong or anything like that. It's just like, again, you have to consider, does that give you the level of coverage and reliability that you need for the given application? In a world where you have a black box, basically evaluating another black box, how have you determined that you can trust that particular assessment? There's that.
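A minimal sketch of the LLM-as-a-judge pattern being described; `call_llm` is a hypothetical wrapper around whatever judge model you choose, and the rubric is deliberately simplistic:

```python
# LLM-as-a-judge: one black box scoring another. The judge's own reliability
# still has to be established before its scores can be trusted.

JUDGE_PROMPT = """You are grading an answer for factual correctness.
Question: {question}
Answer: {answer}
Reply with a single integer from 1 (wrong) to 5 (fully correct)."""

def call_llm(prompt: str) -> str:
    """Hypothetical call to the judge model (e.g. via your provider's SDK)."""
    raise NotImplementedError

def judge_answer(question: str, answer: str) -> int:
    reply = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    try:
        return max(1, min(5, int(reply.strip())))
    except ValueError:
        return 1  # unparseable judge output is itself a signal worth logging
```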
I think the third is on disambiguating preference versus desired outcome, if that makes sense. Let's say that you have an extrinsic evaluation, like your users have indicated that they thumbed down a bunch of responses. What does that actually mean? Does it mean that those outputs are factually incorrect, perhaps the thing that you're trying to guard against? Does it mean that they just didn't like the way it was formatted, for instance? Just collecting the right level of information from these different sources is super important.
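One way to collect that richer signal is to capture a structured reason alongside the thumbs up or down, as in this hypothetical sketch:

```python
# A thumbs-down alone doesn't say whether the output was factually wrong or
# just formatted in a way the user disliked; capturing a reason disambiguates.

from dataclasses import dataclass
from enum import Enum
from typing import Optional

class FeedbackReason(Enum):
    FACTUALLY_INCORRECT = "factually_incorrect"
    BAD_FORMATTING = "bad_formatting"
    OFF_TOPIC = "off_topic"
    OTHER = "other"

@dataclass
class Feedback:
    response_id: str
    thumbs_up: bool
    reason: Optional[FeedbackReason] = None  # only asked for on thumbs-down
    comment: str = ""

# Later, these records can be folded into an internal evaluation data set,
# separating "we got the facts wrong" from "the user preferred a different style".
```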
All that to say, if you squint at this, this is looking and sounding a lot more like a data science problem, right? It's almost shifting away from classical engineering-style testing with pass/fail criteria and so on. Now we're going into many layers of required interpretation of these metrics and what they really mean in practice, which again should not be surprising, because AI has historically been a data-science-dominated field. It makes sense that the understanding of these AI systems, at the moment, lives in data science land. I would just call those three things out. John, do you have anything else you want to add to that?
John: I think it was pretty comprehensive.
Shayan: Yes, thank you.
Lilly: One word that we haven't brought into this discussion yet, but does get used a lot in the context of evals and benchmarks is guardrails. Where do you see those sitting in this suite of buzzwords, and what is the practical application? How are they different? Why are they not part of what you're looking at when it comes to testing?
Shayan: It's a very, very good question. Practically speaking, you can think of guardrails as remediation, right? For instance, in the case of a lot of AI applications, not only do you want to know about failures, ideally very, very early, but in a lot of cases, you won't know, you won't be able to necessarily predict all failures. You're exploring an infinite space at that point and you're just not going to get that far.
Practically speaking, you try and do as much evaluation as you possibly can. You try and get as much coverage as you possibly can, but there's obviously a diminishing return to that. The next thing you do is you say, "Okay, I know I can't possibly predict every possible future, but what I can do is put something alongside this thing when we put it into production, basically a sidecar, for instance, to the model or the application, that keeps it on the straight and narrow."
Now, the reason why that's not testing, or part of evaluation and measuring and so on, is because that's not what it does: it remediates. If something breaks such and such rules or guidelines or guardrails, if you will, then we basically kill the entire chain. We just say, no, don't do that. We perhaps try again. We could force a retry. We could try and inject safety stuff into it. There's a wide variety of different shapes of remediation right now.
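A rough sketch of what that kind of sidecar remediation loop can look like in practice; `generate` is a hypothetical call to the model or application being guarded, and the rules are purely illustrative:

```python
# A simple guardrail sidecar: regex/string rules applied to the output,
# with a bounded retry as the remediation step.

import re

BLOCKED_PATTERNS = [
    re.compile(r"\b\d{16}\b"),                  # looks like a card number
    re.compile(r"(?i)as an ai language model"),  # a style rule, as an example
]

def generate(prompt: str) -> str:
    """Hypothetical call to the model or application being guarded."""
    raise NotImplementedError

def violates_guardrails(text: str) -> bool:
    return any(pattern.search(text) for pattern in BLOCKED_PATTERNS)

def guarded_generate(prompt: str, max_retries: int = 2) -> str:
    for _ in range(max_retries + 1):
        output = generate(prompt)
        if not violates_guardrails(output):
            return output
    # Remediation of last resort: kill the chain rather than return bad output.
    return "Sorry, I can't help with that request."
```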
I will say, though, that from a research perspective, we are interested in understanding this entire topology, if you will. We believe that evaluating and measuring is just one side of the coin, with remediation the other. The question we have in our minds is: can we use the same apparatus both for measuring and for remediation in various ways? As you make your evaluation more and more comprehensive, can you also make your guardrails more and more comprehensive in the same ways? That's one direction that we're going in.
Another thing that we're trying to get a better sense of is: what if guardrails were not a sidecar? Is it possible to put them into the model architecture in some way, or somewhere a little bit closer to the actual underlying operating mechanisms of the model, such that you get a richer ability to guard? In the case of guardrails today, oftentimes what you're talking about is a stack of regexes, maybe coupled with some other string-based rules, and possibly also coupled with another LLM that is involved in some process. You get something that's, in a lot of ways, duct tape and bubble gum, a Frankenstein's monster of a sidecar.
Whereas we think there may be something more holistic that can be done. In the case of a white-box model, for instance, where you're running something like Llama, what if we can target the specific circuit of activations in the model that causes certain types of behavior we find undesirable? Then we just dial that down. Then no matter what the model does, it won't ever activate that particular transformer circuit, which means we won't get the undesirable outputs, just as an example. We think it's all part of the same question, but I think at an industry level, those two things haven't really been connected yet.
Lilly: What you're talking about there is a really specific research question that I find pretty exciting. If somebody wants to go and apply this thing in their context at this point of time, is that something that can be pulled out, or is it an evolving space? Where should we be looking for these kinds of conversations, and where can we be exploring this in our own context?
Shayan: On that particular topic: A, no, unfortunately, there's nothing off the shelf that can be pulled in, and there's no magic yet. I think a lot of people are working on it, not just us. This field is broadly called mechanistic interpretability, and it has various jumping-off points. Obviously, it's not just about interpretability; it's also about alignment in a lot of ways: how do we align models to the outcomes that we desire? Those are the broad keywords.
Now, obviously, you should look to us as well. I'm hoping that we're going to put out some good research in this space. Obviously, it's going to take a little bit of time. Other than that, Anthropic's mechanistic interpretability research group is doing really interesting stuff. I would also say that EleutherAI, which is less a company and more a loose cohort of researchers, has been putting out some phenomenal mechanistic interpretability and alignment work recently. They're worth looking at as well.
In terms of contextualizing it, that's going to be an exercise for the reader, unfortunately. All of these things are so specific, and we're barely scratching the surface of what is even happening under the hood with these models, that anything beyond just looking at existing research is itself a research topic.
Lilly: John, from the stuff that you've seen, from the folks that you've spoken to, what approaches are people taking to get these conversations started when it comes to understanding how a model or a set of models or a set of architectures can work in their context? In some cases, getting the buy-in from the people around them at their businesses and organizations to make it a conversation that is understood at all levels.
I realize we've been speaking a lot here about some fairly low level detail. We also know that we need to extrapolate that and look at the high level business context if we're going to apply it in a way that makes sense and is effective. How are folks starting these conversations, and where would you recommend that people begin if they're at the beginning of this journey, if they listen to this going, "Yes, this is it. This is what we need." Where do we start with that?
John: Ultimately, I think we need to recognize that the buyer for AI has changed. What we call AI today has changed from what it was 5, let alone 10-plus, years ago. Now we have line-of-business owners, even product managers, who have significant budgets and a lot of wherewithal to select tools and tech stacks for what may or may not work. Ultimately, they have a lot of power.
I think it's important to recognize that bringing those business stakeholders along that may not have the depth and breadth of experience within ML and an understanding of what even the art of the possible looks like today is probably one of the most important things. I think one of my favorite things to say is, "No, I don't think that's reasonable."
Basically, you want whoever it is, in our context the client, but in the listener's context maybe your coworker or your manager or someone like that, to get a deep understanding of what the business context is, what application we're actually looking to solve for here, and what level of trust and reliability, to Shayan's point earlier, is necessary to bring this into production. Then compare that against your existing risk profile and how you measure and evaluate that today.
It's making sure that these business stakeholders, who again may not have the necessary technical chops or deep hands-on experience, understand what is and isn't possible, despite whatever they played with on ChatGPT the night before and then came in with as a new project for everybody for the next quarter. Really, and I feel like a broken record, it's making sure that everybody's brought along and aligned on the business KPIs that actually matter, then squaring that up against what's actually reasonable and possible, and effectively measuring and evaluating the risk for that particular application.
Lilly: I mentioned at the top of the show that there's this renewed interest and investment in quality assurance and testing and all of these kinds of things. We know that we're talking about evals, about benchmarks, about all of these things, and that, in a seemingly coordinated fashion, they've risen up out of the adjacent possible to become what they are at the moment. What do you think is really important about why this is happening right now? What does it mean for the industry? What does it mean for the questions we're grappling with at the moment? Where do you think we're going?
Shayan: Okay, there are a couple of pieces to this. One, not to overblow it too hard here, but there's sort of an AI existential threat, maybe two. There's an existential threat to the industry. Then, for the doomers out there, so to speak, there's the existential threat that AI poses to humanity, right?
I say that one a little bit tongue in cheek, but practically speaking, we really do need to think about the ways in which we as humans interpret whether an AI system is working as expected. What does that actually mean? How do we bring that to the average human user, a non-data-science expert who can't reason about metrics like perplexity and so on, but who really does want to understand: is there potentially an issue with the last thing the LLM said to me? Or, ideally, before the LLM or the AI does something wrong, that gets caught, or I at least have some window into how reliable my system is.
Let me sort of start with the existential impact to the industry. Obviously there's a ton of dollars being spent at various levels in the AI industry as a whole at the moment. All the way up and all the way down the stack, like at the very, very bottom, you've got chip manufacturers. Arguably, there are folks even under that, but let's just start there. Very large and powerful companies now. You've got model developers. Think about all the various vendored models that currently exist and who is building them and how much money they've all raised. There's an insane number of dollars that have all been collectively pooled for that type of thing.
Then you've got vertical AI companies, right? Companies utilizing AI for specific use cases in specific industries. If you think about that spectrum, and obviously there are missing pieces of a more holistic picture here, but at least looking at those three segments: they command many, many billions, hundreds of billions of dollars, I don't want to promise anything, but probably nearing, if not exceeding, a trillion at the moment, if we're including market cap, right? That's sort of what we're looking at at the moment.
Now, what do evals, benchmarks, and tests really get you? In a perfect world, it's trust, right? It's that a human or a set of humans can trust that a system is working as expected. Now, at the moment, we have a ton of POCs that are being built, a ton of different experiments that are being tried, a bunch of tiny little credit card swipes and not a whole lot of floods of cash that are coming in from specific use cases and specific ROI.
We have this holding pattern of POC to production. That's not to say all of this is because we haven't yet figured out what the correct metrics are and how to measure AI overall, but that's certainly part of it. We, as people who build things like POCs, can put a finger in the air and do a litmus test, if you will, a little vibe check, and be like, "Oh, this looks generally right." But we lack the conviction that this is ready for production, because it might be risky in various ways, and the risk is, at the moment, very difficult to quantify.
What does this really mean? It means that we've got a ton of things that are basically waiting on the sidelines before they can really be put into production, before ROI can be realized. Now, it's not just that people need the conviction to put it into production. Once it's in production, it needs to be tuned. It needs to be corrected for behaviors, and those behaviors need to be identified. There's an iteration loop that is also missing and that isn't really talked about, but validation, or at minimum evaluation, is part of that. Think about classical REPLs, for instance, in software engineering: you can't really loop unless you've evaluated. All of that is happening.
Now, at an industry level, let's say this doesn't get fixed. All POCs remain POCs. Things don't move to production. That means AI can't be used for the most important problems in the world, where theoretically it could be used for things like massive drug discovery or reinventing interesting fraud detection schemes. There's a wide variety of interesting use cases that might require deep context and really deep integration with AI that just won't come to fruition. ROI won't be seen. That means a bunch of companies that have raised a ton of money, and the people they raised that money from, are all out on the streets, if you will. There's an existential risk at an industry level.
Then, again, at the very end of this, we've got the relationship between humanity and AI. I don't want to overblow this, but there is something to be said about how do we make sure that a largely autonomous system that we treat as a black box is operating as expected, right? A very complicated black box, one that can do a great many things, right? How do we know that it is working as we expect it to? What is it that we expect it to do?
These are existential questions that need to be answered in some capacity, and they need to be answered not only in the context of a use case, but also in the more general context of what we as humans expect out of AI. That's the level at which we're operating as an industry and as researchers: can we answer both questions in one fell swoop? Are we perhaps asking the same set of questions, just with different parameters?
John: Well, that was beautifully put. I guess the only thing I would really add is that, at its core, we're using a lot of the same words, AI reliability, safety, trust, et cetera, while there are different meanings between them. Ultimately, at their core, all of these things are just fundamentally a different form of measurement. It's understanding: can we as humans, in whatever context we apply these new tools, understand what's going on?
Without that fundamental reasoning ability, that understanding, that measurement, and LLMs and transformer models bring a unique complexity to that space, we can't have all of these other things that are so critical, from "my credit card works better" all the way to the more heady and scary stuff Shayan mentioned around how we as a society interact with AI and what humans do in a post-AI world. We are giving these systems more and more of not only our information and data, but even our daily interactions, and we need to be able to build that trust.
Lilly: There's a whole lot more that we could talk about here. Unfortunately, we are out of time for it. I want to thank you both so much for joining me for this episode of the Thoughtworks Technology Podcast. Have a great day.
John: Thanks so much.
Shayan: Thank you. Thank you for having us.
John: It was fun.