Brief summary
Generative AI's popularity has led to a renewed interest in quality assurance — perhaps unsurprising given the inherent unpredictability of the technology. This is why, over the last year, the field has seen a number of techniques and approaches emerge, including evals, benchmarking and guardrails. While these terms all refer to different things, grouped together they all aim to improve the reliability and accuracy of generative AI.
To discuss these techniques and the renewed enthusiasm for testing across the industry, host Lilly Ryan is joined by Shayan Mohanty, Head of AI Research at Thoughtworks, and John Singleton, Program Manager for Thoughtworks' AI Lab. They discuss the differences between evals, benchmarking and testing and explore both what they mean for businesses venturing into generative AI and how they can be implemented effectively.
Learn more about evals, benchmarks and testing in this blog post by Shayan and John (written with Parag Mahajani).
Episode transcript
Lilly Ryan: Welcome to the Thoughtworks Technology Podcast. I'm your host, Lilly Ryan, and I'm speaking to you from Wurundjeri Country in Australia. Today, we'll discuss benchmarks, evals, tests, and what it really comes down to: the renewed interest and investment in quality assurance and testing that's been sparked by businesses' attempts to put generative AI-backed solutions into production. To guide us in that discussion, we are talking to Shayan Mohanty and John Singleton. Shayan is Head of AI Research and John is Program Manager at the Thoughtworks AI Lab. Both are former co-founders of Watchful. Welcome.
John Singleton: Hey.
Shayan Mohanty: Thank you so much for having us.
John: Looking forward to it.
Lilly: Could you tell our listeners a bit about who you folks are, your background in the industry, and what you're working on at Thoughtworks?
John: Yes, I'll kick it off. Like Shayan, I came into Thoughtworks through the acquisition of Watchful, the company we co-founded, which originally started off helping automate the process of labeling data. We operated that for about five and a half, six years, and are now part of Thoughtworks as of, almost to the day, actually to the day, eight months. Huzzah, this is our eight-month anniversary. It's been super exciting.
Prior to Watchful, I did a number of startups in sales, marketing, and operations roles, and even worked for a period of time with the inventor of selective laser sintering 3D printing, Dr. Carl Deckard, making ink for 3D printers. Now I am principal program manager here at Thoughtworks, helping proliferate and manage all the amazing work that Shayan and the research team are heading up.
Shayan: Yes. I am Shayan. I'm the previous CEO and co-founder of Watchful, along with John. Did all the stuff there that you'd expect: tried to build a company, did the thing, built a product, got the company sold. Whoo, and now we work on really cool stuff at Thoughtworks, obviously. Before all of that, I used to work at Facebook. I led the stream processing team that ended up building the ads network infrastructure for all the ads calculation work that needs to happen for the advertising partners, on the order of 5 petabytes of data per day.
Then after that, I built what's called the MAD team, which is machine learning, AI, data. That was sort of to sit between FAIR, the Facebook AI Research lab, and Applied Machine Learning. All that to say, a long history in distributed systems land meets machine learning and AI. Now super focused on AI research and moving the AI industry forward. Just to maybe tee up the conversation here, our work is very centered around interpretability, evaluation, and just assessing the reliability of GenAI-based systems.
Lilly: I like that way of putting it because it cuts through a lot of the buzzwords that we've got kicking around, benchmarks, evals, tests. There's quite a lot of hype around all of these kinds of things. One thing I like about the show is that we can cut through it and talk about how we might apply something in practice. From your point of view, what's the difference between benchmarks, evals, and testing, and how do people go about putting them to practice?
Shayan: I want to first start with testing, because I think that's the most obvious foundation that everyone knows and loves. Obviously, I'm going to paint with a super broad brush here. When you talk about testing, you're usually talking about something that is, in some way, validating or invalidating some behavior. There is a known good type of behavior that you're literally testing for, and there are pass/fail criteria, in some capacity, on these tests. That's the assumption you build around things like CI/CD, and so on.
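To ground that distinction, here's a minimal sketch of the kind of deterministic, pass/fail test being described, written in Python with pytest; the `apply_discount` function is purely hypothetical and used only for illustration:

```python
# test_discounts.py -- deterministic pass/fail tests of the kind that run in CI/CD.
# `apply_discount` is a hypothetical function used purely for illustration.

import pytest

def apply_discount(price: float, percent: float) -> float:
    """Return the price after applying a percentage discount."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

def test_known_good_behaviour():
    # There is a single correct answer, so the test either passes or fails.
    assert apply_discount(100.0, 25.0) == 75.0

def test_invalid_input_is_rejected():
    with pytest.raises(ValueError):
        apply_discount(100.0, 150.0)
```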
This is just normal software testing that everyone knows and loves. Again, notwithstanding a whole bunch of other types of testing, which are all useful in their own rights. Again, just in the spirit of talking about shapes, you can think of testing as for validating broadly. Then I want to talk about benchmarks, because that is sort of the thing that everyone jumps to when you think about "testing" for GenAI.
Lilly: Can this model pass the bar?
Shayan: Yes, exactly. It's like, can this model pass the bar? It could also just be like, you have some AI salesperson knocking on your inbox and they're like, "Hey, I've got the new hotness in models and it beat GLUE, it beat this, it beat that, it beat such and such, it outperforms Llama 405B on such and such benchmark." There's a whole industry now that has been manufactured around the idea of gaming benchmarks.
We should be very clear about what a benchmark actually is. It is a data set, ostensibly. It's a data set with inputs and some criteria around scoring the outputs. That's the whole thing. The thing about benchmarks is that, by definition, they are, in some way, open, meaning people have access to that data. Meaning, in theory, if you wanted to outperform such and such benchmark (this is not to say you can do this with every benchmark, but for most benchmarks that are out there), what you can do is just make sure that data is included in your training set in some way. That, in some ways, guarantees high performance, in some capacity, on that benchmark.
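As a rough illustration of "a data set with inputs and scoring criteria," here is a minimal benchmark harness in Python; `ask_model` is a hypothetical stand-in for whichever model is being evaluated:

```python
# A benchmark, reduced to its essentials: labelled examples plus a scoring rule.

BENCHMARK = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "2 + 2 = ?", "expected": "4"},
]

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to the model under evaluation."""
    raise NotImplementedError

def run_benchmark() -> float:
    """Score = fraction of examples whose output contains the expected answer."""
    hits = 0
    for example in BENCHMARK:
        output = ask_model(example["input"])
        if example["expected"].lower() in output.lower():
            hits += 1
    return hits / len(BENCHMARK)

# Because the examples and the scoring rule are public, a model trained on this
# exact data can score highly without being any better on your use case.
```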
We should think about what benchmarks are actually good for. That's not to just hand-wave and be like, "Oh, benchmarks are completely useless." They are often useful tools. Just to be very clear, there are two core limitations to benchmarks. One is that you can't over-rely on a model's ability to perform on a benchmark just because, again, it's so easy to game. The second is that just because a model performed well on a benchmark does not necessarily mean that that same model will perform well on your use case, right?
I'm being very specific about my choice in words here when I say model versus an application, right, because when we talk about performance, we have to also consider the backdrop of the application itself. All the ways that you might manipulate a context window, the retrieval part of RAG, for instance. We have to start thinking about all of that in a broader context when we start talking about reliability, because it's not just the model that matters.
When you think about benchmarks, again, with the broad brush here, you should generally think about model to model comparison. If testing is for validating, benchmarks are for comparing, let's say, then what are evals, right? What are some examples? Once again, I keep caveating with this, but very, very broad brush.
There's so much nuance and so much gray area here, but just in the spirit of keeping things simple: in the evals world, when we talk about evals, we generally talk about metrics, or some sort of score, or something like that, some numbers that describe the properties of the thing you're trying to build. These could be directly against a model; they could also be directly against an application.
You could be measuring properties of an application where a model is part of it, or you could be testing the model itself; it just depends on what you're trying to do. There are intrinsic and extrinsic metrics, okay? Intrinsic metrics are things like ROUGE and BLEU: scores that you can calculate independent of knowing anything about the use case, anything about really anything, right? Perplexity is a metric. You don't have to know anything about, is this model trying to classify? Is it not? What's the application trying to do? You don't even know any of that stuff. These are things that you can always very easily calculate.
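To make the intrinsic case concrete, perplexity can be computed from nothing but the model's own token probabilities. A minimal sketch using the Hugging Face transformers library with GPT-2 (any causal language model would do):

```python
# Perplexity as an intrinsic metric: exp(average negative log-likelihood).
# No knowledge of the application or task is required.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy loss.
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss).item()
print(f"Perplexity: {perplexity:.2f}")
```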
Extrinsic are things like user feedback as part of an application structure. Maybe you have a thumbs up, thumbs down type of user interaction. You produce some output, and then the user says, "No, that was a bad output." That's extrinsic evaluation that you can then include as part of an internal evaluation data set, for instance. I want to be clear here. Testing is broadly for validating. Pass-fail criteria. That assumes that you know something about what good looks like.
One of the hard parts about building with LLMs is that you don't always know what good looks like, or at least you know it when you see it, and you can't really articulate how exactly it's good or bad or something like that without, again, seeing it. Testing is for validating. Benchmarks are for comparing. Evals are for understanding.
Before you're able to assert anything, you probably want to just understand what is the normal sort of set of parameters, what metrics are normal for you to see. What's abnormal? What does that look like from the perspective of almost these features that you've tooled in? You can always build tests on top of evals, right? You can have a metrics calculation step in CI/CD. Then you can test against that. You can be like, "Okay, normal operating range for perplexity needs to be between here and here." You only know that because you spent some time basically profiling your application and typical use cases.
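That "tests on top of evals" pattern might look something like the following sketch, where the operating range comes from profiling your own application; the function name and thresholds are hypothetical:

```python
# test_eval_gates.py -- a test built on top of an eval metric.
# compute_mean_perplexity is a hypothetical metrics step run against a fixed
# set of representative prompts; the bounds come from profiling the application.

PERPLEXITY_LOWER_BOUND = 5.0    # learned from profiling, not universal constants
PERPLEXITY_UPPER_BOUND = 40.0

def compute_mean_perplexity(prompts: list[str]) -> float:
    """Hypothetical eval step: run the application and average perplexity."""
    raise NotImplementedError

def test_perplexity_within_normal_operating_range():
    prompts = ["Summarise this invoice...", "Draft a reply to the customer..."]
    score = compute_mean_perplexity(prompts)
    assert PERPLEXITY_LOWER_BOUND <= score <= PERPLEXITY_UPPER_BOUND
```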
Lilly: That's something that I think is really worth looking at, because when people talk about benchmarks and evals and testing, the terminology gets confused, so I really appreciate the clear delineation between the three terms that you've spelled out for us. When that comes up in a business context, folks don't always think about how it applies to their context specifically. I think there are a lot of folks looking to generative AI as a way of helping them understand their own context, when they haven't spent that much time on, or haven't really understood the value of, looking at their own context in a way that makes it easy to measure.
When it comes to how you might apply this, not just at the level of the model itself, and you've spoken about a range of different applications for this, but when you're looking at it in the context of a whole business, and John, this is a question I have for you in particular, what do you think people do need to be thinking about for their own business in order to make these kinds of metrics effective?
John: I would say that in the world of GenAI, absolutely nothing has changed. You need to approach this just as you would any other application, GenAI or otherwise, that you're looking to bring into your business. You need to establish a way to understand what ROI looks like, what success looks like, and adhere to those metrics and understand them to their core before you start looking for a problem for this new tool.
I think a lot of people are looking at GenAI right now as a panacea, a thing they can instantly drop in for a wide variety of problems. While on the surface that is fundamentally true, when we start talking about bringing applications to production, the bar and the expectations for doing so are high; it's expensive in the enterprise. We're not talking about you or I doing some brainstorming or producing some boilerplate code. We're talking about potentially thousands, hundreds of thousands, or even millions of users, and a corresponding level of exposure and risk.
I like using this phrase, and for the non-Americans, I've learned this may be an Americanism: you have to determine, is the juice worth the squeeze? It's worth setting the stage a little here, because we went pretty deep, pretty quickly into definitions: why do we care so much about this? Why is this such a fundamentally hard problem? The fundamental thing about evaluations and LLM evals is that businesses have lost the toolkit they had in the realm of classic machine learning models, a toolkit to measure and understand model behavior: when things were working and when they weren't.
When you had a constrained output space, a classifier, for instance, classifying something as good or bad, or deciding whether this should go to accounting versus repairs, you could use tools like accuracy, recall, precision, and F1 scores to get a good understanding of model performance at any given point in time. Now, when we have a virtually infinite output space with this new tool, our friend the transformer model in large language models, we lose a lot of those same guarantees.
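As a reminder of how direct that toolkit is when the output space is constrained, here is a small scikit-learn example on toy labels:

```python
# With a constrained output space (e.g. "accounting" vs "repairs"),
# standard classification metrics give a clear picture of performance.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = ["accounting", "repairs", "repairs", "accounting", "repairs"]
y_pred = ["accounting", "repairs", "accounting", "accounting", "repairs"]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, pos_label="repairs"))
print("Recall   :", recall_score(y_true, y_pred, pos_label="repairs"))
print("F1       :", f1_score(y_true, y_pred, pos_label="repairs"))
```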
Not only that, but from the way we measure the performance of any one particular task an LLM may be performing, all the way up to the entire application's output, we have to make sure that all of these things are brutally adhering to the KPIs we have defined and care about, the ones that actually make the "juice worth the squeeze." We have to have a robust framework, one that is transparent and that we all agree upon, to understand how we're measuring performance within that context.
That's a lot of words to say: really focus on what the actual key problem is, use the right tools at the right time, and understand that if you choose this tool, which is a great fit for a set of problems in the enterprise today, it comes with a new set of muscles and a new set of actions you need to learn to take it into production.
Lilly: I'm always interested in the ways that people get this wrong because with my role in security, I'm always looking at those edge cases and the failure cases and those kinds of things. We're talking about moving into production, and I've seen quite a lot of apps that have reached the proof of concept stage that don't go beyond that point. There's, I think, a lot we can learn from the failures of those proofs of concept that can really inform what does make it to production. What should people be looking at when it comes to what goes wrong, and where do you actually see it going wrong?
Shayan: I'm going to jump in with a couple of things. When you think about evals at the moment, about what the industry is pointing to and what is currently being done, there are some intrinsic metrics, things like perplexity, that are being used. However, I think people's intuitive understanding of how to use those metrics is not always in line with reality.
For instance, perplexity is often used as a measure of uncertainty. In practice, if you're predicting one token or very small token outputs, perplexity is really a measure of things like vocabulary size and a couple of other things. Practically speaking, you might end up with a metric that's not really that interpretable, or that you're interpreting the wrong way.
To John's point, nothing's changed. We've had a very long and storied history as an industry of people misinterpreting metrics like accuracy, precision, recall, F-score, and so on. There's nothing new there, but it's just yet another thing to consider: do you know what these metrics actually do, what they're actually measuring? That's point one, and that's an obvious thing.
A second thing, which is still obvious, but perhaps less obvious, is thinking about where trust starts and ends in a system. I think LLM as a judge has become a common new way of evaluating LLM outputs and things like that. I'm not saying LLM as a judge is foundationally wrong or anything like that. It's just like, again, you have to consider, does that give you the level of coverage and reliability that you need for the given application? In a world where you have a black box, basically evaluating another black box, how have you determined that you can trust that particular assessment? There's that.
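A minimal sketch of the LLM-as-a-judge pattern being described; `call_llm` is a hypothetical wrapper around whatever judge model you choose, and the rubric is deliberately simplistic:

```python
# LLM-as-a-judge: one black box scoring another. The judge's own reliability
# still has to be established before its scores can be trusted.

JUDGE_PROMPT = """You are grading an answer for factual correctness.
Question: {question}
Answer: {answer}
Reply with a single integer from 1 (wrong) to 5 (fully correct)."""

def call_llm(prompt: str) -> str:
    """Hypothetical call to the judge model (e.g. via your provider's SDK)."""
    raise NotImplementedError

def judge_answer(question: str, answer: str) -> int:
    reply = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    try:
        return max(1, min(5, int(reply.strip())))
    except ValueError:
        return 1  # unparseable judge output is itself a signal worth logging
```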
I think the third is on disambiguating preference versus desired outcome, if that makes sense. Let's say that you have an extrinsic evaluation, like your users have indicated that they thumbed down a bunch of responses. What does that actually mean? Does it mean that those outputs are factually incorrect, perhaps the thing that you're trying to guard against? Does it mean that they just didn't like the way it was formatted, for instance? Just collecting the right level of information from these different sources is super important.
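One way to collect that richer signal is to capture a structured reason alongside the thumbs up or down, as in this hypothetical sketch:

```python
# A thumbs-down alone doesn't say whether the output was factually wrong or
# just formatted in a way the user disliked; capturing a reason disambiguates.

from dataclasses import dataclass
from enum import Enum
from typing import Optional

class FeedbackReason(Enum):
    FACTUALLY_INCORRECT = "factually_incorrect"
    BAD_FORMATTING = "bad_formatting"
    OFF_TOPIC = "off_topic"
    OTHER = "other"

@dataclass
class Feedback:
    response_id: str
    thumbs_up: bool
    reason: Optional[FeedbackReason] = None  # only asked for on thumbs-down
    comment: str = ""

# Later, these records can be folded into an internal evaluation data set,
# separating "we got the facts wrong" from "the user preferred a different style".
```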
All that to say, if you squint at this, this is looking and sounding a lot more like a data science problem, right? It's almost shifting away from classical engineering-style testing with pass/fail criteria and so on. Now we're going into many layers of required interpretation of these metrics and what they really mean in practice, which again should not be surprising, because AI has historically been a data-science-dominated field. It makes sense that the understanding of these AI systems, at the moment, lives in data science land. I would just call those three things out. John, do you have anything else you want to add to that?
John: I think it was pretty comprehensive.
Shayan: Yes, thank you.
Lilly: One word that we haven't brought into this discussion yet, but does get used a lot in the context of evals and benchmarks is guardrails. Where do you see those sitting in this suite of buzzwords, and what is the practical application? How are they different? Why are they not part of what you're looking at when it comes to testing?
Shayan: It's a very, very good question. Practically speaking, you can think of guardrails as remediation, right? For instance, in the case of a lot of AI applications, not only do you want to know about failures, ideally very, very early, but in a lot of cases, you won't know, you won't be able to necessarily predict all failures. You're exploring an infinite space at that point and you're just not going to get that far.
Practically speaking, you try and do as much evaluation as you possibly can. You try and get as much coverage as you possibly can, but there's obviously a diminishing return to that. The next thing you do is you say, "Okay, I know I can't possibly predict every possible future, but what I can do is put something alongside this thing when we put it into production, basically a sidecar, for instance, to the model or the application, that keeps it on the straight and narrow."
Now, the reason why that's not testing, or part of evaluation and measuring and so on, is because that's not what it does: it remediates. If something breaks such and such rules or guidelines or guardrails, if you will, then we basically kill the entire chain. We just say, no, don't do that. We perhaps try again. We could force a retry. We could try and inject safety stuff into it. There's a wide variety of different shapes of remediation right now.
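A rough sketch of what that kind of sidecar remediation loop can look like in practice; `generate` is a hypothetical call to the model or application being guarded, and the rules are purely illustrative:

```python
# A simple guardrail sidecar: regex/string rules applied to the output,
# with a bounded retry as the remediation step.

import re

BLOCKED_PATTERNS = [
    re.compile(r"\b\d{16}\b"),                  # looks like a card number
    re.compile(r"(?i)as an ai language model"),  # a style rule, as an example
]

def generate(prompt: str) -> str:
    """Hypothetical call to the model or application being guarded."""
    raise NotImplementedError

def violates_guardrails(text: str) -> bool:
    return any(pattern.search(text) for pattern in BLOCKED_PATTERNS)

def guarded_generate(prompt: str, max_retries: int = 2) -> str:
    for _ in range(max_retries + 1):
        output = generate(prompt)
        if not violates_guardrails(output):
            return output
    # Remediation of last resort: kill the chain rather than return bad output.
    return "Sorry, I can't help with that request."
```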
I will say, though, that from a research perspective, we are interested in understanding this entire topology, if you will. We believe that evaluating and measuring is just one side of the coin, with remediation the other. The question we have in our minds is: can we use the same apparatus both for measuring and for remediation in various ways? As you make your evaluation more and more comprehensive, can you also make your guardrails more and more comprehensive in the same ways? That's one direction that we're going in.
Another thing that we're trying to get a better sense of is: what if guardrails were not a sidecar? Is it possible to put them into the model architecture in some way, or somewhere a little bit closer to the actual underlying operating mechanisms of the model, such that you get a richer ability to guard? In the case of guardrails today, oftentimes what you're talking about is a stack of regexes, maybe coupled with some other string-based rules, and possibly also coupled with another LLM that is involved in some process. You get something that's, in a lot of ways, duct tape and bubble gum, a Frankenstein's monster of a sidecar.
Whereas we think there may be something more holistic that can be done. In the case of a white-box model, for instance, where you're running something like Llama, what if we can target the specific circuit of activations in the model that causes certain types of behavior we find undesirable? Then we just dial that down. Then no matter what the model does, it won't ever activate that particular transformer circuit, which means we won't get the undesirable outputs, just as an example. We think it's all part of the same question, but I think at an industry level, those two things haven't really been connected yet.
Lilly: What you're talking about there is a really specific research question that I find pretty exciting. If somebody wants to go and apply this thing in their context at this point of time, is that something that can be pulled out, or is it an evolving space? Where should we be looking for these kinds of conversations, and where can we be exploring this in our own context?
Shayan: On that particular topic: A, no, unfortunately, there's nothing off the shelf that can be pulled in, and there's no magic yet. I think a lot of people are working on it, not just us. This field is broadly called mechanistic interpretability, and it has various jumping-off points. Obviously, it's not just about interpretability; it's also about alignment in a lot of ways: how do we align models to the outcomes that we desire? Those are the broad keywords.
Now, obviously, you should look to us as well. I'm hoping that we're going to put out some good research in this space. Obviously, it's going to take a little bit of time. Other than that, Anthropic's mechanistic interpretability research group is doing really interesting stuff. I would also say that EleutherAI, which is less a company and more a loose cohort of researchers, has been putting out some phenomenal mechanistic interpretability and alignment work recently. They're worth looking at as well.
In terms of contextualizing it, that's going to be an exercise for the reader, unfortunately. All of these things are so specific, and we're barely scratching the surface of what is even happening under the hood with these models, that anything beyond just looking at existing research is itself a research topic.
Lilly: John, from the stuff that you've seen, from the folks that you've spoken to, what approaches are people taking to get these conversations started when it comes to understanding how a model or a set of models or a set of architectures can work in their context? In some cases, getting the buy-in from the people around them at their businesses and organizations to make it a conversation that is understood at all levels.
I realize we've been speaking a lot here about some fairly low level detail. We also know that we need to extrapolate that and look at the high level business context if we're going to apply it in a way that makes sense and is effective. How are folks starting these conversations, and where would you recommend that people begin if they're at the beginning of this journey, if they listen to this going, "Yes, this is it. This is what we need." Where do we start with that?
John: Ultimately, I think we need to recognize that the buyer for AI has changed. What we call AI today has changed from what it was 5, let alone 10-plus, years ago. Now we have line-of-business owners, even product managers, who have significant budgets and a lot of wherewithal to select tools and tech stacks for what may or may not work. Ultimately, they have a lot of power.
I think it's important to recognize that bringing those business stakeholders along that may not have the depth and breadth of experience within ML and an understanding of what even the art of the possible looks like today is probably one of the most important things. I think one of my favorite things to say is, "No, I don't think that's reasonable."
Basically, you want whoever it is, in our context the client, but in the listener's context maybe your coworker or your manager or someone like that, to get a deep understanding of what the business context is, what application we're actually looking to solve for here, and what level of trust and reliability, to Shayan's point earlier, is necessary to bring this into production. Then compare that against your existing risk profile and how you measure and evaluate that today.
It's making sure that these business stakeholders, who again may not have the necessary technical chops or deep hands-on experience, understand what is and isn't possible, despite whatever they played with on ChatGPT the night before and then came in with as a new project for everybody for the next quarter. Really, and I feel like a broken record, it's making sure that everybody's brought along and aligned on the business KPIs that actually matter, then squaring that up against what's actually reasonable and possible, and effectively measuring and evaluating the risk for that particular application.
Lilly: I mentioned at the top of the show that there's this renewed interest and investment in quality assurance and testing and all of these kinds of things. We know that we're talking about evals, about benchmarks, about all of these things, and that, in a seemingly coordinated fashion, they've risen up out of the adjacent possible to become what they are at the moment. What do you think is really important about why this is happening right now? What does it mean for the industry? What does it mean for the questions we're grappling with at the moment? Where do you think we're going?
Shayan: Okay, there are a couple of pieces to this. One, not to overblow it too hard here, but there's sort of an AI existential threat, maybe two. There's an existential threat to the industry. Then, for the doomers out there, so to speak, there's the existential threat that AI poses to humanity, right?
I say that one a little bit tongue in cheek, but practically speaking, we really do need to think about the ways in which we as humans interpret whether an AI system is working as expected. What does that actually mean? How do we bring that to the average human user, a non-data-science expert who can't reason about metrics like perplexity and so on, but who really does want to understand: is there potentially an issue with the last thing the LLM said to me? Or, ideally, before the LLM or the AI does something wrong, that gets caught, or I at least have some window into how reliable my system is.
Let me sort of start with the existential impact to the industry. Obviously there's a ton of dollars being spent at various levels in the AI industry as a whole at the moment. All the way up and all the way down the stack, like at the very, very bottom, you've got chip manufacturers. Arguably, there are folks even under that, but let's just start there. Very large and powerful companies now. You've got model developers. Think about all the various vendored models that currently exist and who is building them and how much money they've all raised. There's an insane number of dollars that have all been collectively pooled for that type of thing.
Then you've got vertical AI companies, right? Companies utilizing AI for specific use cases in specific industries. If you think about that spectrum, and obviously there are missing pieces of a more holistic picture here, but at least looking at those three segments: they command many, many billions, hundreds of billions of dollars, I don't want to promise anything, but probably nearing, if not exceeding, a trillion at the moment, if we're including market cap, right? That's sort of what we're looking at at the moment.
Now, what do evals, benchmarks, and tests really get you? In a perfect world, it's trust, right? It's that a human or a set of humans can trust that a system is working as expected. Now, at the moment, we have a ton of POCs that are being built, a ton of different experiments that are being tried, a bunch of tiny little credit card swipes and not a whole lot of floods of cash that are coming in from specific use cases and specific ROI.
We have this holding pattern of POC to production. That's not to say all of this is because we haven't yet figured out what the correct metrics are and how to measure AI overall, but that's certainly part of it. We, as people who build things like POCs, can put a finger in the air and do a litmus test, if you will, a little vibe check, and be like, "Oh, this looks generally right." But we lack the conviction that this is ready for production, because it might be risky in various ways, and the risk is, at the moment, very difficult to quantify.
What does this really mean? It means that we've got a ton of things that are basically waiting on the sidelines before they can really be put into production, before ROI can be realized. Now, it's not just that people need the conviction to put it into production. Once it's in production, it needs to be tuned. It needs to be corrected for behaviors, and those behaviors need to be identified. There's an iteration loop that is also missing and that isn't really talked about, but validation, or at minimum evaluation, is part of that. Think about classical REPLs, for instance, in software engineering: you can't really loop unless you've evaluated. All of that is happening.
Now, at an industry level, let's say this doesn't get fixed. All POCs remain POCs. Things don't move to production. That means AI can't be used for the most important problems in the world, where theoretically it could be used for things like massive drug discovery or reinventing interesting fraud detection schemes. There's a wide variety of interesting use cases that might require deep context and really deep integration with AI that just won't come to fruition. ROI won't be seen. That means a bunch of companies that have raised a ton of money, and the people they raised that money from, are all out on the streets, if you will. There's an existential risk at an industry level.
Then, again, at the very end of this, we've got the relationship between humanity and AI. I don't want to overblow this, but there is something to be said about how do we make sure that a largely autonomous system that we treat as a black box is operating as expected, right? A very complicated black box, one that can do a great many things, right? How do we know that it is working as we expect it to? What is it that we expect it to do?
These are existential questions that need to be answered in some capacity, and they need to be answered not only in the context of a use case, but also in the more general context of what we as humans expect out of AI. That's the level at which we're operating as an industry and as researchers: can we answer both questions in one fell swoop? Are we perhaps asking the same set of questions, just with different parameters?
John: Well, that was beautifully put. I guess the only thing I would really add is that, at its core, we're using a lot of the same words, AI reliability, safety, trust, et cetera, while there are different meanings between them. Ultimately, at their core, all of these things are just fundamentally a different form of measurement. It's understanding: can we as humans, in whatever context we apply these new tools, understand what's going on?
Without that fundamental reasoning ability, that understanding, that measurement, and LLMs and transformer models bring a unique complexity to that space, we can't have all of these other things that are so critical, from "my credit card works better" all the way to the more heady and scary stuff Shayan mentioned around how we as a society interact with AI and what humans do in a post-AI world. We are giving these systems more and more of not only our information and data, but even our daily interactions, and we need to be able to build that trust.
Lilly: There's a whole lot more that we could talk about here. Unfortunately, we are out of time for it. I want to thank you both so much for joining me for this episode of the Thoughtworks Technology Podcast. Have a great day.
John: Thanks so much.
Shayan: Thank you. Thank you for having us.
John: It was fun.