Many systems we build have two key characteristics: they can provide answers to questions about a large data set, and it is next to impossible to follow how they arrived at those answers. Despite this opacity, we still want to assess and improve the quality of the responses. With the LLM as a judge pattern, we use an LLM to evaluate the responses of another system, which in turn might be based on an LLM. We've seen this pattern used to evaluate the relevance of search results in a product catalog and to assess whether an LLM-based chatbot was guiding its users in a sensible direction. Naturally, the evaluator system must be set up and calibrated carefully. It can drive significant efficiency gains, which, in turn, translate to lower costs. This is an ongoing area of research, with the current state summarized in this article.
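To make the pattern concrete, here is a minimal sketch of judging search-result relevance, along the lines of the product catalog example above. It assumes the OpenAI Python SDK; the model name, rubric, scoring scale and sample data are illustrative assumptions, not a prescribed implementation.

```python
# LLM-as-judge sketch: score the relevance of search results to a query.
# Assumes the OpenAI Python SDK; model, rubric and scale are illustrative.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating a product search system.
Query: {query}
Returned product: {product}

Rate the relevance of the product to the query on a scale of 1 (irrelevant)
to 5 (highly relevant). Respond as JSON: {{"score": <int>, "reason": "<short explanation>"}}"""


def judge_relevance(query: str, product: str) -> dict:
    """Ask the judge LLM to score one (query, result) pair."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; calibrate before trusting its scores
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(query=query, product=product)}],
        response_format={"type": "json_object"},
        temperature=0,  # keep judgments as reproducible as possible
    )
    return json.loads(response.choices[0].message.content)


if __name__ == "__main__":
    # Score a small sample of (query, result) pairs produced by the system under test.
    samples = [
        ("waterproof hiking boots", "Men's leather trail boots, waterproof membrane"),
        ("waterproof hiking boots", "Cotton beach towel, 70x140cm"),
    ]
    for query, product in samples:
        verdict = judge_relevance(query, product)
        print(f"{query!r} -> {product!r}: {verdict['score']} ({verdict['reason']})")
```

In line with the calibration caveat above, such a judge's scores would typically be checked against a small set of human-labeled examples before being trusted at scale.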