Large language models (LLMs) are at the forefront of innovation, particularly in the realm of generative AI (GenAI). Yet, as organizations race to adopt these models, a significant challenge emerges: evaluating whether these LLMs are performing as intended and avoiding undesirable outcomes. This task, known as LLM evaluation (LLM evals), is one of the biggest hurdles organizations face when moving generative AI proofs of concept (POCs) to production.
In our recent Tech Horizons executive webinar, experts from NVIDIA, UBS Investment Bank and Thoughtworks discussed the challenges and potential solutions for organizations looking to harness LLMs effectively.
Major challenges the industry is facing today
The “demo is easy, production is hard” problem
A primary challenge in deploying LLMs is ensuring they consistently deliver accurate and reliable outputs. Aaron Erickson, senior engineering manager at NVIDIA, pointed out that while the initial demos may seem impressive, with capabilities like those of ChatGPT that exude “great executive presence,” the reality in production can be starkly different. Ninety percent accuracy might sound good, but in most real-world use cases it simply isn’t good enough for production: a model that is wrong ten percent of the time can prove “dangerous.” Implementing effective evaluation methods for LLMs is therefore a crucial part of ensuring their reliability and performance.
The black box conundrum
While the concern regarding the “black box” nature of AI has been a topic of discussion for a long time, the emergence of LLMs has put this issue in a new light. As Shayan Mohanty, head of AI research at Thoughtworks, stated, unlike earlier machine learning models, which were easily accessible and could be closely monitored, today’s LLMs are often accessed through APIs, making them far less transparent. For the first time, organizations are faced with truly opaque systems that they are not permitted to monitor, and therefore cannot fully understand, creating practical dangers as they strive to harness their capabilities.
Balancing innovation with regulation
For highly regulated industries like financial services, the stakes are even higher. Musa Parmaksiz, head of AI and data center excellence at UBS, highlighted that beyond technical accuracy, there is a need to comply with stringent regulatory standards around what can be put in production. However, these frameworks were not designed with LLMs in mind, posing a unique set of challenges as regulators and industry experts collaborate to find a middle ground.
When complexity meets autonomy
As Musa put it, “with the introduction of GenAI and large language models, it’s a different paradigm.” The sheer complexity of modern LLMs, which possess trillions of parameters, makes it extremely difficult for organizations to interpret what these models are doing, and in turn, trust their output. In addition, the growing trend of these models making decisions without human intervention creates fresh challenges in ensuring the safe and ethical use of AI.
Potential solutions for LLM evaluation
While the field of LLM evaluation is still nascent, there are several promising avenues that offer hope for more reliable and robust models:
Trusting the “judge”
One popular strategy for enhancing the reliability of LLMs involves using one model to evaluate another. This method isn’t entirely autonomous and often requires some degree of human oversight to ensure accuracy and reliability. In this process, trust and intuition play crucial roles in selecting which model will act as the “judge”. The chosen model evaluates the output of another, providing a layer of scrutiny that can enhance confidence in the results. However, the success of this strategy largely depends on the quality of both the judging and judged models, as well as the ability of the overseeing human to interpret and act on the evaluations correctly.
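To make the idea concrete, here is a minimal sketch of the judge pattern. It assumes the OpenAI Python SDK (v1.x) and an OpenAI-style chat completion API; the model names, rubric and escalation threshold are illustrative assumptions, not recommendations from the webinar.

```python
# Minimal LLM-as-judge sketch. Assumes the OpenAI Python SDK (>=1.0) and an
# OPENAI_API_KEY in the environment; model names and rubric are illustrative.
from openai import OpenAI

client = OpenAI()

JUDGE_RUBRIC = (
    "You are an impartial evaluator. Score the ANSWER to the QUESTION on a "
    "1-5 scale for factual accuracy and relevance. Reply with the score only."
)

def generate_answer(question: str, model: str = "gpt-4o-mini") -> str:
    """The candidate model under evaluation produces an answer."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

def judge_answer(question: str, answer: str, judge_model: str = "gpt-4o") -> int:
    """A second (ideally stronger) model scores the candidate's answer."""
    resp = client.chat.completions.create(
        model=judge_model,
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {"role": "user", "content": f"QUESTION: {question}\nANSWER: {answer}"},
        ],
    )
    return int(resp.choices[0].message.content.strip())

question = "What is the settlement cycle for US equities?"
answer = generate_answer(question)
score = judge_answer(question, answer)
if score < 4:  # low-scoring answers are escalated to a human reviewer
    print(f"Flag for human review (score={score}): {answer}")
```

The human oversight discussed above shows up at the escalation step: rather than trusting the judge blindly, low scores route the output to a person, and spot checks of high scores help calibrate how much the judge itself can be trusted.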
Breaking down the black box
Carlos Villela, senior software engineer at NVIDIA, advocated a pragmatic approach to LLM evaluation, one that focuses on understanding and controlling the LLM’s outputs. This involves treating the LLM as a black box and constraining its outputs as much as possible. When the team wanted the model to do more, instead of growing the list of actions presented to that black box, they built more black boxes. The result was a system of agents collaborating in a group chat rather than a single agent prompted to do everything.
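The sketch below illustrates the constraint-first idea in plain Python. It is an assumption-laden illustration, not the architecture NVIDIA described: each “agent” wraps a model call behind a small, fixed vocabulary of allowed actions, and new capability is added as another narrowly scoped agent on a shared transcript rather than by widening any single agent’s action list.

```python
# Illustrative sketch only: each "agent" is a narrowly scoped black box with a
# small, fixed set of allowed actions; new capability becomes a new agent.
# Names and structure are assumptions, not the implementation described by NVIDIA.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    name: str
    allowed_actions: set[str]      # the only outputs this agent may emit
    act: Callable[[str], str]      # stand-in for a prompt to the underlying LLM

def constrain(agent: Agent, raw_output: str) -> str:
    """Reject anything outside the agent's fixed action vocabulary."""
    if raw_output not in agent.allowed_actions:
        raise ValueError(f"{agent.name} produced a disallowed action: {raw_output!r}")
    return raw_output

# Two constrained agents collaborating over a shared "group chat" transcript.
router = Agent("router", {"classify", "summarize"}, act=lambda msg: "classify")
classifier = Agent("classifier", {"billing", "support"}, act=lambda msg: "billing")

transcript = ["user: my invoice looks wrong"]
next_step = constrain(router, router.act(transcript[-1]))
transcript.append(f"router: {next_step}")
label = constrain(classifier, classifier.act(transcript[0]))
transcript.append(f"classifier: {label}")
print(transcript)
```

Keeping each agent’s output space small is what makes evaluation tractable: a disallowed action is detected immediately, rather than surfacing later as an unexplained failure of one sprawling prompt.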
Geometrical and topological insights
Based on his research, Shayan also highlighted geometrical and topological approaches that are showing promising results in the field of LLM evals. By applying geometrical and topological methods to investigate high dimensional embedding spaces, we may be able to evaluate an LLM by embedding its text and exploring the properties of the lower dimensional shapes those embeddings lie on. For instance, it may be possible to define “appropriate” and “inappropriate” responses from an LLM based on where in the high dimensional space they happen to embed (a simplified sketch of this idea follows the list below).
Topological analysis may also tell us what the model is likely doing as a black box. These types of metrics may be fed back to the model to indicate whether it is “on the right track” without requiring human intervention. This way, GenAI applications may:
Work nearly autonomously, independent of human intervention
Implement higher level guardrails (requiring less heuristic definition by humans)
Build confidence in the minds of stakeholders.
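As promised above, here is a heavily simplified sketch of the embedding idea. It is not the research method itself: it assumes the sentence-transformers and scikit-learn libraries, embeds a handful of known-good responses, projects them to a lower dimensional space, and flags a new response that lands far from that “appropriate” region.

```python
# Rough sketch of embedding-based response checking, under stated assumptions:
# sentence-transformers for embeddings, scikit-learn PCA for dimensionality
# reduction, and a naive distance-to-centroid test standing in for real
# geometric/topological analysis.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

embedder = SentenceTransformer("all-MiniLM-L6-v2")

appropriate = [
    "Your account balance is shown in the mobile app under Accounts.",
    "I can help you reset your online banking password.",
    "Transfers between your own accounts are usually instant.",
    "Please contact support to dispute a card transaction.",
]
candidate = "Sure, here is how to bypass the bank's identity checks."

# Embed reference responses and project onto a lower dimensional subspace
ref_vectors = embedder.encode(appropriate)
pca = PCA(n_components=2)
ref_lowdim = pca.fit_transform(ref_vectors)
centroid = ref_lowdim.mean(axis=0)
radius = np.linalg.norm(ref_lowdim - centroid, axis=1).max()

# Score the candidate by where it lands relative to the "appropriate" region
cand_lowdim = pca.transform(embedder.encode([candidate]))[0]
distance = np.linalg.norm(cand_lowdim - centroid)
print("flag for review" if distance > radius else "looks on track", round(float(distance), 3))
```

A distance threshold is a crude stand-in for the richer shape-based properties the research explores, but it shows how a purely geometric signal could act as a guardrail without a human reading every response.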
The path forward to production
In light of the current industry challenges and limitations of LLM evaluation, a question remains: what’s the path forward to ensuring the reliability and effectiveness of these models? Our AI experts shared these insights:
Demonstrate continuous improvement in LLM accuracy through rigorous evaluation, using techniques like fine-tuning to show that the model is clearly improving.
Acknowledge that LLMs are probabilistic systems, not deterministic ones. This means that organizations need to re-assess their baseline and consider what’s an acceptable risk level, much like the approach used in self-driving cars. As Aaron said, “If an autonomous system is making decisions, one of the first things you’re going to do is to have a human in the loop as it starts to get better and better. You only get to full autonomy once you’ve been able to validate that it’s at a threshold better enough than humans.”
Balance the need for innovation with regulation. For the first time, regulators in the financial services industry are collaborating with industry players to come to a common understanding on problems that can be solved with LLMs, and identify cases where organizations still need to do additional testing and research.
Investing in LLM success
The ultimate success of AI applications hinges on effective LLM evaluation. While many challenges in LLM evals remain unsolved, progress is being made, and organizations have a unique opportunity to start exploring potential solutions in this space. By adopting a proactive approach to evaluation, organizations can mitigate risks and ensure the successful deployment of their AI solutions.
Want to go deeper? Watch the full webinar on demand.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.