
How to evaluate an LLM system

A guide to implementing AI evals

Evaluating large language model (LLM) based applications is inherently challenging due to the unique nature of these systems. Unlike traditional software applications, where outputs are deterministic and predictable, LLMs generate outputs that can vary each time they are run, even with the same input. This variability arises from the probabilistic nature of these models, which means there is no single correct output for any given input. Consequently, testing LLM-based applications requires specialized evaluation techniques — known today as ‘evals’ — to ensure they meet performance and reliability standards.

 

Why are evals so critical?

 

There are a number of reasons AI evals are so important. Broadly speaking, they are valuable in four key ways:

 

1. They establish performance standards.

Evaluation helps establish performance standards for LLM systems, guiding the development process by providing directional outcomes for design choices and hyperparameters. By setting benchmarks, developers can measure the effectiveness of different approaches and make informed decisions to enhance the model’s performance.

 

2. They can help ensure consistent and reliable outputs.

Consistency and reliability are vital for the practical deployment of LLM systems. Regular evaluations help identify and mitigate issues that could lead to unpredictable or erroneous outputs. Ensuring the system produces stable and dependable results builds trust and confidence among users and stakeholders.

 

3. They provide insight to guide improvement.

Continuous evaluation provides valuable insights into how the LLM system is performing. It highlights areas where the system excels and where it falls short, offering opportunities for targeted improvements. By understanding the strengths and weaknesses of the model, developers can refine and optimize the system for better performance.

 

4. They enable regression testing.

When changes are made to an LLM system — whether in prompts, design choices or underlying algorithms — regression testing becomes essential. Evaluation ensures that these changes do not deteriorate the quality of the output. It verifies that new updates maintain or enhance the system’s performance, preventing unintended consequences and preserving the integrity of the application.


Evaluating LLM systems can be broadly divided into two categories: pre-deployment evaluations and production evaluations. Each category serves distinct purposes and is crucial at different stages of the development and deployment lifecycle.

 

Pre-deployment evaluations

 

Pre-deployment evaluations focus on assessing LLM systems during the development stage. This phase is critical for shaping the performance and reliability of the system before it goes live. Here’s why pre-deployment evaluations are essential:

 

1. Performance measurement and benchmarking:

During the development stage, evaluating your LLM system provides a clear measure of its performance. By using a variety of metrics and evaluation techniques, developers can benchmark the system’s capabilities. This benchmarking helps in comparing different versions of the model and understanding the impact of various architectural and design choices. By identifying strengths and weaknesses early on, developers can make informed decisions to enhance efficiency, accuracy, and overall performance.

 

2. Ensuring regression-free updates:

As the system undergoes continuous development, changes in the codebase, model parameters, or data can inadvertently introduce regressions — unintended reductions in performance or accuracy. Regular pre-deployment evaluations help ensure each modification improves or at least maintains performance standards.

 

 

How to perform a pre-deployment evaluation

 

To perform a pre-deployment evaluation, here are the steps you need to follow:

 

 

Create a ground truth dataset for evaluation

 

The first and perhaps most critical step in evaluating LLM systems is creating a robust ground truth dataset. This dataset comprises a set of question-answer pairs generated by expert human users. These essentially serve as a benchmark for evaluating the LLM’s performance.

 

Ground truth data is essential because it provides a reference point against which the model’s outputs can be compared. It should be representative of the type of questions that end users are likely to ask in production and include a diverse range of possible questions to cover different scenarios and contexts. 

 

Creating ground truth data requires the expertise of human users who have a deep understanding of the business domain and user behaviors. These experts can accurately predict the kinds of questions users will ask and provide the best answers. This level of understanding and contextual knowledge is something LLMs, despite their advanced capabilities, may lack.

 

Can LLMs create a ground truth?

 

While LLMs can assist in generating ground truth data, they should not be solely relied upon for this task. Here’s why:

 

  • They don't understand user behavior:

LLMs do not understand user behavior and the specific context of your business domain. They can generate plausible questions and answers, but these may not accurately reflect the types of queries your users will ask or the answers that will be most useful to them.

  • They need human oversight:

Human experts are necessary to review and refine the questions and answers generated by LLMs. They ensure the dataset is realistic, contextually accurate and valuable for end users.

  • It's vital to ensure quality and relevance:

The quality of the ground truth dataset is paramount. Human oversight guarantees that the questions and answers are not only relevant but also adhere to the business’s standards and user expectations.

 

A good ground truth dataset for a RAG application includes, alongside each query and reference answer, the passages from the knowledge base that are relevant to that query.
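As an illustration, here is a minimal sketch of what such a dataset might look like in Python (the field names and example entries are assumptions, not a required schema):

```python
# A minimal, illustrative ground truth dataset for a RAG application.
# Each entry pairs a realistic user query with a reference answer written by a
# domain expert and the knowledge-base passages that support it.
ground_truth = [
    {
        "query": "What is the notice period for cancelling my subscription?",
        "reference_answer": "You can cancel at any time; access continues until "
                            "the end of the current billing cycle.",
        "relevant_passages": [
            "Subscriptions can be cancelled at any time from the account page.",
            "Access continues until the end of the current billing cycle.",
        ],
    },
    {
        "query": "Do you support single sign-on?",
        "reference_answer": "Yes, SSO via SAML 2.0 is available on the Enterprise plan.",
        "relevant_passages": [
            "Enterprise plan features include SAML 2.0 single sign-on.",
        ],
    },
]
```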

 

Identify the relevant metrics for your LLM system

 

Selecting the appropriate evaluation metric is crucial for assessing the performance of LLM systems. The choice of metric depends on the specific use case of the LLM system, because different applications may require different aspects of the model’s performance to be measured. 

 

Here are some sample evaluation metrics and their definitions:

 

  1. Answer relevancy

  • Definition: This metric measures how relevant the provided answer is to the given question. It evaluates whether the response directly addresses the query and provides useful and pertinent information.

  • Importance: Ensuring that the model’s answers are relevant helps maintain user satisfaction and trust in the system. Irrelevant answers can confuse or frustrate users, diminishing the value of the application.

     

  2. Coherence

  • Definition: Coherence assesses the logical flow and clarity of the generated text. It checks whether the response is internally consistent and makes sense as a whole.

  • Importance: Coherent responses are easier for users to understand and follow. This metric is vital for applications where clarity and comprehensibility are essential, such as customer support or educational tools.

     

     

  3. Contextual relevance

  • Definition: This metric measures how well the model’s output aligns with the broader context provided. It evaluates whether the response appropriately considers the surrounding text or conversation.

  • Importance: Contextual relevance ensures that the model’s responses are appropriate and meaningful within the given context. This is critical for maintaining the continuity and relevance of conversations or content.

     

     

  4. Responsibility metrics

  • Definition: Responsibility metrics assess the ethical and appropriate nature of the model’s output. This includes checking for biases, harmful content and compliance with ethical standards.

  • Importance: Ensuring responsible AI usage is crucial to prevent the spread of misinformation, harmful stereotypes, and unethical content. These metrics help build trust and ensure that the LLM system adheres to societal and ethical norms.

     

  5. RAG evaluation metrics

For retrieval-augmented generation (RAG) systems, the core metrics cover both the generation and the retrieval components:

Generation metrics

  • Faithfulness: How factually accurate the generated answer is.

  • Answer relevancy: How relevant the generated answer is to the question.

Retrieval metrics

  • Context precision: The signal-to-noise ratio of the retrieved context.

  • Context recall: Whether all the relevant information required to answer the question was retrieved.

Task-specific metrics

 

While the above sample metrics can be used across use cases and tasks, you will also need metrics attuned to the particular task you are performing. Task-specific metrics evaluate the model’s performance against the specific requirements of the application; examples include metrics for summarization, translation and sentiment analysis.

 

Why task-specific metrics are necessary

 

Most evaluation metrics are generic and provide a broad assessment of the LLM system’s performance. However, to understand how well the system performs specific tasks, custom metrics tailored to the task at hand are needed. Task-specific metrics offer detailed insights into the model’s effectiveness in implementing particular functionalities, ensuring the LLM system meets the unique requirements of each application.

 

For example, for abstractive summarization tasks, Kryscinski et al. (2019) propose the following:

  • Fluency: How grammatical and readable the summary is.

  • Coherence: How well the summary flows and connects ideas.

  • Consistency: Whether the summary is factually consistent with the source.

  • Relevance: How well the summary captures the key ideas of the source.
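As one concrete example of a code-based, task-specific metric, the sketch below computes ROUGE scores for a generated summary against a reference summary using the rouge-score package. This is only a rough proxy for the relevance dimension above; dimensions such as consistency or coherence typically need model-based or LLM-based scoring instead.

```python
# Sketch: a simple, code-based, task-specific metric for summarization.
# ROUGE measures n-gram overlap between a generated summary and a reference;
# it approximates relevance, not consistency or coherence.
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

reference = "The report recommends cutting cloud spend by consolidating accounts."
generated = "The report suggests consolidating cloud accounts to reduce spend."

scores = scorer.score(reference, generated)
print(scores["rouge1"].fmeasure, scores["rougeL"].fmeasure)
```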

Calculate the score for each of your defined metrics against ground truth

 

The next step in evaluating your LLM system involves calculating the scores for each defined metric against your ground truth. For each question in your ground truth dataset, use the answer generated by the LLM system to compute the respective metric. If one or more metrics yield unsatisfactory results, make necessary adjustments to your LLM system to improve those metrics. Libraries like DeepEval and Relari-ai use NLP libraries to compare your LLM responses against the ground truth and calculate these metrics. The metrics are calculated by leveraging LLMs, other NLP models or traditional code functions. 
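As a rough sketch of what this can look like with DeepEval (assuming its LLMTestCase and evaluate API; check the current documentation for exact metric names and parameters), where my_rag_app is a hypothetical wrapper around your own LLM system and ground_truth is the dataset sketched earlier:

```python
# Sketch: scoring LLM outputs against a ground truth dataset with DeepEval.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, ContextualRecallMetric
from deepeval.test_case import LLMTestCase

test_cases = []
for item in ground_truth:  # the ground truth dataset sketched earlier
    actual = my_rag_app.answer(item["query"])  # hypothetical call into your LLM system
    test_cases.append(
        LLMTestCase(
            input=item["query"],
            actual_output=actual["answer"],
            expected_output=item["reference_answer"],
            retrieval_context=actual["retrieved_passages"],
        )
    )

# Each metric is scored per test case; thresholds define what counts as a pass.
evaluate(
    test_cases=test_cases,
    metrics=[
        AnswerRelevancyMetric(threshold=0.7),
        FaithfulnessMetric(threshold=0.7),
        ContextualRecallMetric(threshold=0.7),
    ],
)
```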

 

It is important to make metric-driven decisions about the design of your LLM system based on the observed scores. For example, low recall for questions expecting short, factual answers might require you to reduce your chunk sizes, while low precision even at high values of K might benefit from reranking the retrieved chunks. Similarly, the different elements of your LLM workflow, such as prompts, inference parameters, chunking strategy, retrieval mechanisms and choice of embeddings, should be optimized based on the metrics.

 

There is an emerging trend to use a strong LLM (e.g., GPT-4) as a reference-free (ground-truth-free) evaluator of generations from other LLMs. This is sometimes referred to as “LLM-as-judge.” The G-Eval framework is a good example: its authors argue that GPT-4, used via this framework, correlates strongly with human evaluators. Similar arguments have been made in the Vicuna and QLoRA papers.
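A minimal sketch of the LLM-as-judge pattern, using the OpenAI client as the evaluator (the model name, rubric and scale are assumptions; G-Eval itself adds auto-generated evaluation steps and probability-weighted scoring on top of this basic idea):

```python
# Sketch: using a strong LLM as a reference-free judge of another model's output.
# The judge model name and the rubric are assumptions; adapt them to your setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_coherence(question: str, answer: str) -> int:
    """Ask the judge model to rate coherence on a 1-5 scale, without any ground truth."""
    prompt = (
        "You are evaluating the coherence of an answer.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Rate the coherence of the answer from 1 (incoherent) to 5 (fully coherent). "
        "Reply with the number only."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```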

 

However, as detailed in this blog, evals based on a ground truth offer much better reliability and granularity than relying on an evaluator LLM alone. Additionally, some metrics, like context recall, cannot be measured without ground truth.

 

In conclusion, while LLMs can provide valuable directional insights and help streamline the evaluation process, they cannot completely replace the evaluations obtained from ground truth data. Ground truth evaluations, though labor-intensive, offer unmatched precision and reliability. Combining both approaches can be beneficial.

 

Make evals part of your deployment process

 

To ensure your LLM system consistently meets the required performance criteria, it’s essential to integrate evaluations as part of your deployment pipelines. This integration not only validates the model’s performance before deployment but also maintains quality and reliability throughout the development lifecycle.

 

These tests should run automatically with every commit and before each deployment release, so that code changes don’t introduce errors or degrade performance. This article talks about writing unit test cases for LLMs in a fair amount of detail.
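For instance, a pipeline step might run a small pytest suite that fails the build when an eval metric drops below an agreed threshold. In this sketch, run_eval_suite and the threshold values are hypothetical placeholders for the metric calculation described earlier:

```python
# Sketch: an eval gate that runs in CI on every commit and before release.
import pytest
from my_evals import run_eval_suite  # hypothetical module in your repo

THRESHOLDS = {"answer_relevancy": 0.7, "faithfulness": 0.8, "context_recall": 0.7}

@pytest.fixture(scope="module")
def scores():
    # Run the full ground-truth eval suite once per test session.
    return run_eval_suite()  # e.g. {"answer_relevancy": 0.82, ...}

@pytest.mark.parametrize("metric", THRESHOLDS)
def test_metric_meets_threshold(scores, metric):
    assert scores[metric] >= THRESHOLDS[metric], f"{metric} regressed below {THRESHOLDS[metric]}"
```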

 

In addition to running the automated tests you have written, tools like Giskard can run scans as part of your deployment pipelines to probe your LLM for issues such as harmfulness, hallucinations and disclosure of sensitive information, so that every release is checked automatically before it ships.
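A rough sketch of such a scan step, assuming Giskard’s Python API (giskard.Model and giskard.scan; verify the exact signatures against the current documentation), where my_rag_app is again a hypothetical wrapper around your own LLM system:

```python
# Sketch: running a Giskard scan as a step in the deployment pipeline.
import pandas as pd
import giskard

def predict(df: pd.DataFrame) -> list[str]:
    # One generated answer per input question.
    return [my_rag_app.answer(q)["answer"] for q in df["question"]]

model = giskard.Model(
    model=predict,
    model_type="text_generation",
    name="Support assistant",
    description="Answers customer questions using the product knowledge base.",
    feature_names=["question"],
)

# Limit the scan to the detectors relevant for this pipeline stage; the LLM-based
# detectors need an LLM API key configured in the environment.
report = giskard.scan(model, only=["hallucination", "harmfulness"])
report.to_html("giskard_scan.html")  # publish as a pipeline artifact and fail the
                                     # build if the report contains issues
```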

A key point to note about automated tests for LLM systems is that you will also need to write tests for the data preprocessing and ingestion stages.

 

Post-deployment evaluations and data flywheels

 

To ensure your LLM system continues to perform optimally after deployment, it’s crucial to implement robust observability layers. These layers provide the necessary traces of input and output interactions, helping you understand where the system may be failing or underperforming.

 

 

  • Monitoring the interactions allows you to capture real-time data on how the LLM system handles different questions and scenarios (see the sketch after this list).
  • Set up continuous monitoring to detect anomalies and performance issues.
  • While automated tests provide continuous oversight, they are not sufficient on their own. Human evaluations are essential for capturing nuances and context that automated systems might miss.
  • Schedule periodic evaluation sessions with domain experts to assess the LLM’s performance and provide feedback for improvement.
  • Finally, create a feedback mechanism where users can report issues or provide feedback directly from the interface.
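Here is a minimal sketch of capturing interaction traces and user feedback as structured records (the field names are illustrative; in practice a dedicated LLM observability tool would usually handle this):

```python
# Sketch: logging each LLM interaction as a structured trace for later review.
import json
import time
import uuid

def log_interaction(question: str, retrieved_passages: list[str], answer: str,
                    user_feedback: str | None = None,
                    path: str = "llm_traces.jsonl") -> str:
    """Append one interaction record to a JSON Lines file and return its trace id."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "question": question,
        "retrieved_passages": retrieved_passages,
        "answer": answer,
        "user_feedback": user_feedback,  # e.g. thumbs up/down or a free-text report
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record["trace_id"]
```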

 

Data flywheels and continuous LLM enhancement

 

The concept of data flywheels is pivotal for the continuous enhancement of LLM systems. A data flywheel is a self-reinforcing loop that leverages data collected from operational environments to drive ongoing improvements in performance. For LLM systems, this translates to using real-time observations and feedback from production environments to refine your workflow, ensuring it becomes more accurate, relevant and effective over time. As discussed earlier, the insights from metrics can be used to make changes to the different components of your workflow ranging from chunking strategy to embeddings to prompts and retrieval methods.

 

It’s important to note that the metrics you come up with are not static; they may need to change and adapt over time as you learn more about end-user behavior and identify new usage patterns and failure scenarios in production.

 

By understanding how users interact with the system and where misunderstandings or inefficiencies occur, you can adjust your LLM system to be clearer, more specific and better aligned with user intent. Similarly, workflows can be streamlined to reduce friction and improve the overall user experience.

 

The beauty of the data flywheel approach is its cyclical nature. As you implement improvements based on your observations, these changes will generate new data, offering fresh insights for further refinement. This creates a perpetually improving system that becomes more effective and efficient with each iteration.

 

 

Evals first: The key to building reliable LLM applications

 

To build reliable and high-performing LLM applications, shifting evaluations left in the development workflow isn’t just beneficial — it’s essential. By integrating an evals-driven approach from the start, teams can proactively identify gaps, improve their implementations and ensure alignment with user expectations early on. 

 

The choice of evaluation methods should be guided by the nature of the application and the desired user experience. That means it's crucial to consider what the right metrics and benchmarks are upfront. Don’t treat evals as an afterthought — make them a cornerstone of your development process to build robust, user-centric AI applications.

Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.
