
Running Evaluations

As you saw in the quickstart section earlier, a test run on Confident AI is created when you run an evaluation with deepeval. A test run is made up of a list of test cases, each containing the input, actual_output, etc. of your LLM application, along with its own set of metric scores that together represent the performance of your LLM application as a whole.

In short, a test run is a benchmark for your LLM application.

There are two ways to create a test run:

  • Evaluating test cases locally with deepeval and sending them to Confident AI after evaluation.
  • Running an evaluation directly on Confident AI.

You can also think of LLM evaluation with Confident AI and deepeval as unit-testing for LLMs.

IMPORTANT

Before you begin, you'll definitely want to familiarize yourself with what a test case is, as it is vital for LLM evaluation.

Running An Evaluation

To run an evaluation, the first step is to choose your LLM evaluation metrics; you can see what's available here (custom metrics are available if none of deepeval's pre-defined metrics resonate with your use case).
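For example, if none of the pre-defined metrics fit, a custom criteria-based metric can be defined with deepeval's GEval. A minimal sketch (the metric name and criteria below are made up for illustration):

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# A hypothetical custom metric that judges whether the chatbot's advice
# is medically sound, based on the input and actual_output of each test case
medical_accuracy = GEval(
    name="Medical Accuracy",
    criteria="Determine whether the actual output gives medically sound advice for the input question.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.5,
)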

note

For an agentic evaluation use case, you can also include LLMTestCase parameters such as tools_called to capture which tools were used during the generation of each actual_output. For this example though, since we're evaluating a RAG-based medical chatbot with no function/tool calling, we're only going to populate the input, actual_output, and retrieval_context of each LLMTestCase.
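For example, an agentic test case might look roughly like the sketch below (it assumes a recent deepeval version where tools_called takes ToolCall objects; the tool name is made up for illustration):

from deepeval.test_case import LLMTestCase, ToolCall

# Hypothetical agentic test case that records which tools were called
agentic_test_case = LLMTestCase(
    input="Book me a doctor's appointment for next Tuesday.",
    actual_output="I've booked your appointment for Tuesday at 10am.",
    tools_called=[ToolCall(name="book_appointment")],  # illustrative tool name
)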

For the sake of simplicity, we'll reuse the example from the quickstart for demonstration purposes. It uses the AnswerRelevancyMetric and FaithfulnessMetric, which require the input, actual_output, and retrieval_context of each LLMTestCase.

Also, don't forget to login to Confident AI:

deepeval login

experiment_llm.py
from hypothetical_chatbot import hypothetical_chatbot  # replace with your own LLM application
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval import evaluate

# Use dataset from last section
dataset = EvaluationDataset()
dataset.pull(alias="QA Dataset")

# Create a list of LLMTestCase
test_cases = []
for golden in dataset.goldens:
    input = golden.input
    # Replace with your own LLM application
    actual_output, retrieval_context = hypothetical_chatbot(input)
    test_case = LLMTestCase(
        input=input,
        actual_output=actual_output,
        retrieval_context=retrieval_context
    )
    test_cases.append(test_case)

# Define metrics
answer_relevancy = AnswerRelevancyMetric(threshold=0.5)
faithfulness = FaithfulnessMetric(threshold=0.5)

# Run evaluation
evaluate(test_cases=test_cases, metrics=[answer_relevancy, faithfulness], identifier="First Run")

You'll notice that unlike in the quickstart guide, we added an identifier argument to this evaluate() call. The identifier is a string you can supply to identify test runs more easily; otherwise, all you'll see on Confident AI are the gibberish test run IDs we generate for you.

info

To learn about all the parameters evaluate() accepts, click here.

As you'll see on Confident AI, your second test case is actually failing for the FaithfulnessMetric metric - that's fine, we're going to fix it.

tip

If you click on the View test case details button, you'll be able to see the reasons why the FaithfulnessMetric metric failed, as well as the verbose_logs which details how the metric was calculated.
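If you prefer to debug this locally before opening the platform, the same information is available on the metric object after it runs. A minimal sketch, assuming deepeval's measure() method and verbose_mode flag (check your deepeval version if unsure):

from deepeval.metrics import FaithfulnessMetric

# Measure the failing (second) test case on its own to inspect the result
metric = FaithfulnessMetric(threshold=0.5, verbose_mode=True)
metric.measure(test_cases[1])

print(metric.score)   # numeric score between 0 and 1
print(metric.reason)  # why the metric passed or failed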

Running Another Evaluation

For now, let's pretend that we've improved our medical chatbot; here is the newly generated actual_output for the second LLMTestCase after one iteration:

"""If you cut your finger deeply, rinse it with soap and water, apply pressure to stop bleeding,
and cover it with a clean bandage. Seek medical care if it keeps bleeding, is very deep, or shows
signs of infection like redness or swelling. Ensure your tetanus shot is up to date."""

Go ahead and replace the second actual_output with the latest one above, and run evaluations again:

experiment_llm.py
from hypothetical_chatbot import hypothetical_chatbot  # replace with your own LLM application
from deepeval import evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

# Use dataset from last section
dataset = EvaluationDataset()
dataset.pull(alias="QA Dataset")

# Create a list of LLMTestCase
test_cases = []
for golden in dataset.goldens:
    input = golden.input
    # Don't forget to hardcode the second actual_output!
    actual_output, retrieval_context = hypothetical_chatbot(input)
    test_case = LLMTestCase(
        input=input,
        actual_output=actual_output,
        retrieval_context=retrieval_context
    )
    test_cases.append(test_case)

# Define metrics
answer_relevancy = AnswerRelevancyMetric(threshold=0.5)
faithfulness = FaithfulnessMetric(threshold=0.5)

# Run evaluation
evaluate(test_cases=test_cases, metrics=[answer_relevancy, faithfulness], identifier="Second Run")
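One way to hardcode the improved answer for just the second golden, purely for this walkthrough (the IMPROVED_ANSWER variable is only for demonstration and should be removed once your chatbot actually produces the better output):

IMPROVED_ANSWER = (
    "If you cut your finger deeply, rinse it with soap and water, apply pressure to stop bleeding, "
    "and cover it with a clean bandage. Seek medical care if it keeps bleeding, is very deep, or shows "
    "signs of infection like redness or swelling. Ensure your tetanus shot is up to date."
)

test_cases = []
for i, golden in enumerate(dataset.goldens):
    actual_output, retrieval_context = hypothetical_chatbot(golden.input)
    if i == 1:
        # Override the second test case's output with the improved answer
        actual_output = IMPROVED_ANSWER
    test_cases.append(
        LLMTestCase(
            input=golden.input,
            actual_output=actual_output,
            retrieval_context=retrieval_context,
        )
    )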

Don't forget to update your identifier, and congratulations 🎉! All your test cases are now passing with flying colors.

Comparing Two Test Runs

Now that you've created two test runs (identifier="First Run" and identifier="Second Run"), which are benchmarks of your LLM application, you can compare test results side by side to identify regressions and improvements.

info

In this example we only have two test cases, but this scales to thousands of test cases, allowing you to quickly identify how your most recent changes to your LLM application affect its performance.

In your individual Test Run page, navigate to the Compare Test Results page through the side navigation drawer, and select a test run to compare against in the dropdown menu. The rows you see when you open the dropdown are the test run IDs of historical test runs; if you supplied an identifier during evaluate(), you'll be able to easily tell which test run is which.

tip

Confident AI groups your test cases by their name, so if you want to compare test runs reliably you should supply a hard-coded name for the same LLMTestCases you want to compare side by side.

from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(name="239874", ...)

You can learn more about labelling test cases here.
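For example, deriving each test case's name from something stable, such as the golden's position in the dataset, keeps the same test case comparable across runs. A minimal sketch reusing the earlier loop:

from deepeval.test_case import LLMTestCase

test_cases = []
for i, golden in enumerate(dataset.goldens):
    actual_output, retrieval_context = hypothetical_chatbot(golden.input)
    test_cases.append(
        LLMTestCase(
            name=f"test-case-{i}",  # same golden -> same name in every test run
            input=golden.input,
            actual_output=actual_output,
            retrieval_context=retrieval_context,
        )
    )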

By selecting one of the historical test runs, you'll be able to see exactly how the metric scores differ between them, helping you work out whether your latest change is an improvement or a regression, and whether it caused any unacceptable breaking changes. Luckily for us, it didn't.

Testing Reports On Confident AI

All test runs, whether created on Confident AI or locally via deepeval (i.e. evaluate() or deepeval test run), are automatically sent to Confident AI for anyone in your organization to view, share, download, and comment on. Here is a list of things you can do with Confident AI's testing reports:

  • Analyze metric score distributions, averages, and median scores.
  • Inspect test cases, including parameters such as actual_output, retrieval_context, and tools_called, and their associated metric results including score, reasoning, errors (if any), and verbose logs (for debugging your evaluation model's chain of thought).
  • Filter for and select specific test cases for sharing.
  • Download test cases as CSV or JSON (if you're running a conversational test run).
  • Create a dataset from the test cases in a test run.
  • Compare different test runs side by side.
  • Create a public link to share with external stakeholders that might be interested in your LLM evaluation process.
info

Confident AI also keeps track of additional metadata such as time taken for a test run to finish executing, and the evaluation cost associated with this test run.

Logging Models, Prompts, And More

In the previous sections, we saw how to create a test run by dynamically generating actual_outputs for pulled datasets at evaluation time. In addition, if you want to experiment with different hyperparameter values, for example to iterate towards the optimal LLM for your application, you have to associate those hyperparameter values with each test run so you can compare them and pick the best combination on Confident AI.

Continuing from the previous example, if hypothetical_chatbot uses gpt-4o as its LLM, you can associate the gpt-4o value like so:

...

evaluate(
    test_cases=test_cases,
    metrics=[answer_relevancy, faithfulness],
    hyperparameters={"model": "gpt-4o", "prompt template": "your-prompt-template-goes-here"}
)

tip

You can run a nested for loop that regenerates actual_outputs for every hyperparameter combination, producing one test run per combination, as sketched further below.

for model in models:
    for prompt in prompts:
        evaluate(..., hyperparameters={...})

This helps Confident AI associate each test run's results with the hyperparameter combination that produced them, which you can then filter for and visualize on Confident AI to see which combination performs best.
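For example, a sweep over a few models and prompt templates could look roughly like the sketch below (it assumes hypothetical_chatbot accepts model and prompt_template arguments, which you'd adapt to your own application):

models = ["gpt-4o", "gpt-4o-mini"]
prompts = ["You are a careful medical assistant...", "Answer concisely..."]

for model in models:
    for prompt in prompts:
        test_cases = []
        for golden in dataset.goldens:
            # Regenerate outputs with this (model, prompt) combination
            actual_output, retrieval_context = hypothetical_chatbot(
                golden.input, model=model, prompt_template=prompt
            )
            test_cases.append(
                LLMTestCase(
                    input=golden.input,
                    actual_output=actual_output,
                    retrieval_context=retrieval_context,
                )
            )

        # One test run per hyperparameter combination
        evaluate(
            test_cases=test_cases,
            metrics=[answer_relevancy, faithfulness],
            hyperparameters={"model": model, "prompt template": prompt},
        )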


In the next section, we'll show how you can set up this workflow in CI/CD pipelines to unit-test your LLM application.