
Running Evaluations on QA Agents

To begin running evaluations on our QA Agent, we'll need two things:

  1. An EvaluationDataset populated with the generated answers (actual_output) and the retrieval_context from the knowledge base for each question.
  2. A selection of metrics to evaluate our EvaluationDataset on.

In the previous section, we selected AnswerRelevancy and Faithfulness as our metrics of choice. While we already have an EvaluationDataset, its Goldens have not yet been populated with the actual_output and retrieval_context these metrics require. Our first step is therefore to generate an actual output and retrieval context for each Golden before evaluation.
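
As a quick reminder, these metrics were instantiated in the previous section along the following lines. This is only a sketch: the threshold values shown here are illustrative assumptions, not the ones chosen earlier.

from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

# Thresholds are illustrative; use the values chosen in the previous section
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
faithfulness_metric = FaithfulnessMetric(threshold=0.7)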

info

You'll need to log in to Confident AI before you begin running evaluations. To do so, run the following command in your CLI:

deepeval login

Populating the Goldens

Before we populate the Goldens, we need access to them. We'll do this by pulling the synthetic dataset we generated from Confident AI, providing its unique alias:

note

By default, auto_convert_goldens_to_test_cases is set to True, but we'll set it to False for this tutorial since actual_output is a required parameter of an LLMTestCase, and we haven't generated the actual outputs yet.

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull("MadeUpCompany QA Dataset", auto_convert_goldens_to_test_cases=False)
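
As a quick sanity check, you can confirm the Goldens were pulled as expected before generating anything. The snippet below is purely illustrative:

# Confirm the goldens came through before generating outputs
print(f"Pulled {len(dataset.goldens)} goldens")
print(dataset.goldens[0].input)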

Next, we'll iterate through the Goldens, generate an actual_output and retrieval_context for each synthetic user query, and add the resulting LLMTestCases to the dataset.

from deepeval.test_case import LLMTestCase

for golden in dataset.goldens:
    # Compute the actual output and retrieval context for this input
    actual_output = qa_agent.generate(golden.input)  # Replace with logic to compute actual output
    retrieval_context = qa_agent.get_retrieval_context()  # Replace with logic to compute retrieval context

    dataset.add_test_case(
        LLMTestCase(
            input=golden.input,
            actual_output=actual_output,
            retrieval_context=retrieval_context,
        )
    )
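
The qa_agent calls above are placeholders for your own application logic. As a rough sketch, a hypothetical wrapper around a RAG pipeline might look like the following, where retriever and llm stand in for whatever retrieval and generation components your knowledge base actually uses:

# Hypothetical wrapper around a RAG pipeline; `retriever` and `llm` are
# placeholders for your own retrieval and generation components.
class QAAgent:
    def __init__(self, retriever, llm):
        self.retriever = retriever
        self.llm = llm
        self._last_context: list[str] = []

    def generate(self, query: str) -> str:
        # Retrieve supporting chunks, then generate an answer grounded in them
        self._last_context = self.retriever.retrieve(query)
        prompt = "\n".join(self._last_context) + f"\n\nQuestion: {query}"
        return self.llm.generate(prompt)

    def get_retrieval_context(self) -> list[str]:
        # Return the chunks used for the most recent answer
        return self._last_context

qa_agent = QAAgent(retriever, llm)  # assumes these components already exist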

Running Evaluations

With our dataset ready, we can begin running evaluations. Pass the populated dataset along with the selected metrics to the evaluate function. Additionally, log the hyperparameters defined in the introduction section to benchmark them against other sets of hyperparameters in future iterations.
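
The hyperparameters referenced below were defined in the introduction section; for context, they might look something like this (the values shown are illustrative assumptions, not prescriptions):

# Illustrative hyperparameter values; use the ones defined in the introduction
model = "gpt-4o"
prompt_template = "Answer the question using only the provided context:\n{context}\n\nQuestion: {question}"
top_k = 3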

from deepeval import evaluate

...
evaluate(
    dataset,
    metrics=[answer_relevancy_metric, faithfulness_metric],
    hyperparameters={"model": model, "prompt template": prompt_template, "top-k": top_k},
)