Running Evaluations on QA Agents
To begin running evaluations on our QA Agent, we'll need two things:

- An `EvaluationDataset` populated with the generated answers (`actual_output`) and the `retrieval_context` from the knowledge base for each question.
- A selection of metrics to evaluate our `EvaluationDataset` on.
In the previous section, we defined `AnswerRelevancy` and `Faithfulness` as our metrics of choice. While we already have an `EvaluationDataset`, it has not yet been populated with the actual outputs and retrieval contexts these metrics require. Our first step is therefore to generate an actual output and retrieval context for each `Golden` in the dataset.
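For reference, here is a minimal sketch of how these two metrics can be instantiated; in deepeval they correspond to the `AnswerRelevancyMetric` and `FaithfulnessMetric` classes. The `threshold` values below are illustrative assumptions, and the variable names match those used in the `evaluate` call at the end of this section:

```python
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

# Thresholds are illustrative assumptions -- tune them to your own quality bar
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
faithfulness_metric = FaithfulnessMetric(threshold=0.7)
```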
You'll need to log in to Confident AI before you begin running evaluations. To do so, run the following command in your CLI:

```bash
deepeval login
```
Populating the Goldens
Before we populate the Goldens, we need access to them. We'll do this by pulling the synthetic dataset we generated earlier from Confident AI, providing its unique alias.
By default, `auto_convert_goldens_to_test_cases` is set to `True`, but we'll set it to `False` for this tutorial since `actual_output` is a required parameter of an `LLMTestCase`, and we haven't generated the actual outputs yet.
```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="MadeUpCompany QA Dataset", auto_convert_goldens_to_test_cases=False)
```
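As a quick sanity check, you can confirm the pull worked by inspecting the goldens before generating anything:

```python
# The pulled dataset should contain one golden per synthetic query
print(f"Pulled {len(dataset.goldens)} goldens")
```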
Next, we'll iterate through each `Golden`'s input and generate the `actual_output` and `retrieval_context` for each synthetic user query.
```python
from deepeval.test_case import LLMTestCase

for golden in dataset.goldens:
    # Compute the actual output and retrieval context for this golden
    actual_output = qa_agent.generate(golden.input)  # Replace with logic to compute actual output
    retrieval_context = qa_agent.get_retrieval_context()  # Replace with logic to compute retrieval context

    dataset.add_test_case(
        LLMTestCase(
            input=golden.input,
            actual_output=actual_output,
            retrieval_context=retrieval_context,
        )
    )
```
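The `qa_agent` above is a placeholder for your own RAG pipeline. If your agent doesn't already expose these two methods, a thin wrapper along the lines of the sketch below can adapt it; everything in this sketch (`QAAgent`, its `retriever` and `llm` collaborators, and the prompt format) is a hypothetical stand-in, not a deepeval API:

```python
# Hypothetical wrapper sketch -- every name here is a stand-in for your own
# RAG components, not part of deepeval.
class QAAgent:
    def __init__(self, retriever, llm, prompt_template: str):
        self.retriever = retriever            # e.g. a vector-store search client
        self.llm = llm                        # e.g. an LLM client with a .generate() method
        self.prompt_template = prompt_template
        self._last_retrieval_context: list[str] = []

    def generate(self, query: str) -> str:
        # Retrieve supporting chunks from the knowledge base, then answer
        self._last_retrieval_context = self.retriever.retrieve(query)
        prompt = self.prompt_template.format(
            context="\n".join(self._last_retrieval_context),
            question=query,
        )
        return self.llm.generate(prompt)

    def get_retrieval_context(self) -> list[str]:
        # Chunks used by the most recent generate() call
        return self._last_retrieval_context
```

Caching the retrieval context on the instance lets the test-case loop read back exactly the chunks that produced each answer, which is what the `Faithfulness` metric needs to judge against.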
Running Evaluations
With our dataset ready, we can begin running evaluations. Pass the populated dataset along with the selected metrics to the `evaluate` function. Additionally, log the hyperparameters defined in the introduction section to benchmark them against other sets of hyperparameters in future iterations.
```python
from deepeval import evaluate
...

evaluate(
    dataset,
    metrics=[answer_relevancy_metric, faithfulness_metric],
    hyperparameters={"model": model, "prompt template": prompt_template, "top-k": top_k},
)
```
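If you prefer running evaluations as part of a test suite, deepeval also offers a pytest-style entry point via `assert_test`; the following is a minimal sketch assuming the same dataset and metrics as above:

```python
import pytest
from deepeval import assert_test

# Each test case in the dataset becomes its own pytest test
@pytest.mark.parametrize("test_case", dataset.test_cases)
def test_qa_agent(test_case):
    assert_test(test_case, [answer_relevancy_metric, faithfulness_metric])
```

You would then execute this file with `deepeval test run`, which reports results the same way `evaluate` does.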