Running Evaluations on QA Agents
To begin running evaluations on our QA agent, we'll need two things:

- An `EvaluationDataset` populated with the generated answers (`actual_output`) and the `retrieval_context` from the knowledge base for each question.
- A selection of metrics to evaluate our `EvaluationDataset` on.
In the previous section, we chose `AnswerRelevancy` and `Faithfulness` as our metrics. While we already have an `EvaluationDataset`, it has not yet been populated with the `actual_output` and `retrieval_context` fields these metrics require. Our first step is therefore to generate the actual outputs and retrieval contexts for each Golden before evaluating.
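If you haven't already instantiated these metrics, a minimal sketch looks like the following (the threshold values shown are illustrative, not prescriptive):

```python
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

# Thresholds are illustrative; tune them to your own quality bar
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
faithfulness_metric = FaithfulnessMetric(threshold=0.7)
```

These are the same `answer_relevancy_metric` and `faithfulness_metric` objects passed to `evaluate` later in this section.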
You'll need to log in to Confident AI before running evaluations. To do so, run the following command in your CLI:

```bash
deepeval login
```
Populating the Goldens
Before we can populate the Goldens, we need access to them. We'll pull the synthetic dataset we generated earlier from Confident AI by providing its unique alias.

By default, `auto_convert_goldens_to_test_cases` is set to `True`, but we'll set it to `False` for this tutorial since `actual_output` is a required parameter of an `LLMTestCase`, and we haven't generated the actual outputs yet.
```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull("MadeUpCompany QA Dataset", auto_convert_goldens_to_test_cases=False)
```
Next, we'll iterate through the Goldens and generate an `actual_output` and `retrieval_context` for each synthetic user query.
```python
from deepeval.test_case import LLMTestCase

for golden in dataset.goldens:
    # Compute the actual output and retrieval context for this input
    actual_output = qa_agent.generate(golden.input)  # Replace with logic to compute actual output
    retrieval_context = qa_agent.get_retrieval_context()  # Replace with logic to compute retrieval context

    dataset.add_test_case(
        LLMTestCase(
            input=golden.input,
            actual_output=actual_output,
            retrieval_context=retrieval_context,
        )
    )
```
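The `qa_agent` used above is a stand-in for your own application. As a rough, hypothetical illustration of the interface the loop assumes (the class name, retriever, and LLM calls below are all assumptions, not part of deepeval), a RAG-style agent might look like this:

```python
# Hypothetical QA agent; replace the retriever and LLM with your own stack.
class QAAgent:
    def __init__(self, retriever, llm, prompt_template: str, top_k: int = 3):
        self.retriever = retriever
        self.llm = llm
        self.prompt_template = prompt_template
        self.top_k = top_k
        self._last_retrieval_context: list[str] = []

    def generate(self, query: str) -> str:
        # Retrieve the top-k chunks from the knowledge base for this query
        self._last_retrieval_context = self.retriever.search(query, k=self.top_k)
        # Build the prompt from the retrieved context and generate an answer
        prompt = self.prompt_template.format(
            context="\n".join(self._last_retrieval_context),
            question=query,
        )
        return self.llm.complete(prompt)

    def get_retrieval_context(self) -> list[str]:
        # Return the chunks used for the most recent generation
        return self._last_retrieval_context
```

Whatever your implementation looks like, the important part is that each test case records both the generated answer and the exact chunks retrieved to produce it.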
Running Evaluations
With our dataset populated, we can run evaluations by passing it, along with our selected metrics, to the `evaluate` function. We'll also log the hyperparameters defined in the introduction so this configuration can be benchmarked against other sets of hyperparameters in future iterations.
```python
from deepeval import evaluate

...

evaluate(
    dataset,
    metrics=[answer_relevancy_metric, faithfulness_metric],
    hyperparameters={"model": model, "prompt template": prompt_template, "top-k": top_k},
)
```
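The `model`, `prompt_template`, and `top_k` variables are the hyperparameters assumed to have been defined in the introduction. Purely as an illustration (these values are examples, not recommendations), they might look like:

```python
# Example values only; log whatever configuration your QA agent actually uses
model = "gpt-4o"
prompt_template = (
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}"
)
top_k = 3
```

Logging these alongside each run makes it easy to compare metric scores across different configurations in future iterations.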