Defining Evaluation Criteria
In the previous section, we generated an entire synthetic QA dataset in just a few seconds. In this section, we'll use some of those queries to observe how our RAG QA agent responds and carve out which factors we care about in its responses. This will help us narrow down our evaluation criteria, which we'll need in order to select the right metrics for evaluation.
Revisiting the Generated User Queries
Let's examine the first three golden generations and generate responses for them. Since we're only examining three user queries, you can simply copy and paste these inputs from Confident AI into your codebase.
input_1 = "Examine CloudMate's unique enterprise offerings: comprehensive compliance tools and dedicated support services."
input_2 = "If CloudMate introduces an Ultra plan, doubling current storage and security, what pricing might be expected?"
input_3 = "How does MadeUpCompany's adherence to GDPR, HIPAA, and SOC 2 ensure data protection?"
As you can see, these synthetic questions are great because they challenge your QA agent on edge cases and go beyond asking simple questions. For example, input_2 asks about the pricing of a potential new plan offering.
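A quick note on setup: qa_agent in the snippets below refers to the RAG QA agent built earlier in this tutorial. If you're following along in isolation, a minimal stand-in might look like the sketch below; the class name, model, and system prompt are illustrative assumptions rather than the tutorial's actual implementation, and a real RAG agent would also retrieve CloudMate documents from a vector store.

# Hypothetical stand-in for the RAG QA agent from earlier sections.
# Assumes the openai package is installed and OPENAI_API_KEY is set.
from openai import OpenAI

class QAAgent:
    def __init__(self, model="gpt-4o-mini"):
        self.client = OpenAI()
        self.model = model

    def generate(self, query: str) -> str:
        # A real RAG agent would inject retrieved CloudMate context here.
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": "You answer questions about CloudMate, a cloud storage company."},
                {"role": "user", "content": query},
            ],
        )
        return response.choices[0].message.content

qa_agent = QAAgent()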
Generating QA Agent Responses for Inspection
We'll pass these inputs into our QA agent and see how it responds:
input_1 = "Examine CloudMate's unique enterprise offerings: comprehensive compliance tools and dedicated support services."
print(qa_agent.generate(input_1))
# CloudMate offers a secure and scalable cloud storage solution designed for businesses
# of all sizes. Its enterprise features include comprehensive compliance tools such as
# military-grade encryption, role-based access control, and multi-factor authentication.
# Additionally, CloudMate provides dedicated support services, including priority customer
# assistance and a dedicated account manager for enterprise clients.
The QA agent answered our first input correctly, providing relevant details about CloudMate's compliance tools and dedicated support services.
input_2 = "Examine CloudMate's unique enterprise offerings: comprehensive compliance tools and dedicated support services."
print(qa_agent.generate(input_2))
# The CloudMate Ultra plan will provide unlimited storage and top-tier security features
# at a fixed cost of $99.99/month.
The QA agent gave a speculative response about a hypothetical Ultra plan. The answer sounds plausible, but the question is explicitly hypothetical, so the quoted price has no factual basis unless CloudMate has officially announced such a plan.
input_3 = "How does MadeUpCompany's adherence to GDPR, HIPAA, and SOC 2 ensure data protection?"
print(qa_agent.generate(input_3))
# MadeUpCompany ensures data protection through strict adherence to GDPR, HIPAA, and SOC 2 standards.
# This includes robust encryption protocols, access control mechanisms, and regular security audits
# to safeguard sensitive data.
While the response is superficially relevant to data protection, it wrongly assumes knowledge about MadeUpCompany, even though the QA agent is meant to answer questions specifically about CloudMate. Instead, it should have declined to answer or clarified its limitations.
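Failing responses like this one are worth capturing as test cases as you go, so they can be scored automatically once metrics are chosen. Here's a minimal sketch using DeepEval's LLMTestCase (assuming deepeval is installed, as in earlier sections of this tutorial):

from deepeval.test_case import LLMTestCase

# Record the problematic input/output pair for later metric-based evaluation
test_case = LLMTestCase(
    input=input_3,
    actual_output=qa_agent.generate(input_3),
)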
Defining the Evaluation Criteria
By reviewing these three responses, we can start to pin down what we care about in our QA agent's replies. While the tone could be more personable, the main issues are that some answers are speculative when the question falls outside the scope of the company's knowledge base, and some are irrelevant. The qualities we care about form our evaluation criteria.
As you inspect more responses from your QA agent, your evaluation criteria will expand, especially as product requirements and user feedback come in.
For now, we'll focus on two key criteria, sketched in code right after this list:
- The answers must be non-speculative.
- The answers must be relevant.
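To preview how these criteria might eventually translate into code, here's a rough sketch using DeepEval: a custom GEval metric for the non-speculation criterion and an off-the-shelf relevancy metric for the second. The criterion wording and threshold are illustrative assumptions; the next section walks through metric selection properly.

from deepeval.metrics import GEval, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCaseParams

# Custom criterion: penalize answers that speculate beyond known facts
non_speculation = GEval(
    name="Non-Speculation",
    criteria="Determine whether the actual output avoids speculative claims that are not grounded in known facts about CloudMate.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

# Relevancy: DeepEval ships a ready-made metric for this criterion
relevancy = AnswerRelevancyMetric(threshold=0.7)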
In the next section, we’ll explore how these evaluation criteria help in selecting the right metrics for assessing our QA agent.