Tool Correctness
The tool correctness metric is an agentic LLM metric that assesses your LLM agent's function/tool calling ability. It is calculated by checking whether every tool that is expected to be used was indeed called.
The ToolCorrectnessMetric allows you to define the strictness of correctness. By default, it considers matching tool names to be correct, but you can also require input parameters and output to match.
Required Arguments
To use the ToolCorrectnessMetric, you'll have to provide the following arguments when creating an LLMTestCase:
input
actual_output
tools_called
expected_tools
Example
from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    # Replace this with the tools that were actually used by your LLM agent
    tools_called=[ToolCall(name="WebSearch"), ToolCall(name="ToolQuery")],
    expected_tools=[ToolCall(name="WebSearch")],
)

# Default settings: a tool call counts as correct when its name matches
metric = ToolCorrectnessMetric()
metric.measure(test_case)
print(metric.score)
print(metric.reason)
There are seven optional parameters when creating a ToolCorrectnessMetric:
- [Optional] threshold: a float representing the minimum passing threshold, defaulted to 0.5.
- [Optional] evaluation_params: a list of ToolCallParams indicating the strictness of the correctness criteria. For example, supplying a list containing ToolCallParams.NAME and ToolCallParams.INPUT_PARAMETERS, but excluding ToolCallParams.OUTPUT, will consider a tool correct if the tool name and input parameters match, even if the output does not. Defaults to a list with one element: [ToolCallParams.NAME].
- [Optional] include_reason: a boolean which, when set to True, will include a reason for its evaluation score. Defaulted to True.
- [Optional] strict_mode: a boolean which, when set to True, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to False.
- [Optional] verbose_mode: a boolean which, when set to True, prints the intermediate steps used to calculate said metric to the console, as outlined in the How Is It Calculated section. Defaulted to False.
- [Optional] should_consider_ordering: a boolean which, when set to True, will consider the order in which the tools were called. For example, if expected_tools=[ToolCall(name="WebSearch"), ToolCall(name="ToolQuery"), ToolCall(name="WebSearch")] and tools_called=[ToolCall(name="WebSearch"), ToolCall(name="WebSearch"), ToolCall(name="ToolQuery")], the metric will consider the tool calling to be correct when ordering is not considered, since both lists contain the same tools. Only available for ToolCallParams.TOOL and defaulted to False.
- [Optional] should_exact_match: a boolean which, when set to True, will require the tools_called and expected_tools to be exactly the same. Available for ToolCallParams.TOOL and ToolCallParams.INPUT_PARAMETERS and defaulted to False.
Since should_exact_match is a stricter criterion than should_consider_ordering, setting should_consider_ordering will have no effect when should_exact_match is set to True.
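To make the relationship between the two flags concrete, here is a minimal sketch in plain Python that compares lists of tool names in two ways. It is an illustration only, not deepeval's implementation: the helper names same_tools_any_order and exact_match are hypothetical, and the tool lists reuse the example given for should_consider_ordering.

```python
from collections import Counter

# Hypothetical helpers for illustration; not part of deepeval.


def same_tools_any_order(expected: list[str], called: list[str]) -> bool:
    """True when both lists hold the same tool names the same number of times."""
    return Counter(expected) == Counter(called)


def exact_match(expected: list[str], called: list[str]) -> bool:
    """True only when both lists hold the same tool names in the same order."""
    return expected == called


expected = ["WebSearch", "ToolQuery", "WebSearch"]
called = ["WebSearch", "WebSearch", "ToolQuery"]

order_ignored = same_tools_any_order(expected, called)  # True
exact = exact_match(expected, called)                   # False
print(order_ignored, exact)
```

Because an exact match already requires identical order, any pair of lists that passes exact_match would also pass an order-aware comparison, which is why the ordering flag adds nothing once exact matching is on.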
How Is It Calculated?
The ToolCorrectnessMetric, unlike all other deepeval metrics, is not calculated using any models or LLMs, but instead via exact matching between the expected_tools and tools_called parameters.
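The tool correctness metric score is calculated according to the following equation:
Tool Correctness = Number of Correctly Used Tools (or Correct Input Parameters/Outputs) / Total Number of Tools Called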
This metric assesses the accuracy of your agent's tool usage by comparing the tools_called by your LLM agent to the list of expected_tools. A score of 1 indicates that every tool utilized by your LLM agent was called correctly according to the list of expected_tools, should_consider_ordering, and should_exact_match, while a score of 0 signifies that none of the tools_called were called correctly.
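As a rough illustration of the scoring described above, the sketch below scores a run by tool name only. It assumes the score is the number of called tools that can be paired with an expected tool of the same name, divided by the total number of tools called. The function name_match_score and its handling of empty lists are assumptions made for this example, and deepeval's actual implementation may differ.

```python
from collections import Counter


def name_match_score(expected: list[str], called: list[str]) -> float:
    """Fraction of called tools that pair up with an expected tool name.

    Hypothetical helper, not deepeval's code. Each expected tool can be
    matched at most once, so repeated calls need repeated expectations.
    """
    if not called:
        # Assumption: calling nothing is only correct when nothing was expected.
        return 1.0 if not expected else 0.0
    remaining = Counter(expected)
    correct = 0
    for name in called:
        if remaining[name] > 0:
            remaining[name] -= 1
            correct += 1
    return correct / len(called)


all_correct = name_match_score(["WebSearch"], ["WebSearch"])
none_correct = name_match_score(["WebSearch"], ["ToolQuery"])
print(all_correct, none_correct)  # 1.0 0.0
```

The two calls reproduce the endpoints described in the text: 1 when every called tool is expected, 0 when none is.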
If should_exact_match is not specified and ToolCallParams.INPUT_PARAMETERS is included in evaluation_params, correctness may be a percentage score based on the proportion of correct input parameters (assuming the name and output are correct, if applicable).
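The proportional scoring of input parameters can be pictured with a small sketch. It assumes parameters are plain dictionaries and that the score is the share of expected keys whose values match exactly. The helper input_parameter_score is hypothetical, not a deepeval function, and the library may weigh parameters differently.

```python
def input_parameter_score(expected: dict, called: dict) -> float:
    """Share of expected input parameters that the call reproduced exactly.

    Hypothetical helper for illustration; not deepeval's implementation.
    """
    if not expected:
        # Assumption: with nothing expected, only an empty call is fully correct.
        return 1.0 if not called else 0.0
    matches = sum(
        1
        for key, value in expected.items()
        if key in called and called[key] == value
    )
    return matches / len(expected)


expected_params = {"query": "return policy", "max_results": 5}
called_params = {"query": "return policy", "max_results": 10}

partial = input_parameter_score(expected_params, called_params)
print(partial)  # 0.5
```

Here one of the two expected parameters matches, so the call earns half credit rather than failing outright.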