Before you can run an evaluation, you must have a ground truth dataset defined. To create your dataset, see Creating evaluation dataset.
The `evaluate` command lets you test and benchmark your agents using ground truth datasets. These datasets can be created automatically (with the `generate` or `record` commands) or prepared manually. The evaluation process measures your agent’s performance against the expected outcomes defined in these datasets.
`evaluate`
To evaluate your agent, use:
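A minimal invocation looks like the following sketch; the exact CLI entrypoint depends on your installation, and the flags shown (`--test-paths`, `--output-dir`) are the ones documented below.

```bash
# Sketch of a basic evaluation run. The entrypoint name is illustrative;
# --test-paths points at your ground truth datasets and --output-dir is
# where the results (summary table, config.yml, messages/) are written.
evaluate \
  --test-paths ./tests/datasets \
  --output-dir ./results
```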
You can also run evaluation using a YAML config file, giving you full control over all parameters.
Sample config file:
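The sample below is only a sketch: the field names (`test_paths`, `output_dir`) are assumed to mirror the CLI flags documented below, and the `results/config.yml` generated by a previous run shows the exact schema for your installation.

```yaml
# Hypothetical config file; field names mirror the documented CLI flags,
# and the exact schema may differ in your installation.
test_paths:
  - ./tests/datasets        # files or directories containing ground truth datasets
output_dir: ./results       # directory where evaluation results are saved
```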
Note: After any evaluation run, a `results/config.yml` file is generated, capturing all parameters used. This file can serve as a template for future runs.
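For example, reusing that generated file in a follow-up run could look like this sketch (as above, the entrypoint name is illustrative):

```bash
# Re-run the evaluation with the parameters captured from a previous run.
evaluate --config-file results/config.yml
```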
- `--test-paths`: Comma-separated list of test files or directories containing ground truth datasets.
- `--output-dir`: Directory where evaluation results will be saved.
- `--config-file` (Optional): Path to the configuration file with details about the datasets and the output directory.

At the end of the evaluation, you will see a summary similar to the following one:
This table is also saved as a CSV file at `results/summary_metrics.csv`.
Agent with Knowledge Summary Metrics

| Metric | Description | Calculation / Type |
|---|---|---|
| Average Response Confidence | Calculates an average of the confidence that the responses actually answer the user’s question | Float (≥ 0.0) |
| Average Retrieval Confidence | Calculates an average of the confidence that the retrieved document is relevant to the user’s query | Float (≥ 0.0) |
| Average Faithfulness | Calculates an average of how closely the responses match the values in the knowledge base | Float (≥ 0.0) |
| Average Answer Relevancy | Calculates how relevant the answers are based on the knowledge base queries | Float (≥ 0.0) |
| Number Calls to Knowledge Base | Total number of knowledge bases called | Integer (≥ 0) |
| Knowledge Bases Called | Names of the knowledge bases that are called | Text |
Agent Metrics

| Metric | Description | Calculation / Type |
|---|---|---|
| Total Steps | Total messages/steps in the conversation | Integer (≥ 0) |
| LLM Steps | Assistant responses (text/tool calls) | Integer (≥ 0) |
| Total Tool Calls | Total number of tool calls made | Integer (≥ 0) |
| Tool Call Precision | Calculates the number of correct tool calls divided by the total number of tool calls made | Float (≥ 0.0) |
| Tool Call Recall | Determines if the agent called the right tools in the right order | Float (≥ 0.0) |
| Agent Routing Accuracy | Determines if the agent reroutes to the expected agents. If there’s no agent routing in the simulation, the default value is 0.0 | Float (≥ 0.0) |
| Text Match | Determines if the final response is similar and accurate to the expected response | Categorical |
| Journey Success | Considers if the agent made the agent calls in the correct order, matching all the established criteria in the `goal_details` | Boolean |
| Avg Resp Time (Secs) | Average response time for agent responses | Float (≥ 0.0) |
If a metric’s value is equal to 1.0 or `True`, the table omits the result.
In the `results/messages` directory, you will find detailed analysis files for each dataset:

- `<DATASET>.messages.json`: Raw messages exchanged during simulation.
- `<DATASET>.messages.analyze.json`: Annotated analysis, including mistakes and step-by-step comparison to ground truth.
- `<DATASET>.metrics.json`: Metrics summary for that specific test case.
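Putting the outputs together, the results directory for a run might look like the following sketch (dataset names are placeholders; only the files described in this section are shown):

```text
results/
├── config.yml                           # parameters used for this run
├── summary_metrics.csv                  # summary metrics table
└── messages/
    ├── <DATASET>.messages.json          # raw messages exchanged during simulation
    ├── <DATASET>.messages.analyze.json  # annotated analysis vs. ground truth
    └── <DATASET>.metrics.json           # metrics summary for that test case
```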