Before you can run an evaluation, you must have a ground truth dataset defined. To create your dataset, see Creating evaluation dataset.
The `evaluate` command lets you test and benchmark your agents using ground truth datasets. These datasets can be created automatically (with the `generate` or `record` commands) or prepared manually. The evaluation process measures your agent’s performance against the expected outcomes defined in these datasets.
`evaluate`
To evaluate your agent, use:
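A minimal invocation looks like the following sketch; the exact CLI entrypoint depends on your installation, and the flags shown (`--test-paths`, `--output-dir`) are the ones documented below.

```bash
# Sketch of a basic evaluation run. The entrypoint name is illustrative;
# --test-paths points at your ground truth datasets and --output-dir is
# where the results (summary table, config.yml, messages/) are written.
evaluate \
  --test-paths ./tests/datasets \
  --output-dir ./results
```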
You can also run evaluation using a YAML config file, giving you full control over all parameters.
Sample config file:
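The sample below is only a sketch: the field names (`test_paths`, `output_dir`) are assumed to mirror the CLI flags documented below, and the `results/config.yml` generated by a previous run shows the exact schema for your installation.

```yaml
# Hypothetical config file; field names mirror the documented CLI flags,
# and the exact schema may differ in your installation.
test_paths:
  - ./tests/datasets        # files or directories containing ground truth datasets
output_dir: ./results       # directory where evaluation results are saved
```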
Note: After any evaluation run, a `results/config.yml` file is generated, capturing all parameters used. This file can serve as a template for future runs.
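For example, reusing that generated file in a follow-up run could look like this sketch (as above, the entrypoint name is illustrative):

```bash
# Re-run the evaluation with the parameters captured from a previous run.
evaluate --config-file results/config.yml
```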
- `--test-paths`: Comma-separated list of test files or directories containing ground truth datasets.
- `--output-dir`: Directory where evaluation results will be saved.
- `--config-file` (Optional): Path to the configuration file with details about the datasets and the output directory.

At the end of the evaluation, you will see a summary similar to the following one:
This table is also saved as a CSV file at `results/summary_metrics.csv`.
Agent with Knowledge Summary Metrics

| Metric | Description | Calculation / Type |
|---|---|---|
| Average Response Confidence | Calculates an average of the confidence that the responses actually answer the user’s question | Float (≥ 0.0) |
| Average Retrieval Confidence | Calculates an average of the confidence that the retrieved document is relevant to the user’s query | Float (≥ 0.0) |
| Average Faithfulness | Calculates an average of how closely the responses match the values in the knowledge base | Float (≥ 0.0) |
| Average Answer Relevancy | Calculates how relevant the answers are based on the knowledge base queries | Float (≥ 0.0) |
| Number Calls to Knowledge Base | Total number of knowledge bases called | Integer (≥ 0) |
| Knowledge Bases Called | Names of the knowledge bases that are called | Text |
Agent Metrics

| Metric | Description | Calculation / Type |
|---|---|---|
| Total Steps | Total messages/steps in the conversation | Integer (≥ 0) |
| LLM Steps | Assistant responses (text/tool calls) | Integer (≥ 0) |
| Total Tool Calls | Total number of tool calls made | Integer (≥ 0) |
| Tool Call Precision | Calculates the number of correct tool calls divided by the total number of tool calls made | Float (≥ 0.0) |
| Tool Call Recall | Determines if the agent called the right tools in the right order | Float (≥ 0.0) |
| Agent Routing Accuracy | Determines if the agent reroutes to the expected agents. If there’s no agent routing in the simulation, the default value is 0.0 | Float (≥ 0.0) |
| Text Match | Determines if the final response is similar and accurate to the expected response | Categorical |
| Journey Success | Considers if the agent made the agent calls in the correct order, matching all the established criteria in the `goal_details` | Boolean |
| Avg Resp Time (Secs) | Average response time for agent responses | Float (≥ 0.0) |
If a metric’s value is equal to 1.0 or `True`, the table omits the result.
In the `results/messages` directory, you will find detailed analysis files for each dataset:

- `<DATASET>.messages.json`: Raw messages exchanged during simulation.
- `<DATASET>.messages.analyze.json`: Annotated analysis, including mistakes and step-by-step comparison to ground truth.
- `<DATASET>.metrics.json`: Metrics summary for that specific test case.
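Putting the outputs together, the results directory for a run might look like the following sketch (dataset names are placeholders; only the files described in this section are shown):

```text
results/
├── config.yml                           # parameters used for this run
├── summary_metrics.csv                  # summary metrics table
└── messages/
    ├── <DATASET>.messages.json          # raw messages exchanged during simulation
    ├── <DATASET>.messages.analyze.json  # annotated analysis vs. ground truth
    └── <DATASET>.metrics.json           # metrics summary for that test case
```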