## Before you begin
To run an evaluation, you must first define your ground truth dataset. To create your dataset, see Creating evaluation dataset.

Note:
- Starting from version 1.12.0, this command works with both the SaaS and on-premises offerings of watsonx Orchestrate. You can now run your evaluations in remote instances instead of only in the watsonx Orchestrate Developer Edition.
- If you’re using Inference Frameworks Manager (IFM) in an on-premises environment, you might need to set the `MODEL_OVERRIDE` environment variable to enable the evaluation feature. For more information, see Evaluations on CPD.
## Evaluating
The `evaluate` command lets you test and benchmark your agents by using ground truth datasets. These datasets can be created automatically (with the `generate` or `record` commands) or prepared manually. The evaluation process measures your agent’s performance against the expected outcomes defined in these datasets.

During an evaluation:
- The system simulates user interactions described in your datasets.
- The agent’s responses are compared step-by-step with the ground truth dataset.
- Any mistakes or deviations from the expected trajectory are logged.
- At the end, a summary table of metrics is displayed and saved as a CSV.
### Prerequisites
- You must import all agents and tools before running the evaluation.
- Ensure that you have set up the environment properly. For more information, see Configuring your environments.
### `evaluate`
To evaluate your agent, use:
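The command fence appears to have been lost in conversion. A typical invocation might look like the following sketch; the subcommand and flag names here are assumptions inferred from the flag descriptions later in this document, not confirmed CLI syntax:

```shell
# Sketch only: subcommand and flag names are assumptions, not confirmed syntax.
# The flags correspond to the descriptions in the Flags section:
# a config file, an output directory, and an optional .env override.
orchestrate evaluations evaluate \
  --config ./eval_config.yml \
  --output-dir ./results \
  --env-file ./.env
```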
- Evaluations can be repeated multiple times to assess performance robustness. Configure this by setting the `n_runs` parameter in the YAML configuration file. If `n_runs` is not specified, the evaluation defaults to a single run.
- Tool call arguments are automatically normalized during evaluation. This ensures that minor differences in formatting, such as capitalization, key order, value types, or list order, no longer result in mismatches.
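As a minimal illustration of the YAML configuration described above, a sketch might look like the following. `n_runs` is the parameter named in this document; the other field names are illustrative assumptions, so compare against the `results/config.yml` generated by a previous run for the exact schema:

```yaml
# Sketch only: field names other than n_runs are illustrative assumptions.
test_paths:
  - ./tests/banking_agent    # directory of ground truth datasets
output_dir: ./results        # where summary and per-dataset files are written
n_runs: 3                    # repeat each dataset 3 times; defaults to 1 when omitted
```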
For SaaS and on-premises only:

For `tenant_name`, use the name of the environment that you used when you added it with the `orchestrate env add` command.
#### Flags
- Path to the configuration file with details about the datasets and the output directory.
- Comma-separated list of test files or directories containing ground truth datasets.
- Directory where evaluation results are saved.
- Path to the `.env` file that overrides the default environment.

After a run, a `results/config.yml` file is generated, capturing all the parameters that were used. This file can serve as a template for future runs.
### Understanding the Summary Metrics Table
At the end of the evaluation, the summary metrics table is displayed and also saved to `results/summary_metrics.csv`.
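If you want to post-process the summary programmatically, a small sketch like the following can load the CSV. The path is the default location mentioned above; the column names come from the CSV header, so this works regardless of which metrics your particular run produced:

```python
import csv


def read_summary(path="results/summary_metrics.csv"):
    """Load the evaluation summary CSV into a list of row dicts.

    Each dict maps a column name from the CSV header to that row's value,
    so no metric names need to be hard-coded here.
    """
    with open(path, newline="") as f:
        return list(csv.DictReader(f))
```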
#### Metrics explained
**Agent with Knowledge Summary Metrics**

| Metric | Description | Calculation / Type |
|---|---|---|
| Average Response Confidence | Calculates an average of the confidence that the responses actually answer the user’s question | Float (≥ 0.0) |
| Average Retrieval Confidence | Calculates an average of the confidence that the retrieved document is relevant to the user’s query | Float (≥ 0.0) |
| Average Faithfulness | Calculates an average of how closely the responses match the values in the knowledge base | Float (≥ 0.0) |
| Average Answer Relevancy | Calculates how relevant the answers are based on the knowledge base queries | Float (≥ 0.0) |
| Number Calls to Knowledge Base | Total number of knowledge bases called | Integer (≥ 0) |
| Knowledge Bases Called | Names of the knowledge bases that are called | Text |
| Metric | Description | Calculation / Type |
|---|---|---|
| Runs | Number of evaluation runs for the dataset, as defined by the n_runs parameter. | Integer (≥ 1) |
| Total Steps | Total messages/steps in the conversation | Integer (≥ 0) |
| LLM Steps | Assistant responses (text/tool calls) | Integer (≥ 0) |
| Total Tool Calls | Total number of tool calls made | Integer (≥ 0) |
| Tool Call Precision | Calculates the number of correct tool calls divided by the total number of tool calls made | Float (≥ 0.0) |
| Tool Call Recall | Determines if the agent called the right tools in the right order. | Float (≥ 0.0) |
| Agent Routing Accuracy | Determines if the agent reroutes to the expected agents. If there’s no agent routing in the simulation, the default value is 0.0 | Float (≥ 0.0) |
| Text Match | Determines if the final response is similar and accurate to the expected response | Percentage (0–100%) |
| Journey Success | Considers if the agent made the agent calls in the correct order, matching all the established criteria in the goal_details | Boolean |
| Avg Resp Time (Secs) | Average response time for agent responses | Float (≥ 0.0) |
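To make the Tool Call Precision metric and the argument normalization described earlier concrete, here is an illustrative sketch. It is not the product’s actual implementation, and it ignores call order, which the real recall metric takes into account:

```python
def normalize_call(value):
    """Canonicalize a tool call so cosmetic differences don't cause mismatches:
    lowercase strings and keys, sort dict keys and list items, unify numbers."""
    if isinstance(value, dict):
        return tuple(sorted(
            ((k.lower(), normalize_call(v)) for k, v in value.items()),
            key=lambda kv: kv[0],
        ))
    if isinstance(value, list):
        return tuple(sorted((normalize_call(v) for v in value), key=repr))
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        return value.lower()
    return value


def tool_call_precision_recall(actual, expected):
    """Precision: correct tool calls / total calls made.
    Recall (order-insensitive here): expected calls found / expected calls."""
    actual_n = [normalize_call(c) for c in actual]
    expected_n = [normalize_call(c) for c in expected]
    precision = (sum(1 for c in actual_n if c in expected_n) / len(actual_n)
                 if actual_n else 0.0)
    recall = (sum(1 for c in expected_n if c in actual_n) / len(expected_n)
              if expected_n else 0.0)
    return precision, recall


expected = [{"tool": "get_weather", "args": {"City": "Paris", "units": "C"}}]
actual = [{"tool": "get_weather", "args": {"units": "c", "city": "PARIS"}},
          {"tool": "get_time", "args": {}}]
p, r = tool_call_precision_recall(actual, expected)
print(p, r)  # 0.5 1.0 — the extra get_time call lowers precision, not recall
```

Note how the first actual call matches the expected call despite different key order and capitalization, which is the effect of the normalization described above.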
If a metric’s value is equal to 1.0 or `True`, the table omits that result.

### Per-Dataset Detailed Results
In the `results/messages` directory, you will find detailed analysis files for each dataset.
For single-run evaluations, the following files are generated:
- `<DATASET>.messages.json`: Raw messages exchanged during the simulation.
- `<DATASET>.messages.analyze.json`: Annotated analysis, including mistakes and a step-by-step comparison to the ground truth.
- `<DATASET>.metrics.json`: Metrics summary for that specific test case.
For multi-run evaluations, the following files are generated for each run:

- `<DATASET>.run<runN>.messages.json`
- `<DATASET>.run<runN>.messages.analyze.json`
- `<DATASET>.run<runN>.metrics.json`
`<runN>` represents the run number (for example, `run1`, `run2`, and so on).
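When you run multiple evaluations, you might want to collect the per-dataset metrics files programmatically. A sketch, assuming the file layout described above:

```python
import glob
import json
import os
from collections import defaultdict


def load_run_metrics(results_dir):
    """Group metrics files in results/messages by dataset name.

    Handles both single-run files (<DATASET>.metrics.json) and multi-run
    files (<DATASET>.run<runN>.metrics.json); single-run files are stored
    under run key "1".
    """
    grouped = defaultdict(dict)
    for path in sorted(glob.glob(os.path.join(results_dir, "*.metrics.json"))):
        stem = os.path.basename(path)[: -len(".metrics.json")]
        dataset, _, run = stem.partition(".run")
        with open(path) as f:
            grouped[dataset][run or "1"] = json.load(f)
    return dict(grouped)
```

For example, `load_run_metrics("results/messages")` returns a dict keyed by dataset name, whose values map run numbers to the parsed metrics JSON.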
Note: You evaluate an external agent’s performance the same way as a native agent. When you review the final summary table from the `evaluate` command, focus only on the Text Match and Journey Success columns. Native tool calls aren’t involved, so no other columns apply.
