Before you begin

Before you can evaluate, you must first define your ground truth dataset. To create your dataset, see Creating evaluation dataset.
Note:
  • Starting with version 1.12.0, this command works with the SaaS and on-premises offerings of watsonx Orchestrate. You can run your evaluations against remote instances instead of only against the watsonx Orchestrate Developer Edition.
  • If you’re using Inference Frameworks Manager (IFM) in an on-premises environment, you might need to set the MODEL_OVERRIDE environment variable to enable the evaluation feature. For more information, see Evaluations on CPD.

Evaluating

The evaluate command lets you test and benchmark your agents using ground truth datasets. These datasets can be created automatically (with the generate or record commands) or prepared manually. The evaluation process measures your agent’s performance against the expected outcomes defined in these datasets. During an evaluation:
  • The system simulates user interactions described in your datasets.
  • The agent’s responses are compared step-by-step with the ground truth dataset.
  • Any mistakes or deviations from the expected trajectory are logged.
  • At the end, a summary table of metrics is displayed and saved as a CSV.

Prerequisites

  1. You must import all agents and tools before running the evaluation; see the example after this list.
  2. Ensure that you have set up the environment properly. For more information, see Configuring your environments.
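For example, a typical import sequence looks like the following. The file names are hypothetical; see the agent and tool import documentation for the full set of options.
orchestrate agents import -f hr_agent.yaml
orchestrate tools import -k python -f hr_tools.py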

evaluate

To evaluate your agent, use:
orchestrate evaluations evaluate --test-paths path1,path2 --output-dir output_directory
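For example, to evaluate two datasets and write the results to a local directory (the paths are illustrative):
orchestrate evaluations evaluate --test-paths ./data/hr_tests,./data/banking_tests --output-dir ./eval_results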
You can also run evaluation using a YAML config file, giving you full control over all parameters.
orchestrate evaluations evaluate --config-file path/to/config.yaml
Sample config file:
test_paths:
  - benchmarks/wxo_domains/rel_1.8_mock/workday/data/
auth_config:
  url: http://localhost:4321
  tenant_name: local
output_dir: "test_bench_data3"
enable_verbose_logging: true
llm_user_config:
  user_response_style:
  - "Be concise in messages and confirmations"
n_runs: 2   # evaluations will run 2 times
Starting with version 1.13.0:
  • Evaluations can be repeated multiple times to assess performance robustness. This is configured by setting the n_runs parameter in the YAML configuration file. If n_runs is not specified, the evaluation defaults to a single run.
  • Tool call arguments are automatically normalized during evaluation. This ensures that minor differences in formatting, such as capitalization, key order, value types, or list order, no longer result in mismatches; see the example after this list.
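For example, the following two sets of tool call arguments (the tool and argument names are illustrative) differ only in key order, capitalization, and list order, so normalization treats them as a match:
Expected: {"city": "Boston", "units": ["C", "F"]}
Actual:   {"units": ["F", "C"], "city": "boston"}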
For SaaS and on-premises offerings only:
For tenant_name, use the name of the environment that you specified when you added the environment with the orchestrate env add command.
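For example, if you added a remote environment as follows (the environment name and URL are hypothetical; see Configuring your environments for the exact flags):
orchestrate env add --name my-saas-env --url https://<your-instance-url>
then set tenant_name: my-saas-env in your configuration file.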
| Option | Type | Description |
| --- | --- | --- |
| --config-file (-c) | string | Path to the configuration file with details about the datasets and the output directory. |
| --test-paths (-p) | list[string] | Comma-separated list of test files or directories that contain ground truth datasets. |
| --output-dir (-o) | string | Directory where the evaluation results are saved. |
| --env-file (-e) | string | Path to the .env file that overrides the default environment. |
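For example, to combine these options in a single run (the paths are illustrative):
orchestrate evaluations evaluate -p tests/hr_agent,tests/banking_agent -o results -e .env.staging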
After any evaluation run, a results/config.yml file is generated that captures all of the parameters used. It can serve as a template for future runs, as shown below.
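For example, to repeat a previous run with the same settings:
orchestrate evaluations evaluate --config-file results/config.yml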

Understanding the Summary Metrics Table

At the end of the evaluation, you see a summary similar to the following one:

[Image: Summary metrics table]

This table is also saved as a CSV file at results/summary_metrics.csv.

Metrics explained

Agent with Knowledge Summary Metrics

| Metric | Description | Calculation / Type |
| --- | --- | --- |
| Average Response Confidence | Average confidence that the responses actually answer the user’s question | Float (≥ 0.0) |
| Average Retrieval Confidence | Average confidence that the retrieved document is relevant to the user’s query | Float (≥ 0.0) |
| Average Faithfulness | Average of how closely the responses match the values in the knowledge base | Float (≥ 0.0) |
| Average Answer Relevancy | How relevant the answers are to the knowledge base queries | Float (≥ 0.0) |
| Number Calls to Knowledge Base | Total number of calls made to knowledge bases | Integer (≥ 0) |
| Knowledge Bases Called | Names of the knowledge bases that are called | Text |
Agent Metrics

| Metric | Description | Calculation / Type |
| --- | --- | --- |
| Runs | Number of evaluation runs for the dataset, as defined by the n_runs parameter | Integer (≥ 1) |
| Total Steps | Total messages/steps in the conversation | Integer (≥ 0) |
| LLM Steps | Assistant responses (text/tool calls) | Integer (≥ 0) |
| Total Tool Calls | Total number of tool calls made | Integer (≥ 0) |
| Tool Call Precision | Number of correct tool calls divided by the total number of tool calls made | Float (≥ 0.0) |
| Tool Call Recall | Determines whether the agent called the right tools in the right order | Float (≥ 0.0) |
| Agent Routing Accuracy | Determines whether the agent routes to the expected agents. If there is no agent routing in the simulation, the default value is 0.0 | Float (≥ 0.0) |
| Text Match | Determines whether the final response is similar and accurate compared to the expected response | Percentage (0–100%) |
| Journey Success | Indicates whether the agent made the agent calls in the correct order, matching all of the criteria established in goal_details | Boolean |
| Avg Resp Time (Secs) | Average response time for agent responses | Float (≥ 0.0) |
If a metric’s value is equal to 1.0 or True, the table omits that result.

Per-Dataset Detailed Results

In the results/messages directory, you will find detailed analysis files for each dataset. For single-run evaluations, the following files are generated:
  • <DATASET>.messages.json: Raw messages exchanged during simulation.
  • <DATASET>.messages.analyze.json: Annotated analysis, including mistakes and step-by-step comparison to ground truth.
  • <DATASET>.metrics.json: Metrics summary for that specific test case.
For multi-run evaluations (when the n_runs parameter is set to more than 1), separate files are generated for each run using a run-indexed naming pattern:
  • <DATASET>.run<runN>.messages.json
  • <DATASET>.run<runN>.messages.analyze.json
  • <DATASET>.run<runN>.metrics.json
Here, <runN> represents the run number (for example, run1, run2, and so on).
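For example, with n_runs: 2 and a dataset named customer_care (a hypothetical name), the results/messages directory contains:
customer_care.run1.messages.json
customer_care.run1.messages.analyze.json
customer_care.run1.metrics.json
customer_care.run2.messages.json
customer_care.run2.messages.analyze.json
customer_care.run2.metrics.json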

Tips

  • Always verify that your API credentials are set before running evaluate.
  • Review per-dataset result files for deep insight into agent performance and error patterns.
  • Tune config parameters as needed for different evaluation scenarios.
  • Use summary metrics for quick benchmarking, but always check details for full understanding.
Note: You evaluate an external agent’s performance the same way as a native agent’s. When you review the final summary table from the evaluate command, focus only on the Text Match and Journey Success columns. Native tool calls aren’t involved, so no other columns apply.