Before you begin

To evaluate, you must first have a ground truth dataset defined. To create your dataset, see Creating evaluation dataset.

Evaluating

The evaluate command lets you test and benchmark your agents using ground truth datasets. These datasets can be created automatically (with the generate or record commands) or prepared manually. The evaluation process measures your agent’s performance against the expected outcomes defined in these datasets. During an evaluation, the following happens (a sketch of the overall flow is shown after this list):

  • The system simulates user interactions described in your datasets.
  • The agent’s responses are compared step-by-step with the ground truth dataset.
  • Any mistakes or deviations from the expected trajectory are logged.
  • At the end, a summary table of metrics is displayed and saved as a CSV.
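
A rough sketch of the end-to-end flow, assuming the generate and record subcommands sit under orchestrate evaluations alongside evaluate (their flags are omitted here; refer to their own documentation) and using placeholder paths:

# 1. Create a ground truth dataset automatically (flags omitted; see the record/generate docs),
#    or prepare one manually instead
orchestrate evaluations record
# 2. Evaluate the agent against the dataset location (placeholder paths)
orchestrate evaluations evaluate --test-paths ./ground_truth/ --output-dir ./eval_results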

Prerequisites

  1. You must import all agents and tools before running the evaluation.
  2. Ensure that you have set up your environment properly (a minimal example follows this list). For more information, see Configuring your environments.
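
For example, a minimal sketch of activating a previously configured environment before running an evaluation; the environment name local and the env activate subcommand are assumptions about your setup, so see Configuring your environments for the exact steps:

orchestrate env activate local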

evaluate

To evaluate your agent, use:

orchestrate evaluations evaluate --test-paths path1,path2 --output-dir output_directory
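
For example, a run against two hypothetical dataset locations (the paths below are placeholders, not real files) might look like this:

orchestrate evaluations evaluate --test-paths ./datasets/banking_tests.json,./datasets/hr_tests/ --output-dir ./eval_results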

You can also run an evaluation by using a YAML config file, which gives you full control over all parameters:

orchestrate evaluations evaluate --config-file path/to/config.yaml

Sample config file:

test_paths:
  - benchmarks/wxo_domains/rel_1.8_mock/workday/data/
auth_config:
  url: http://localhost:4321
  tenant_name: local
output_dir: "test_bench_data3"
enable_verbose_logging: true
llm_user_config:
  user_response_style:
  - "Be concise in messages and confirmations"

Note:
After any evaluation run, a results/config.yml file is generated that captures all of the parameters that were used. This file can serve as a template for future runs.
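
For example, a later run could reuse that generated file directly; the path below is relative to wherever your results directory was written, which depends on your --output-dir:

orchestrate evaluations evaluate --config-file results/config.yml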

  • --test-paths: Comma-separated list of test files or directories containing ground truth datasets.
  • --output-dir: Directory where evaluation results will be saved.
  • --config-file (Optional): Path to the configuration file with details about the datasets and the output directory.

Understanding the Summary Metrics Table

At the end of the evaluation, a summary metrics table is displayed in the console.

This table is also saved as a CSV file at results/summary_metrics.csv.
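
If you want a quick look at the saved CSV from a terminal, here is a minimal sketch using standard shell utilities (column and less are assumed to be available, and this simple split on commas ignores quoted fields):

column -s, -t < results/summary_metrics.csv | less -S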

Metrics Explained

Agent with Knowledge Summary Metrics

Metric | Description | Calculation / Type
Average Response Confidence | Average confidence that the responses actually answer the user’s question | Float (≥ 0.0)
Average Retrieval Confidence | Average confidence that the retrieved document is relevant to the user’s query | Float (≥ 0.0)
Average Faithfulness | Average of how closely the responses match the values in the knowledge base | Float (≥ 0.0)
Average Answer Relevancy | How relevant the answers are to the knowledge base queries | Float (≥ 0.0)
Number Calls to Knowledge Base | Total number of knowledge bases called | Integer (≥ 0)
Knowledge Bases Called | Names of the knowledge bases that are called | Text

Agent Metrics

Metric | Description | Calculation / Type
Total Steps | Total messages/steps in the conversation | Integer (≥ 0)
LLM Steps | Assistant responses (text or tool calls) | Integer (≥ 0)
Total Tool Calls | Total number of tool calls made | Integer (≥ 0)
Tool Call Precision | Number of correct tool calls divided by the total number of tool calls made | Float (≥ 0.0)
Tool Call Recall | Determines whether the agent called the right tools in the right order | Float (≥ 0.0)
Agent Routing Accuracy | Determines whether the agent routes to the expected agents; if there is no agent routing in the simulation, the default value is 0.0 | Integer (≥ 0)
Text Match | Determines whether the final response is similar to the expected response and accurate | Categorical
Journey Success | Determines whether the agent made the agent calls in the correct order, matching all the criteria established in goal_details | Boolean
Avg Resp Time (Secs) | Average response time for agent responses | Float (≥ 0.0)

Results that are equal to 1.0 or True are omitted from the table.

Per-Dataset Detailed Results

In the results/messages directory, you will find detailed analysis files for each dataset (a quick way to inspect them follows this list):

  • <DATASET>.messages.json: Raw messages exchanged during simulation.
  • <DATASET>.messages.analyze.json: Annotated analysis, including mistakes and step-by-step comparison to ground truth.
  • <DATASET>.metrics.json: Metrics summary for that specific test case.
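
To skim any of these files from a terminal, here is a minimal sketch using jq (assumed to be installed; replace <DATASET> with the dataset name):

jq . results/messages/<DATASET>.metrics.json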

Tips

  • Always verify that your API credentials are set before running evaluate.
  • Review the per-dataset result files for deeper insight into agent performance and error patterns.
  • Tune config parameters as needed for different evaluation scenarios.
  • Use the summary metrics for quick benchmarking, but always check the detailed results for a full understanding.