New in 1.11.0
The `quick-eval` command provides a fast, reference-less evaluation of your agents and tools.
Note: For now, you can use only Python tools.
Unlike the `evaluate` command, it does not require ground truth datasets. Instead, it runs a lightweight check to identify common issues such as schema mismatches and hallucinations in tool calls.
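Conceptually, a reference-less check of this kind validates each recorded tool call against the tools the agent actually has available: a call to an unknown tool counts as a hallucination, and a call with missing or unexpected parameters counts as a schema mismatch. The sketch below is illustrative only; the class and function names (`ToolCall`, `ToolSchema`, `check_tool_call`) are hypothetical and are not part of the CLI.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified representations of a recorded tool call and a tool's schema.
@dataclass
class ToolCall:
    tool_name: str
    arguments: dict

@dataclass
class ToolSchema:
    name: str
    required_params: set = field(default_factory=set)
    optional_params: set = field(default_factory=set)

def check_tool_call(call: ToolCall, registry: dict) -> str:
    """Classify a tool call as 'ok', 'hallucination', or 'schema_mismatch'."""
    schema = registry.get(call.tool_name)
    if schema is None:
        # The agent invoked a tool that does not exist: a hallucinated call.
        return "hallucination"
    supplied = set(call.arguments)
    allowed = schema.required_params | schema.optional_params
    if not schema.required_params <= supplied or not supplied <= allowed:
        # Missing required parameters or unexpected ones: a schema mismatch.
        return "schema_mismatch"
    return "ok"

# Example: one valid call, one hallucinated call, one mismatched call.
registry = {"get_weather": ToolSchema("get_weather", required_params={"city"})}
calls = [
    ToolCall("get_weather", {"city": "Boston"}),
    ToolCall("get_wether", {"city": "Boston"}),       # unknown tool name
    ToolCall("get_weather", {"location": "Boston"}),  # wrong parameter name
]
print([check_tool_call(c, registry) for c in calls])
# ['ok', 'hallucination', 'schema_mismatch']
```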
Flags
- Path to the configuration file (`config.yaml`) with details about the evaluation settings.
- Comma-separated list of test files or directories containing ground truth datasets. Required when not using a configuration file.
- Directory containing tool definitions.
- Directory where evaluation results will be saved. Required when not using a configuration file.
- Path to the `.env` file that overrides the default environment.

Understanding the Summary Metrics Table
At the end of the evaluation, you will see a summary table that reports the following metrics.
Metrics explained
Quick Evaluation Summary Metrics

| Metric | Description | Calculation / Type |
|---|---|---|
| Dataset | Name of the dataset used for quick evaluation | Text |
| Tool Calls | Total number of tool calls attempted during the evaluation | Integer (≥ 0) |
| Successful Tool Calls | Number of tool calls that executed successfully without errors | Integer (≥ 0) |
| Tool Calls Failed due to Schema Mismatch | Number of tool calls that failed because the input/output schema did not match expectations | Integer (≥ 0) |
| Tool Calls Failed due to Hallucination | Number of tool calls that failed because the agent invoked tools that were irrelevant or non-existent | Integer (≥ 0) |
If a value is equal to 1.0 or `True`, the table omits that result.
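For reference, summary counts like those above can be derived by aggregating per-call check results such as the ones produced in the earlier sketch. The aggregation below is a minimal illustration under that assumption, not the tool's actual implementation; the `summarize` function and the dataset name are hypothetical.

```python
from collections import Counter

def summarize(dataset: str, results: list) -> dict:
    """Aggregate per-call check results ('ok', 'schema_mismatch', 'hallucination')
    into the quick evaluation summary metrics."""
    counts = Counter(results)
    return {
        "Dataset": dataset,
        "Tool Calls": len(results),
        "Successful Tool Calls": counts["ok"],
        "Tool Calls Failed due to Schema Mismatch": counts["schema_mismatch"],
        "Tool Calls Failed due to Hallucination": counts["hallucination"],
    }

# Hypothetical dataset name and per-call results, for illustration only.
print(summarize("sample_dataset", ["ok", "ok", "hallucination", "schema_mismatch"]))
```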
