- Using the `record` command: Captures live chat sessions via the chat UI and converts them into structured datasets. Useful for benchmarking agent behavior across different configurations (e.g., agent descriptions, LLMs).
- Using the `generate` command: Automatically creates ground truth datasets from user stories and tool definitions. Ideal for building realistic and repeatable evaluation scenarios.
Recording user interaction
The `record` command captures real-time chat interactions and automatically generates evaluation datasets from them.
With recording enabled, any conversation you have in the chat UI will automatically be captured and annotated for evaluation.
You can create as many sessions as needed. Each session’s data will be stored in separate annotated files.
Note: When you work with external agents that act as collaborators of native agents, use the `record` command the same way you do with native agents. The key difference is that the generated ground truth data doesn't include a graph of tool calls in the "goals" and "goal_details" sections.
Workflow
- Interact with the agent via the chat UI.
- Use the `record` command to capture the session.
- The session is converted into a structured dataset for evaluation.
Prerequisites
1. Activate your environment

You must activate an environment before you record your data. For more information about how to add an environment, see Configure access to remote environments.
Starting from version 1.12.0, this command works with the SaaS and on-premises offerings of watsonx Orchestrate, so you can run your evaluations on remote instances instead of only on the watsonx Orchestrate Developer Edition.
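As an illustration, activating a previously configured environment with the ADK CLI might look like the following sketch; the environment name `my-remote-env` is a placeholder, and the exact `orchestrate env` subcommands should be confirmed against the linked configuration guide.

```bash
# Hypothetical example: activate a previously added remote environment.
# "my-remote-env" is a placeholder; see "Configure access to remote environments".
orchestrate env activate my-remote-env
```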
2. Launch the Chat UI
SaaS and On-premises
Open the watsonx Orchestrate chat URL for your instance.
Local (watsonx Orchestrate Developer Edition)
Make sure your chat UI is running; you can start the chat interface with the command sketched below. Once the UI is running, open your browser and navigate to http://localhost:3000/chat-lite, where you can select the agent you wish to interact with, for example the `hr_agent` agent.
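A minimal sketch of starting the Developer Edition chat UI, assuming the standard ADK `orchestrate chat start` command is available in your installation:

```bash
# Start the local chat UI (watsonx Orchestrate Developer Edition).
# Once running, the UI is served at http://localhost:3000/chat-lite.
orchestrate chat start
```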
Start recording your session
To begin recording, run the record command in your terminal (see the sketch after the notes below).

- `--output-dir` (optional): Directory where your recorded data will be saved. If omitted, the data will be saved in your current working directory.

For every chat session, the following file is generated in your output directory:

- `<THREAD_ID>_annotated_data.json`: Annotated ground truth data based on your chat session, ready for evaluation.
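A minimal sketch of the recording invocation, assuming the ADK exposes recording as `orchestrate evaluations record`; the `./recordings` path is only an illustrative value for `--output-dir`:

```bash
# Hypothetical invocation: start capturing chat sessions for evaluation.
# --output-dir is optional; ./recordings is an illustrative placeholder.
orchestrate evaluations record --output-dir ./recordings
```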
Example annotated data
The following is a sample conversation with the `hr_agent`:

Note:
- The annotated data is generated automatically. Therefore, it is essential to review and, if necessary, edit the data before using it for evaluation purposes. You can also delete any details that are not relevant to your tests.
- The `starting_sentence` field is populated directly from your inputs. However, other fields like `story` and `goals` are derived from the recorded conversation and might require validation to ensure their accuracy and relevance.
Stopping the recording
When you are done with your session, press Ctrl+C in the terminal running the record command. Be sure to finish your conversation before stopping to avoid generating an incomplete dataset.
Generating user data
The `generate` command transforms user stories into structured test cases using your tool definitions. It produces datasets suitable for automated evaluation and benchmarking of agents.
Note: For now, you can use only Python tools.
Key Features
- Converts user stories into structured test cases
- Generates valid tool call sequences based on your tool definitions
- Outputs datasets for consistent and automated agent evaluation
Prerequisites
Before running the `generate` command, ensure the following:
- Tool Definitions: Define tools in a Python module using the `@tool` decorator and proper type annotations.
- User Stories: Prepare a `.csv` file containing user stories. Each row should include:
  - `story`: A natural language description of the user's goal
  - `agent`: The name of the agent responsible for handling the story
- Environment Setup: Import your tool and agent definitions into the environment where the command will be executed. For more information, see Importing Tools.
Example user stories
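A minimal illustrative `.csv` with the two required columns; the story text is hypothetical, and `hr_agent` is reused from the recording example above:

```csv
story,agent
"I want to check how many vacation days I have left",hr_agent
"I need to update my direct deposit information",hr_agent
```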
The generate command
Run the generate command with the following options (a sketch follows the list):

- `--stories-path`: Path to your CSV file of user stories
- `--tools-path`: Path to your Python file defining agent tools
- `--output-dir` (optional): Output directory for generated files; if omitted, files are saved alongside your stories file.
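A minimal sketch of the invocation, assuming the ADK exposes generation as `orchestrate evaluations generate`; the file paths are illustrative placeholders:

```bash
# Hypothetical invocation: generate test cases from user stories and tool definitions.
# The paths below are placeholders for your own stories CSV and tools module.
orchestrate evaluations generate \
  --stories-path ./user_stories.csv \
  --tools-path ./hr_tools.py \
  --output-dir ./generated_data
```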
The generate command analyzes each story and generates a sequence of tool calls, which is saved as an `<AGENT_NAME>_snapshot_llm.json` file in your output directory.
The snapshot is then used to generate structured test cases that you can use for evaluating your agent(s). The generated datasets are written to a `<AGENT_NAME>_test_cases/` folder in the output directory.
Tool requirements
Tool definitions must be provided in a Python file and must follow these requirements (see the sketch after this list):
- Functions must be top-level (not inside classes)
- Each tool must use the `@tool` decorator
- Each tool must include a descriptive docstring
- Arguments must be typed (`str`, `int`, etc.)
- Return values must be JSON-serializable (`str`, `list`, `dict`, etc.)
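A minimal sketch of a tool module that satisfies these requirements; the function name, its hardcoded data, and the `ibm_watsonx_orchestrate` import path are assumptions to be checked against your ADK version and the Importing Tools documentation.

```python
# Hypothetical tool module; the import path and tool contents are illustrative only.
from ibm_watsonx_orchestrate.agent_builder.tools import tool


@tool
def get_vacation_balance(employee_id: str) -> dict:
    """Return the remaining vacation days for the given employee.

    Args:
        employee_id: The unique identifier of the employee.
    """
    # Mocked, hardcoded data; a real tool would call an HR system API
    # with proper authentication and error handling.
    return {"employee_id": employee_id, "vacation_days_remaining": 12}
```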
Note:
The tools provided in this example are mocked and use hardcoded data. If your tools need to make actual API calls, make sure to include the necessary authentication credentials (API keys, tokens, etc.) and proper error handling in your implementation.

