There are two primary methods for generating evaluation datasets for your agents:

  • Using the record command
    Captures live chat sessions via the chat UI and converts them into structured datasets. Useful for benchmarking agent behavior across different configurations (e.g., agent descriptions, LLMs).
  • Using the generate command
    Automatically creates ground truth datasets from user stories and tool definitions. Ideal for building realistic and repeatable evaluation scenarios.

Recording user interaction

The record command captures real-time chat interactions and automatically generates evaluation datasets from them.

With recording enabled, any conversation you have in the chat UI will automatically be captured and annotated for evaluation.

You can create as many sessions as needed. Each session’s data will be stored in separate annotated files.

Tip:
Create a new chat session when you want to make new datasets. Using the same chat session for multiple tests can cause issues with the final output.

Workflow

  1. Interact with the agent via the chat UI.
  2. Use the record command to capture the session.
  3. The session is converted into a structured dataset for evaluation.

Prerequisites

1. Launch the Chat UI

First, make sure your chat UI is running. Use the following command to start the chat interface:

orchestrate chat start

2. Access the Chat UI

Once the UI is running, open your browser and navigate to:

http://localhost:3000/chat-lite

Here, you can select the agent you wish to interact with, for example, the hr_agent agent used in the examples below.

Start recording your session

To begin recording, run the following command in your terminal:

orchestrate evaluations record --output-dir ./chat_recordings

Arguments:

  • --output-dir (optional): Directory where your recorded data will be saved. If omitted, the data will be saved in your current working directory. For every chat session, the following file is generated in your output directory:
    • <THREAD_ID>_annotated_data.json Annotated ground truth data based on your chat session, ready for evaluation.
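
After one or more sessions have been recorded, you can quickly confirm what was captured. The following is a minimal sketch (assuming the ./chat_recordings directory from the command above) that lists the annotated files by the naming convention described here:

from pathlib import Path

# Directory passed to --output-dir in the record command above
recordings_dir = Path("./chat_recordings")

# Each chat session produces one <THREAD_ID>_annotated_data.json file
for annotated_file in sorted(recordings_dir.glob("*_annotated_data.json")):
    print(annotated_file.name)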

Example annotated data

For example, a sample conversation with the hr_agent, in which the user nwaters asks for their time off schedule between 2025-01-01 and 2025-01-30, generates the following annotated data file:

{
    "agent": "hr_agent",
    "goals": {
        "get_assignment_id_hr_agent-1": [
            "get_timeoff_schedule_hr_agent-1"
        ],
        "get_timeoff_schedule_hr_agent-1": [
            "summarize"
        ]
    },
    "goal_details": [
        {
            "type": "tool_call",
            "name": "get_assignment_id_hr_agent-1",
            "tool_name": "get_assignment_id_hr_agent",
            "args": {
                "username": "nwaters"
            }
        },
        {
            "type": "tool_call",
            "name": "get_timeoff_schedule_hr_agent-1",
            "tool_name": "get_timeoff_schedule_hr_agent",
            "args": {
                "assignment_id": "15778303",
                "end_date": "2025-01-30",
                "start_date": "2025-01-01"
            }
        },
        {
            "name": "summarize",
            "type": "text",
            "response": "Your time off schedule is on January 5, 2025.",
            "keywords": [
                "January 5, 2025"
            ]
        }
    ],
    "story": "You want to know your time off schedule. Your username is nwaters. The start date is 2025-01-01. The end date is 2025-01-30.",
    "starting_sentence": "I want to know my time off schedule"
}

Note:

  • The annotated data is generated automatically. Therefore, it is essential to review and, if necessary, edit the data before using it for evaluation purposes. You can also delete any details that are not relevant to your tests.
  • The starting_sentence field is populated directly from your inputs. However, other fields like story and goals are derived from the recorded conversation and might require validation to ensure their accuracy and relevance.
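
Because the annotations are produced automatically, a short script can help with that review. The following is a minimal sketch (the file name is hypothetical; the field names match the example above) that loads one annotated file, prints its fields for inspection, and writes any edits back to disk:

import json
from pathlib import Path

# Hypothetical recording; substitute a real <THREAD_ID>_annotated_data.json from your output directory
annotated_path = Path("./chat_recordings/abc123_annotated_data.json")
data = json.loads(annotated_path.read_text())

# Review the automatically generated fields before using them for evaluation
print("Story:", data["story"])
print("Starting sentence:", data["starting_sentence"])
for detail in data["goal_details"]:
    print(f"- {detail['type']}: {detail['name']}")

# After deleting or correcting any details that are not relevant to your test,
# save the reviewed version
annotated_path.write_text(json.dumps(data, indent=4))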

Stopping the recording

When you are done with your session, press Ctrl+C in the terminal running the record command. Be sure to finish your conversation before stopping to avoid generating an incomplete dataset.

Generating user data

The generate command transforms user stories into structured test cases using your tool definitions. It produces datasets suitable for automated evaluation and benchmarking of agents.

Key Features

  • Converts user stories into structured test cases
  • Generates valid tool call sequences based on your tool definitions
  • Outputs datasets for consistent and automated agent evaluation

Prerequisites

Before running the generate command, ensure the following:

  1. Tool Definitions: Define tools in a Python module using the @tool decorator and proper type annotations.
  2. User Stories: Prepare a .csv file containing user stories. Each row should include:
    • story: A natural language description of the user’s goal
    • agent: The name of the agent responsible for handling the story
  3. Environment Setup: Import your tool and agent definitions into the environment where the command will be executed. For more information, see Importing Tools.

Example user stories

story,agent
I want to look up holiday calendar for this year. My email-id is someone@gmail.com,hr_agent
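
If you prefer to build the stories file programmatically, the following minimal sketch (the file name and story text are illustrative) writes a CSV with the required story and agent columns:

import csv

# Illustrative user stories; each row needs the story text and the agent that should handle it
stories = [
    {
        "story": "I want to look up holiday calendar for this year. My email-id is someone@gmail.com",
        "agent": "hr_agent",
    },
]

with open("user_stories.csv", "w", newline="") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=["story", "agent"])
    writer.writeheader()
    writer.writerows(stories)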


The generate command

Run the following command:

orchestrate evaluations generate --stories-path <path-to-stories> --tools-path <path-to-tools>

Arguments:

  • --stories-path: Path to your CSV file of user stories
  • --tools-path: Path to your Python file defining agent tools
  • --output-dir (optional): Output directory for generated files; if omitted, files are saved alongside your stories file.

The generate command analyzes each story and generates a sequence of tool calls, which is saved as an <AGENT_NAME>_snapshot_llm.json file in your output directory.

The snapshot is then used to generate structured test cases that you can use for evaluating your agent(s). The generated datasets are written to a <AGENT_NAME>_test_cases/ folder in the output directory.
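
As a quick sanity check after the command finishes, you can list what was produced. The following is a minimal sketch; the output directory and agent name are illustrative, and only the file and folder names documented above are assumed:

import json
from pathlib import Path

output_dir = Path("./generated")   # whatever you passed to --output-dir
agent_name = "hr_agent"            # illustrative agent name

# The tool-call sequence derived from your stories
snapshot_path = output_dir / f"{agent_name}_snapshot_llm.json"
snapshot = json.loads(snapshot_path.read_text())
print("Loaded snapshot:", snapshot_path.name)

# The structured test cases generated from the snapshot
for test_case in sorted((output_dir / f"{agent_name}_test_cases").glob("*")):
    print("Test case:", test_case.name)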

Tool requirements

Tool definitions must be provided in a Python file and must follow these requirements:

  • Functions must be top-level (not inside classes)
  • Each tool must use the @tool decorator
  • Each tool must include a descriptive docstring
  • Arguments must be typed (str, int, etc.)
  • Return values must be JSON-serializable (str, list, dict, etc.)

Example:

from ibm_watsonx_orchestrate.agent_builder.tools import tool  # tool decorator from the ADK

@tool()
def fetch_assignment_id(username: str) -> str:
    """
    Return the assignment ID for a given employee username.
    :param username: Employee's username
    :return: Assignment ID as a string or 'not found'
    """
    assignment_ids = {
        "nwaters": "15778303",
        "johndoe": "15338303",
        "nken": "15338304"
    }
    return assignment_ids.get(username, "not found")

Note:
The tools provided in this example are mocked and use hardcoded data. If your tools need to make actual API calls, make sure to include the necessary authentication credentials (API keys, tokens, etc.) and proper error handling in your implementation.
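
For illustration, a non-mocked variant of the tool above might look like the following sketch. The endpoint URL and the HR_API_TOKEN environment variable are assumptions, not part of the product:

import os

import requests
from ibm_watsonx_orchestrate.agent_builder.tools import tool


@tool()
def fetch_assignment_id(username: str) -> str:
    """
    Return the assignment ID for a given employee username.
    :param username: Employee's username
    :return: Assignment ID as a string or 'not found'
    """
    # Hypothetical HR service and credential; replace with your own endpoint and secret handling
    api_token = os.environ["HR_API_TOKEN"]
    try:
        response = requests.get(
            "https://hr.example.com/api/assignments",  # illustrative URL
            params={"username": username},
            headers={"Authorization": f"Bearer {api_token}"},
            timeout=10,
        )
        response.raise_for_status()
        return response.json().get("assignment_id", "not found")
    except requests.RequestException:
        # Keep the return value JSON-serializable even when the call fails
        return "not found"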

Tip:

  • Always verify that your API credentials are set before running generate.
  • Generated datasets serve as ground truth for benchmarking and validating agent behavior.