There are two primary methods for generating an evaluation dataset to evaluate your agents:
- the `record` command
- the `generate` command

## The `record` command

The `record` command captures real-time chat interactions and automatically generates evaluation datasets from them.
With recording enabled, any conversation you have in the chat UI will automatically be captured and annotated for evaluation.
You can create as many sessions as needed. Each session’s data will be stored in separate annotated files.
Tip:
Create a new chat session whenever you want to produce a new dataset. Reusing the same chat session for multiple tests can cause issues with the final output.
Follow the steps below and use the `record` command to capture the session.

### Launch the Chat UI
First, make sure your chat UI is running. Use the following command to start the chat interface:
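The exact command depends on your setup; the snippet below is only a placeholder sketch (the `your-cli` executable name and `chat start` subcommand are assumptions, not part of this documentation):

```bash
# Placeholder: substitute your framework's actual command for starting the chat UI
your-cli chat start
```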
### Access the Chat UI
Once the UI is running, open your browser and navigate to:
http://localhost:3000/chat-lite
Here, you can select the agent you wish to interact with. For example, the image below uses the `hr_agent` agent:
To begin recording, run the following command in your terminal:
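For example (a sketch only: the `your-cli` executable name is a placeholder for your framework's CLI, while the `record` subcommand and `--output-dir` flag are the ones described below):

```bash
# Hypothetical invocation: replace `your-cli` with your framework's CLI executable.
# Annotated data from the chat session is written to ./recordings.
your-cli record --output-dir ./recordings
```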
Arguments:

- `--output-dir` (optional): Directory where your recorded data will be saved. If omitted, the data will be saved in your current working directory.

For every chat session, the following file is generated in your output directory:

- `<THREAD_ID>_annotated_data.json`: Annotated ground truth data based on your chat session, ready for evaluation.

The following is a sample conversation with the `hr_agent`:
This conversation generates the following annotated data file:
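The exact contents depend on your conversation; the sketch below only illustrates the fields discussed in the note that follows, with placeholder values (the real file typically contains more detail):

```json
{
  "starting_sentence": "<your first message in the chat session>",
  "story": "<a story derived from the recorded conversation>",
  "goals": "<goals derived from the recorded conversation>"
}
```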
Note:
The `starting_sentence` field is populated directly from your inputs. However, other fields, such as `story` and `goals`, are derived from the recorded conversation and might require validation to ensure their accuracy and relevance.

When you are done with your session, press `Ctrl+C` in the terminal running the `record` command. Be sure to finish your conversation before stopping to avoid generating an incomplete dataset.
## The `generate` command

The `generate` command transforms user stories into structured test cases using your tool definitions. It produces datasets suitable for automated evaluation and benchmarking of agents.

Before running the `generate` command, ensure you have the following:
- A Python file defining your agent tools, where each tool uses the `@tool` decorator and proper type annotations.
- A `.csv` file containing user stories (a minimal sketch follows this list). Each row should include:
  - `story`: A natural language description of the user's goal
  - `agent`: The name of the agent responsible for handling the story

You can find examples of stories and tools in the following links:
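To visualize the format, here is a minimal sketch of a stories file (the story text is illustrative, and `hr_agent` is simply the agent used in the recording example above):

```csv
story,agent
"An employee wants to check how many vacation days they have left",hr_agent
"An employee wants to submit a time-off request for next week",hr_agent
```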
### Running the `generate` command

Run the following command:
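For example (the `your-cli` executable name is a placeholder for your framework's CLI; the flags match the argument descriptions below):

```bash
# Hypothetical invocation: generate test cases from user stories and tool definitions
your-cli generate \
  --stories-path ./stories.csv \
  --tools-path ./tools.py \
  --output-dir ./generated_datasets
```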
Arguments:

- `--stories-path`: Path to your CSV file of user stories
- `--tools-path`: Path to your Python file defining agent tools
- `--output-dir` (optional): Output directory for generated files; if omitted, files are saved alongside your stories file.

The `generate` command analyzes each story and generates a sequence of tool calls, which is saved as an `<AGENT_NAME>_snapshot_llm.json` file in your output directory. The snapshot is then used to generate structured test cases that you can use for evaluating your agent(s). The generated datasets are written to an `<AGENT_NAME>_test_cases/` folder in the output directory.
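Using the `hr_agent` example from earlier, the output directory would look roughly like this (a sketch; the file names inside the test-case folder will vary):

```
<output-dir>/
├── hr_agent_snapshot_llm.json
└── hr_agent_test_cases/
    └── ...
```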
Tool definitions must be provided in a Python file and must follow these requirements:

- Each tool must be decorated with the `@tool` decorator
- Each parameter must have a type annotation (`str`, `int`, etc.)
- Each tool must declare a return type annotation (`str`, `list`, `dict`, etc.)

Example:
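The snippet below is a minimal sketch of a mocked tool that satisfies these requirements; the import path for the `@tool` decorator is framework-specific and is an assumption here, as are the function and field names:

```python
# NOTE: the import path for `tool` is framework-specific; adjust it to your setup.
from your_framework.tools import tool


@tool
def get_vacation_balance(employee_id: str) -> dict:
    """Return the remaining vacation days for an employee.

    Mocked implementation that returns hardcoded data; a real tool would
    call your HR system's API instead.
    """
    return {"employee_id": employee_id, "remaining_vacation_days": 12}
```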
Note:
The tools provided in this example are mocked and use hardcoded data. If your tools need to make actual API calls, make sure to include the necessary authentication credentials (API keys, tokens, etc.) and proper error handling in your implementation.
Tip:
The tools file shown in this example is the same Python file you pass via `--tools-path` when running `generate`.