Overview
The evaluation framework in the ADK is designed to assess agent behavior by comparing simulated agent interactions, referred to as trajectories, against a predefined set of reference data. It also includes tools to generate this reference data and supports the entire agent development lifecycle.
Setup environment
By default, the evaluation framework uses your localhost development environment. To target a different environment, pass a `.env` file with the `--env-file` argument. Set up this `.env` file in the same way as when you install the watsonx Orchestrate Developer Edition.
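For example, a minimal `.env` file might look like the following sketch. The variable names below are illustrative and follow the Developer Edition installation instructions; confirm them against your own setup and substitute your values:

```
# Illustrative only: confirm the variable names against your Developer Edition installation
WO_DEVELOPER_EDITION_SOURCE=orchestrate
WO_INSTANCE=<your-service-instance-url>
WO_API_KEY=<your-api-key>
```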
Agent pre-deployment lifecycle
The agent development process before deployment is iterative and consists of four key stages:
- Develop: Build and configure the agent.
- Evaluate: Measure performance against reference data.
- Analyze: Inspect agent behavior and identify patterns or issues.
- Improve: Refine the agent based on evaluation results.
Evaluation process
Agent evaluation is driven by a user story, which provides all necessary context for the agent to perform tool-based tasks.
A user story consists of all the necessary context and the goals that the user expects to achieve in a specific interaction with the agent. A good user story includes user information, a description of the goal, and the overall context of the interaction.
During the evaluation process, the evaluation framework simulates a user interaction with the target agent, which is the agent being evaluated. This new interaction is driven by a user agent that is powered by an LLM and follows the instructions given in the user story. The user agent uses the context of the user story to send messages to the target agent.
The responses returned by the target agent are then compared with the expected responses defined for the user story, and the target agent passes the evaluation when they match.
Success criteria
A trajectory is a sequence of actions taken by an agent in response to a user query. A trajectory is considered successful when:
- The agent performs all required tool calls in the correct order with the expected input parameters.
- The agent generates a high-quality summary that effectively addresses the user’s query.
When both conditions are met, the evaluation framework marks the interaction as a successful journey, which contributes to the Journey Success metric.
Ground truth datasets
Ground truth datasets are used to benchmark and evaluate agent performance. These datasets are structured in JSON format and include:
- Full conversation context
- Tool call sequences
- Dependency graphs
- Expected final summaries
The evaluation framework uses this data to score agent behavior and identify areas for improvement.
Example dataset
| Field | Type | Description |
|---|---|---|
| `agent` | string | Name of the agent associated with this dataset |
| `goals` | object | Dependency map of goal identifiers to arrays of dependent goals |
| `goal_details` | array | Detailed list of tool calls and the final summary response |
| `story` | string | Narrative description of the storyline of the data |
| `starting_sentence` | string | The user's first utterance |
- `agent`
  - Type: `string`
  - Description: The name of the agent involved in the conversation. This is used to associate the dataset with a specific agent. To find the agent name, run `orchestrate agents list`.
- `goals`
  - Type: `object`
  - Description: Represents a goal dependency graph (a map of goal identifiers to arrays of dependent goals). This encodes the logical or causal order in which tool calls should occur. Each key is a unique goal name or step in the process (for example, `"get_assignment_id_hr_agent-1"`). Each value is an array of goals that depend on the completion of this goal.
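For instance, the value of the `goals` field might look like the following sketch (the dependent goal name here is hypothetical, chosen to match the time-off scenario used on this page):

```json
{
  "get_assignment_id_hr_agent-1": ["get_timeoff_schedule_hr_agent-1"]
}
```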
This means that the agent must get the Assignment ID before it gets the time-off schedule.
- `goal_details`
  - Type: `array`
  - Description: A step-by-step list of the tool calls and the final text response made by the agent. Each object includes details about the action, such as its type, name, arguments, and, for the final summary response, the actual text and extracted keywords.
  - Action types:
    - `"tool_call"`: a specific tool call, with arguments specified in `args`.
    - `"text"`: a text response, including the full response and key terms in `keywords`.
  - Fields for each action:
    - `type`: `"tool_call"` or `"text"`
    - `name`: action or step name (unique within each dataset)
    - `tool_name`: (for tool calls) name of the tool or function that is invoked
    - `args`: (for tool calls) dictionary of arguments passed to the tool
    - `response`: (for the summary step) the agent's natural-language response
    - `keywords`: (for the summary step) key terms in the response
- `story`
  - Type: `string`
  - Description: A free-text narrative summary of the user's intent, key variables, and the main storyline of the data.
- `starting_sentence`
  - Type: `string`
  - Description: The first user utterance, or the main initiating statement of the chat session.
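Putting these fields together, a minimal dataset might look like the following sketch. It is illustrative only: the agent name, tool names, arguments, and response text are hypothetical, and datasets generated by the ADK can contain additional detail.

```json
{
  "agent": "hr_agent",
  "goals": {
    "get_assignment_id_hr_agent-1": ["get_timeoff_schedule_hr_agent-1"],
    "get_timeoff_schedule_hr_agent-1": ["summarize"]
  },
  "goal_details": [
    {
      "type": "tool_call",
      "name": "get_assignment_id_hr_agent-1",
      "tool_name": "get_assignment_id",
      "args": { "username": "jdoe" }
    },
    {
      "type": "tool_call",
      "name": "get_timeoff_schedule_hr_agent-1",
      "tool_name": "get_timeoff_schedule",
      "args": { "assignment_id": "A-1001" }
    },
    {
      "type": "text",
      "name": "summarize",
      "response": "You have 12 days of paid time off remaining this year.",
      "keywords": ["12 days", "time off"]
    }
  ],
  "story": "The user jdoe wants to know how much paid time off they have left. The agent must first look up the user's assignment ID and then retrieve the time-off schedule for that assignment.",
  "starting_sentence": "How many vacation days do I have left this year?"
}
```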
Generating datasets
You don’t need to create the datasets manually. The ADK includes commands to generate them automatically depending on your use case. For more information, see Creating evaluation dataset.