Setup environment
By default, the evaluation framework uses your localhost development environment (watsonx Orchestrate Developer Edition). To use a different environment, pass a `.env` file with the `--env-file` flag. Set up the `.env` file the same way you do when you install watsonx Orchestrate Developer Edition.
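For illustration, a minimal `.env` file and invocation might look like the following. The variable values and the subcommand are placeholders; use the same variables you set when installing watsonx Orchestrate Developer Edition:

```shell
# .env — hypothetical example; mirror the variables from your
# Developer Edition installation
WO_INSTANCE=https://api.<region>.watson-orchestrate.ibm.com/instances/<instance-id>
# ...plus your API key and any other required variables

# Point the evaluation framework at this environment:
orchestrate <evaluation-command> --env-file .env
```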
Most of the evaluation framework capabilities depend on model-proxy or wx.ai models. If you use the Dallas data center, no extra configuration is needed. If you run the framework with model-proxy and a WO_INSTANCE that points to a non-Dallas region, supply the model override flag:
- For all regions except London, Tokyo, and Toronto
- For London, Tokyo, and Toronto
Agent Pre-Deployment Lifecycle
The agent development process before deployment is iterative and consists of four key stages:
- Develop: Build and configure the agent.
- Evaluate: Measure performance against reference data.
- Analyze: Inspect agent behavior and identify patterns or issues.
- Improve: Refine the agent based on evaluation results.
Evaluation process
Agent evaluation is driven by a user story, which provides all the context the agent needs to perform tool-based tasks. A user story consists of the context and goals that the user expects to achieve in a specific interaction with the agent. A good user story includes user information, a description of the goal, and the overall context of the interaction.

During the evaluation process, the evaluation framework simulates a user interaction with the target agent, which is the agent being evaluated. This interaction is driven by a user agent that is powered by an LLM and follows the instructions given in the user story. The user agent uses the context of the user story to send messages to the target agent. The responses returned by the target agent are then compared to the expected responses in the user story to determine whether the target agent passes the evaluation.

Success criteria
A trajectory is a sequence of actions taken by an agent in response to a user query. A trajectory is considered successful when:
- The agent performs all required tool calls in the correct order with the expected input parameters.
- The agent generates a high-quality summary that effectively addresses the user’s query.
When both conditions are met, the trajectory counts toward the Journey Success metric.
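As a rough illustration (not the framework's actual implementation), the two success conditions might be checked like this. The record shapes and the keyword-based summary check are assumptions for the sketch:

```python
def tool_calls_match(expected, actual):
    """Check that every expected tool call appears in `actual`
    in the same relative order, with matching arguments."""
    i = 0
    for call in actual:
        if (i < len(expected)
                and call["tool_name"] == expected[i]["tool_name"]
                and call["args"] == expected[i]["args"]):
            i += 1
    return i == len(expected)

def summary_covers_keywords(summary, keywords):
    """Keyword-based proxy for summary quality: every expected
    key term must appear in the agent's final response."""
    text = summary.lower()
    return all(k.lower() in text for k in keywords)

def journey_success(expected_calls, actual_calls, summary, keywords):
    """A trajectory passes only when both conditions hold."""
    return (tool_calls_match(expected_calls, actual_calls)
            and summary_covers_keywords(summary, keywords))
```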
Ground truth datasets
Ground truth datasets are used to benchmark and evaluate agent performance. These datasets are structured in JSON format and include:
- Full conversation context
- Tool call sequences
- Dependency graphs
- Expected final summaries
| Field | Type | Description |
|---|---|---|
| agent | string | Name of the agent associated with this dataset |
| goals | object | Dependency map of goal identifiers to arrays of dependent goals |
| goal_details | array | Detailed list of tool calls and the final summary response |
| story | string | Narrative description of the storyline of the data |
| starting_sentence | string | The user’s first utterance |
- `agent`
  - Type: string
  - Description: The name of the agent involved in the conversation. This is used to associate the dataset with a specific agent. To find the agent name, run `orchestrate agents list`.
- `goals`
  - Type: object
  - Description: Represents a goal dependency graph (a map of goal identifiers to arrays of dependent goals). This encodes the logical or causal order in which tool calls should occur. Each key is a unique goal name or step in the process (for example, `"get_assignment_id_hr_agent-1"`). Each value is an array of goals that depend on completion of this goal. This means that the agent must get the assignment ID before it gets the time off schedule.
- `goal_details`
  - Type: array
  - Description: A step-by-step list of the tool calls and final text response made by the agent. Each object includes details about the action, such as type, name, arguments, and (for the final summary response) the actual text and extracted keywords.
    - `"tool_call"`: A specific tool call, with arguments specified in `args`.
    - `"text"`: A text response, including the full response and key terms in `keywords`.

    Fields in each object:
    - `type`: `"tool_call"` or `"text"`
    - `name`: Action or step name (unique in each dataset)
    - `tool_name`: (For tool calls) Name of the tool or function invoked
    - `args`: (For tool calls) Dictionary of arguments passed to the tool
    - `response`: (For the summary step) The agent’s natural language response
    - `keywords`: (For the summary step) Key terms in the response
- `story`
  - Type: string
  - Description: A free-text narrative summary of the user’s intent, key variables, and the main storyline of the data.
- `starting_sentence`
  - Type: string
  - Description: The first user utterance or the main initiating statement of the chat session.

