The evaluation framework in the ADK is designed to assess agent behavior by comparing simulated agent interactions, referred to as trajectories, against a predefined set of reference data. It also includes tools to generate this reference data and supports the entire agent development lifecycle.

Set up your environment

By default, the evaluation framework runs against your local development environment (localhost). To point it at a different environment, pass a .env file with the --env-file argument. Set up the .env file in the same way as when you install the watsonx Orchestrate Developer Edition.
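For example, a .env file might look like the following. This is an illustrative sketch: the exact variables depend on your account type and are listed in the Developer Edition installation instructions.

# Illustrative .env file -- variable names and values depend on your
# account type; see the Developer Edition installation instructions.
WO_DEVELOPER_EDITION_SOURCE=orchestrate
WO_INSTANCE=<your_service_instance_url>
WO_API_KEY=<your_api_key>

You then supply this file to the evaluation commands with --env-file .env.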

Agent pre-deployment lifecycle

The agent development process before deployment is iterative and consists of four key stages:

  1. Develop: Build and configure the agent.
  2. Evaluate: Measure performance against reference data.
  3. Analyze: Inspect agent behavior and identify patterns or issues.
  4. Improve: Refine the agent based on evaluation results.

Evaluation process

Agent evaluation is driven by a user story, which provides all the context that the agent needs to perform tool-based tasks.

A user story describes the goals that the user expects to achieve in a specific interaction with the agent, along with all the necessary context. A good user story includes user information, a description of the goal, and the overall context of the interaction.

During the evaluation process, the evaluation framework simulates a user interaction with the target agent (the agent being evaluated). The interaction is driven by a user agent: an LLM-powered agent that follows the instructions in the user story and uses its context to send messages to the target agent.

The responses returned by the target agent are compared against the expected tool calls and summary defined for the user story; the target agent passes the evaluation when it meets the success criteria described below.
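At a high level, the simulation resembles the following loop. This is a minimal sketch for illustration only; UserAgent, TargetAgent, and their methods are hypothetical stand-ins, not ADK classes.

# Illustrative sketch of the simulated interaction -- user_agent and
# target_agent are hypothetical stand-ins, not ADK APIs.
def simulate(story, starting_sentence, user_agent, target_agent, max_turns=10):
    """Drive a simulated conversation and record the trajectory."""
    trajectory = []
    message = starting_sentence                    # the user's first utterance
    for _ in range(max_turns):
        reply = target_agent.chat(message)         # may trigger tool calls
        trajectory.append((message, reply))
        if user_agent.is_goal_met(story, trajectory):
            break                                  # user story goals achieved
        message = user_agent.next_message(story, trajectory)
    return trajectory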

Success criteria

A trajectory is a sequence of actions taken by an agent in response to a user query. A trajectory is considered successful when:

  • The agent performs all required tool calls in the correct order with the expected input parameters.
  • The agent generates a high-quality summary that effectively addresses the user’s query.

When both conditions are met, the evaluation framework marks the interaction as a successful journey, contributing to the Journey Success metric.
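The two conditions can be approximated in code. The following is a minimal sketch, not the framework's actual scoring logic: it checks that the expected tool calls appear in order with matching arguments, and that the summary contains the expected keywords, mirroring the keywords field shown in the ground truth datasets below.

def is_successful_journey(expected_calls, actual_calls, summary, keywords):
    """Approximate check of the two success conditions (illustrative only).

    expected_calls / actual_calls: ordered lists of (tool_name, args) pairs.
    summary: the agent's final natural-language response.
    keywords: terms the summary must contain.
    """
    # Condition 1: expected tool calls occur in order with the expected args.
    remaining = iter(actual_calls)
    calls_ok = all(call in remaining for call in expected_calls)

    # Condition 2: the summary addresses the query -- approximated here by
    # keyword matching, as in the ground truth datasets below.
    summary_ok = all(kw.lower() in summary.lower() for kw in keywords)

    return calls_ok and summary_ok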

Ground truth datasets

Ground truth datasets are used to benchmark and evaluate agent performance. These datasets are structured in JSON format and include:

  • Full conversation context
  • Tool call sequences
  • Dependency graphs
  • Expected final summaries

The evaluation framework uses this data to score agent behavior and identify areas for improvement.

Example dataset

{
    "agent": "hr_agent",
    "goals": {
        "get_assignment_id_hr_agent-1": [
            "get_timeoff_schedule_hr_agent-1"
        ],
        "get_timeoff_schedule_hr_agent-1": [
            "summarize"
        ]
    },
    "goal_details": [
        {
            "type": "tool_call",
            "name": "get_assignment_id_hr_agent-1",
            "tool_name": "get_assignment_id_hr_agent",
            "args": {
                "username": "nwaters"
            }
        },
        {
            "type": "tool_call",
            "name": "get_timeoff_schedule_hr_agent-1",
            "tool_name": "get_timeoff_schedule_hr_agent",
            "args": {
                "assignment_id": "15778303",
                "end_date": "2025-01-30",
                "start_date": "2025-01-01"
            }
        },
        {
            "name": "summarize",
            "type": "text",
            "response": "Your time off schedule is on January 5, 2025.",
            "keywords": [
                "January 5, 2025"
            ]
        }
    ],
    "story": "You want to know your time off schedule. Your username is nwaters. The start date is 2025-01-01. The end date is 2025-01-30.",
    "starting_sentence": "I want to know my time off schedule"
}
Field              Type     Description
agent              string   Name of the agent associated with this dataset
goals              object   Dependency map of goal identifiers to arrays of dependent goals
goal_details       array    Detailed list of tool calls and the final summary response
story              string   Narrative description of the storyline of the data
starting_sentence  string   The user's first utterance
  • agent

    • Type: string
    • Description: The name of the agent involved in the conversation. This associates the dataset with a specific agent. To find the agent name, run orchestrate agents list.
  • goals

    • Type: object
    • Description: Represents a goal dependency graph: a map of goal identifiers to arrays of dependent goals. This encodes the logical or causal order in which tool calls should occur. Each key is a unique goal name or step in the process (for example, "get_assignment_id_hr_agent-1"). Each value is an array of goals that depend on the completion of this goal. For a programmatic check of this ordering, see the sketch after this list.

    Example:

    "get_assignment_id_hr_agent-1": ["get_timeoff_schedule_hr_agent-1"]
    

    This means that the agent must get the assignment ID before it gets the time off schedule.

  • goal_details

    • Type: array
    • Description: A step-by-step list of the tool calls and the final text response made by the agent. Each object includes details about the action, such as its type, name, and arguments, and (for the final summary response) the actual text and extracted keywords.

    Action Types:

    • "tool_call": A specific tool call, with arguments specified in args.
    • "text": A text response, including the full response and key terms in keywords.

    Fields for each action:

    • type: "tool_call" or "text"
    • name: Action or step name (unique in each dataset)
    • tool_name: (For tool calls) Name of the tool/function invoked
    • args: (For tool calls) Dictionary of arguments passed to the tool
    • response: (For summary step) The agent’s natural language response
    • keywords: (For summary step) Key terms in the response
  • story

    • Type: string
    • Description: A free-text narrative summary of the user’s intent, key variables, and the main storyline of the data.
  • starting_sentence

    • Type: string
    • Description: The first user utterance or the main initiating statement of the chat session.
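
Putting the fields together, the goals graph can be validated programmatically. The following minimal sketch (illustrative, not part of the ADK) loads a ground truth file like the example above and checks that an observed sequence of goal names respects every dependency edge:

import json

def load_dataset(path: str) -> dict:
    """Load a ground truth dataset, such as the hr_agent example above."""
    with open(path) as f:
        return json.load(f)

def respects_dependencies(goals: dict, observed_order: list) -> bool:
    """Check that an observed goal order satisfies the dependency graph.

    goals maps each goal to the goals that depend on it, so for every
    edge prerequisite -> dependent, the prerequisite must come first.
    """
    position = {name: i for i, name in enumerate(observed_order)}
    for prerequisite, dependents in goals.items():
        for dependent in dependents:
            if prerequisite not in position or dependent not in position:
                return False              # a required goal was never reached
            if position[prerequisite] > position[dependent]:
                return False              # dependency order violated
    return True

For the example dataset, the order get_assignment_id_hr_agent-1, get_timeoff_schedule_hr_agent-1, summarize satisfies the graph; swapping the two tool calls does not.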

Generating datasets

You don’t need to create the datasets manually. The ADK includes commands to generate them automatically depending on your use case. For more information, see Creating evaluation dataset.