
# Overview

The **evaluation framework** in the ADK is designed to assess agent behavior by comparing simulated agent interactions, referred to as *trajectories*, against a predefined set of reference data. It also includes tools to generate this reference data and supports the entire agent development lifecycle.

## Setup environment

By default, the evaluation framework runs against your `localhost` development environment (watsonx Orchestrate Developer Edition). To target a different environment, pass a `.env` file with the `--env-file` flag. Set up this `.env` file the same way you do when you [install the watsonx Orchestrate Developer Edition](../developer_edition/wxOde_setup).

Most evaluation framework capabilities depend on model-proxy or wx.ai models. With the Dallas data center, the defaults work as is. If you run the framework with model-proxy and a `WO_INSTANCE` that points to a non-Dallas region, set the `MODEL_OVERRIDE` environment variable:

<Tabs>
  <Tab title="For all regions except London, Tokyo, and Toronto">
    ```bash theme={null}
    export MODEL_OVERRIDE="meta-llama/llama-3-2-90b-vision-instruct"
    ```
  </Tab>

  <Tab title="For London, Tokyo, and Toronto">
    ```bash theme={null}
    export MODEL_OVERRIDE="meta-llama/llama-3-3-70b-instruct"
    ```
  </Tab>
</Tabs>
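
If you script your evaluation runs, the region rule above can be captured in a small helper. The following is a minimal sketch, not part of the ADK: the function names and the lowercase region labels are hypothetical, and only the two model names come from the tabs above.

```python
import os
from typing import MutableMapping, Optional

# Hypothetical region labels; the model names are the ones listed in the tabs above.
LLAMA_3_3_REGIONS = {"london", "tokyo", "toronto"}


def model_override_for(region: str) -> Optional[str]:
    """Return the MODEL_OVERRIDE value for a region, or None when no override is needed."""
    key = region.strip().lower()
    if key == "dallas":
        return None  # Dallas works with the defaults
    if key in LLAMA_3_3_REGIONS:
        return "meta-llama/llama-3-3-70b-instruct"
    return "meta-llama/llama-3-2-90b-vision-instruct"


def apply_model_override(region: str, env: MutableMapping[str, str] = os.environ) -> None:
    """Set or clear MODEL_OVERRIDE in the given environment mapping."""
    override = model_override_for(region)
    if override is None:
        env.pop("MODEL_OVERRIDE", None)
    else:
        env["MODEL_OVERRIDE"] = override
```

Passing a plain dictionary as `env` keeps the helper easy to test without touching the real process environment.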

The evaluation framework works with SaaS and on-premises offerings of watsonx Orchestrate. On remote instances, evaluations run only in the **draft** environment.

To evaluate against a remote instance, activate its environment and run the evaluation commands while that environment is active. For more information, see [Configure access to remote environments](../environment/initiate_environment).

## Agent Pre-Deployment Lifecycle

The agent development process before deployment is iterative and consists of four key stages:

```mermaid theme={null}
graph LR
    A[Develop] --> B[Evaluate]
    B --> C[Analyze]
    C --> D[Improve]
    D --> A
```

1. **Develop**: Build and configure the agent.
2. **Evaluate**: Measure performance against reference data.
3. **Analyze**: Inspect agent behavior and identify patterns or issues.
4. **Improve**: Refine the agent based on evaluation results.

## Evaluation process

Agent evaluation is driven by a **user story**, which provides all necessary context for the agent to perform tool-based tasks. A user story describes the context and goals that the user expects to achieve in a specific interaction with the agent. A good user story includes user information, a description of the goal, and the overall context of the interaction.

During the evaluation process, the framework simulates a user interaction with the **target agent**, which is the agent being evaluated. The interaction is driven by a **user agent** that is powered by an LLM and follows the instructions given in the **user story**. The **user agent** uses the context of the user story to send messages to the **target agent**.

The responses returned by the **target agent** are compared to the expected responses in the **user story**, and if they are an exact match, the target agent successfully passes the evaluation.

## Success criteria

A **trajectory** is a sequence of actions taken by an agent in response to a user query. A trajectory is considered successful when:

* The agent performs all required tool calls in the correct order with the expected input parameters.
* The agent generates a high-quality summary that effectively addresses the user's query.

When both conditions are met, the evaluation framework marks the interaction as a **successful journey**, contributing to the `Journey Success` metric.
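
The two conditions can be sketched as a single check. This is an illustrative simplification rather than the framework's scoring code: the function names are hypothetical, extra tool calls between the required ones are assumed to be tolerated, and summary quality is approximated by checking that every expected keyword appears.

```python
from typing import Any, Mapping, Sequence

Call = Mapping[str, Any]


def calls_in_order(expected: Sequence[Call], actual: Sequence[Call]) -> bool:
    """True when every expected call appears in `actual`, in the same relative order."""
    remaining = iter(actual)  # shared iterator, so each match must come after the last one
    return all(
        any(a["tool_name"] == e["tool_name"] and a["args"] == e["args"] for a in remaining)
        for e in expected
    )


def is_successful_journey(
    expected: Sequence[Call],
    actual: Sequence[Call],
    summary: str,
    keywords: Sequence[str],
) -> bool:
    """Both conditions must hold: the tool calls and the final summary."""
    # Stand-in for summary quality (assumption): all keywords are present.
    summary_ok = all(kw.lower() in summary.lower() for kw in keywords)
    return calls_in_order(expected, actual) and summary_ok
```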

## Ground truth datasets

**Ground truth datasets** are used to benchmark and evaluate agent performance. These datasets are structured in JSON format and include:

* Full conversation context
* Tool call sequences
* Dependency graphs
* Expected final summaries

The evaluation framework uses this data to score agent behavior and identify areas for improvement.

**Example dataset**

```json theme={null}
{
    "agent": "hr_agent",
    "goals": {
        "get_assignment_id_hr_agent-1": [
            "get_timeoff_schedule_hr_agent-1"
        ],
        "get_timeoff_schedule_hr_agent-1": [
            "summarize"
        ]
    },
    "goal_details": [
        {
            "type": "tool_call",
            "name": "get_assignment_id_hr_agent-1",
            "tool_name": "get_assignment_id_hr_agent",
            "args": {
                "username": "nwaters"
            }
        },
        {
            "type": "tool_call",
            "name": "get_timeoff_schedule_hr_agent-1",
            "tool_name": "get_timeoff_schedule_hr_agent",
            "args": {
                "assignment_id": "15778303",
                "end_date": "2025-01-30",
                "start_date": "2025-01-01"
            }
        },
        {
            "name": "summarize",
            "type": "text",
            "response": "Your time off schedule is on January 5, 2025.",
            "keywords": [
                "January 5, 2025"
            ]
        }
    ],
    "story": "You want to know your time off schedule. Your username is nwaters. The start date is 2025-01-01. The end date is 2025-01-30.",
    "starting_sentence": "I want to know my time off schedule"
}
```

| Field               | Type     | Description                                                     |
| ------------------- | -------- | --------------------------------------------------------------- |
| `agent`             | `string` | Name of the agent associated with this dataset                  |
| `goals`             | `object` | Dependency map of goal identifiers to arrays of dependent goals |
| `goal_details`      | `array`  | Detailed list of tool calls and final summary response          |
| `story`             | `string` | Narrative description of the storyline of the data              |
| `starting_sentence` | `string` | The user's first utterance                                      |

* `agent`
  * Type: `string`
  * Description: The name of the agent involved in the conversation. This is used to associate the dataset with a specific agent. To find the agent name, run `orchestrate agents list`.

* `goals`

  * Type: `object`
  * Description: Represents a goal dependency graph (a map of goal identifiers to arrays of dependent goals). This encodes the logical or causal order in which tool calls should occur. Each key is a unique goal name or step in the process (e.g., `"get_assignment_id_hr_agent-1"`). Each value is an array of goals that depend on the completion of this goal.

  *Example:*

  ```json theme={null}
  "get_assignment_id_hr_agent-1": ["get_timeoff_schedule_hr_agent-1"]
  ```

  This means that the agent must get the assignment ID before it gets the time off schedule.

* `goal_details`

  * Type: `array`
  * Description: A step-by-step list of tool calls and final text response made by the agent. Each object includes details about the action, such as type, name, arguments, and (for the final summary response) the actual text and extracted keywords.

  *Action types:*

  * `"tool_call"`: A specific tool call, with arguments specified in `args`.
  * `"text"`: A text response, including the full response and key terms in `keywords`.

  *Fields for each action:*

  * `type`: `"tool_call"` or `"text"`
  * `name`: Action or step name (unique in each dataset)
  * `tool_name`: (For tool calls) Name of the tool/function invoked
  * `args`: (For tool calls) Dictionary of arguments passed to the tool
  * `response`: (For summary step) The agent’s natural language response
  * `keywords`: (For summary step) Key terms in the response

* `story`
  * Type: `string`
  * Description: A free-text narrative summary of the user’s intent, key variables, and the main storyline of the data.

* `starting_sentence`
  * Type: `string`
  * Description: The first user utterance or the main initiating statement of the chat session.
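
To see how the `goals` map encodes ordering, the sketch below derives a valid step order from a dataset and checks that every goal has a matching `goal_details` entry. It is an illustration using Python's standard `graphlib` module, not the framework's own loader, and `goal_order` is a hypothetical helper name.

```python
from graphlib import TopologicalSorter


def goal_order(dataset: dict) -> list:
    """Return goal names in an order that respects the `goals` dependency map."""
    known = {detail["name"] for detail in dataset["goal_details"]}
    sorter = TopologicalSorter()
    for goal, dependents in dataset["goals"].items():
        sorter.add(goal)
        for dependent in dependents:
            # `dependent` may only run after `goal` has completed.
            sorter.add(dependent, goal)
    order = list(sorter.static_order())  # raises graphlib.CycleError on circular goals
    missing = [name for name in order if name not in known]
    if missing:
        raise ValueError(f"goals without goal_details entries: {missing}")
    return order


# Trimmed version of the example dataset above.
dataset = {
    "goals": {
        "get_assignment_id_hr_agent-1": ["get_timeoff_schedule_hr_agent-1"],
        "get_timeoff_schedule_hr_agent-1": ["summarize"],
    },
    "goal_details": [
        {"name": "get_assignment_id_hr_agent-1", "type": "tool_call"},
        {"name": "get_timeoff_schedule_hr_agent-1", "type": "tool_call"},
        {"name": "summarize", "type": "text"},
    ],
}
print(goal_order(dataset))
# ['get_assignment_id_hr_agent-1', 'get_timeoff_schedule_hr_agent-1', 'summarize']
```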

## Argument Matching Strategies

The framework supports multiple argument matching strategies for tool call validation:

### Strict/Exact Matching (Default)

```json theme={null}
{
  "args": {
    "username": "nwaters",
    "assignment_id": "15778303"
  },
  "arg_matching": {
    "username": "strict",
    "assignment_id": "strict"
  }
}
```

* Requires exact match of argument values
* Both field name and value must match exactly
* Default strategy if `arg_matching` is not specified

### Fuzzy Matching

```json theme={null}
{
  "args": {
    "username": "nwaters",
    "query": "time off schedule information"
  },
  "arg_matching": {
    "username": "strict",
    "query": "fuzzy"
  }
}
```

* Uses semantic similarity for matching
* Useful for natural language fields (e.g., allows "time off schedule" to match with "time off schedule information")
* Threshold controlled by `similarity_threshold` in `TestConfig`

#### How Fuzzy Matching Works

The implementation lives in `src/agentops/llm_matching.py` (`LLMMatcher`) and tries the following strategies in order:

1. **Primary**: LLM-as-a-judge semantic match (`llm_semantic_match`)
2. **Fallback 1**: Cosine similarity over embeddings (default model: `sentence-transformers/all-minilm-l6-v2`)
3. **Fallback 2**: `fuzzywuzzy.WRatio` (token-similarity score)
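
The primary-then-fallback pattern can be sketched generically. The code below is not the `LLMMatcher` implementation: the function names are hypothetical, the condition for falling back (an exception or a `None` score) is an assumption, and `difflib.SequenceMatcher` from the standard library stands in for the final string-similarity step.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, Optional

# A scorer returns a similarity in [0, 1], or None when it cannot produce one.
Scorer = Callable[[str, str], Optional[float]]


def first_available_score(expected: str, actual: str, scorers: Iterable[Scorer]) -> float:
    """Try each scorer in order; fall through when one raises or returns None."""
    for scorer in scorers:
        try:
            score = scorer(expected, actual)
        except Exception:
            continue  # e.g. a model endpoint is unreachable; try the next strategy
        if score is not None:
            return score
    return 0.0  # no strategy produced a score


def string_ratio(expected: str, actual: str) -> float:
    """Standard-library stand-in for the last fallback; returns a value in [0, 1]."""
    return SequenceMatcher(None, expected.lower(), actual.lower()).ratio()


def fuzzy_match(
    expected: str,
    actual: str,
    scorers: Iterable[Scorer],
    similarity_threshold: float = 0.8,
) -> bool:
    """Accept the argument when the first available score reaches the threshold."""
    return first_available_score(expected, actual, scorers) >= similarity_threshold
```

Ordering the scorers from most to least capable means a cheaper strategy only decides the match when the stronger ones are unavailable.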

#### Configuring Fuzzy Matching

```yaml theme={null}
# In config.yml
similarity_threshold: 0.8  # Similarity score threshold (0.0-1.0, default: 0.8)
enable_fuzzy_matching: true  # Enable fuzzy matching globally
```

### Ignore Matching

```json theme={null}
{
  "args": {
    "username": "nwaters",
    "assignment_id": "15778303",
    "metadata": "optional_field"
  },
  "arg_matching": {
    "username": "strict",
    "assignment_id": "strict",
    "metadata": "ignore"
  }
}
```

* Field is completely ignored during validation
* No penalty whether present or absent
* Useful for fields that vary but don't affect correctness

### Optional Matching

```json theme={null}
{
  "args": {
    "assignment_id": "15778303",
    "start_date": "2025-01-01"
  },
  "arg_matching": {
    "assignment_id": "strict",
    "start_date": "optional"
  }
}
```

* Field is optional in the agent's tool call
* If present, must match exactly
* If absent, no penalty
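
Taken together, the four strategies amount to a per-argument decision. The sketch below is one way to express those rules under stated assumptions: `args_match` is a hypothetical name, arguments that the agent sends but the dataset does not list are not checked, and the default `fuzzy` comparator is a trivial case-insensitive stand-in for the semantic matcher.

```python
from typing import Any, Callable, Mapping


def _case_insensitive(expected: str, actual: str) -> bool:
    """Trivial stand-in for a semantic matcher (assumption)."""
    return expected.lower() == actual.lower()


def args_match(
    expected_args: Mapping[str, Any],
    actual_args: Mapping[str, Any],
    arg_matching: Mapping[str, str],
    fuzzy: Callable[[str, str], bool] = _case_insensitive,
) -> bool:
    """Apply the per-argument strategy to every expected argument."""
    for name, expected in expected_args.items():
        strategy = arg_matching.get(name, "strict")  # strict is the documented default
        if strategy not in {"strict", "fuzzy", "ignore", "optional"}:
            raise ValueError(f"unknown matching strategy: {strategy!r}")
        if strategy == "ignore":
            continue  # never validated, present or absent
        if name not in actual_args:
            if strategy == "optional":
                continue  # absent optional argument: no penalty
            return False
        actual = actual_args[name]
        if strategy == "fuzzy":
            if not fuzzy(str(expected), str(actual)):
                return False
        elif actual != expected:  # strict, or optional when present
            return False
    return True
```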

## Generating datasets

You don't need to create the datasets manually. The ADK includes commands to generate them automatically depending on your use case. For more information, see [Creating evaluation dataset](./create_data).
