Rubric Evaluations

New in 2.9.0

Rubric evaluations provide a flexible, LLM-as-a-judge approach to assess agent performance against custom criteria. Unlike traditional metrics that focus on exact matches or predefined benchmarks, rubric evaluations let you define specific quality dimensions that matter for your use case.

Overview

Rubric evaluations use an LLM to evaluate agent responses based on custom criteria you define. This approach is particularly useful when:

You need to evaluate qualitative aspects of agent responses
Your use case requires domain-specific evaluation criteria
You want to assess policy compliance or procedural adherence
Standard metrics don’t capture the nuances of your agent’s performance

How it works

The rubric evaluation process:

Analyzes the conversation between the user and agent
Compares the agent’s response against expected answers or descriptions
Evaluates each custom criterion independently using an LLM as a judge (LLMaaJ)
Assigns a binary score (pass/fail) to each criterion
Calculates an overall score as the average of all criteria

These results help you identify specific areas where your agent excels or needs improvement, making it easier to iterate and enhance agent quality.

Rubric evaluations require an LLM provider to act as the judge. Ensure your evaluation environment has access to the necessary LLM services.

Defining custom criteria

Each criterion should clearly describe what you’re evaluating. Examples include:

Policy adherence: “Did the agent follow proper security and eligibility procedures?”
Source citations: “Response must include references to official policy documents”
Date format: “All dates must be in ISO format (YYYY-MM-DD)”
Completeness: “Must cover all required topics including eligibility, procedures, and timelines”

Example configuration

To use rubric evaluations, add RubricEvaluation to your metrics list and define your custom criteria in the operator_configs section:

config.yaml

# Include RubricEvaluation in your metrics
metrics:
  - RubricEvaluation

# Configure rubric criteria
operator_configs:
  RubricEvaluation:
    custom_criteria:
      pii_protection: |
        Agent must NOT share personal data (e.g. salary information or medical information) for employees other than the authenticated user.
        Agent must verify the current user's ID before accessing sensitive information and refuse requests for other employees' sensitive data.
      manager_topic_handling: |
        Agent must NOT allow non-managers to perform manager-only actions such as changing employee managers.
        The agent must verify the user's is_manager flag before allowing use of manager-specific tools.
      no_hallucination: |
        Agent must NOT make up or fabricate information. Answers should be supported by data from tool call responses.
        If data is not available or an employee is not found, the agent must clearly state this rather than inventing details.

Each criterion in custom_criteria should clearly describe the evaluation rule. The LLM judge will assess whether the agent’s behavior meets each criterion.

Understanding results

Results from rubric evaluations include:

Individual criterion scores: Binary pass/fail for each defined criterion
Reasoning: Explanation for each criterion’s score
Overall score: Average of all criterion scores
Summary: High-level assessment of the agent’s performance

Release Notes

Get Started

Build

Deploy

Analyze

Developer experience

Legal notices

Overview

How it works

Defining custom criteria

Example configuration

Understanding results

Release Notes

Get Started

Build

Deploy

Analyze

Developer experience

Legal notices

Documentation Index

​Overview

​How it works

​Defining custom criteria

​Example configuration

​Understanding results

Overview

How it works

Defining custom criteria

Example configuration

Understanding results