New in 2.9.0
Overview
Rubric evaluations use an LLM to evaluate agent responses based on custom criteria you define. This approach is particularly useful when:
- You need to evaluate qualitative aspects of agent responses
- Your use case requires domain-specific evaluation criteria
- You want to assess policy compliance or procedural adherence
- Standard metrics don’t capture the nuances of your agent’s performance
How it works
The rubric evaluation process:
- Analyzes the conversation between the user and agent
- Compares the agent’s response against expected answers or descriptions
- Evaluates each custom criterion independently using an LLM as a judge (LLMaaJ)
- Assigns a binary score (pass/fail) to each criterion
- Calculates an overall score as the average of all criteria
Rubric evaluations require an LLM provider to act as the judge. Ensure your evaluation environment has access to the necessary LLM services.
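The scoring steps above can be sketched in Python. This is a minimal illustration of binary per-criterion scoring and averaging, not the product's implementation: score_rubric and stub_judge are hypothetical names, and the judge is a simple keyword check standing in for the real LLM call.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical judge signature: (criterion, agent_response) -> True if the criterion passes.
Judge = Callable[[str, str], bool]


def score_rubric(
    criteria: List[str], response: str, judge: Judge
) -> Tuple[List[Tuple[str, int]], float]:
    """Score each criterion independently as 1 (pass) or 0 (fail), then average."""
    if not criteria:
        raise ValueError("at least one criterion is required")
    scores = [(criterion, int(judge(criterion, response))) for criterion in criteria]
    overall = sum(score for _, score in scores) / len(scores)
    return scores, overall


def stub_judge(criterion: str, response: str) -> bool:
    """Stand-in for the LLM judge: a keyword check, used only to make the sketch runnable."""
    if "ISO format" in criterion:
        return re.search(r"\b\d{4}-\d{2}-\d{2}\b", response) is not None
    if "references" in criterion:
        return "policy document" in response.lower()
    return False


criteria = [
    "All dates must be in ISO format (YYYY-MM-DD)",
    "Response must include references to official policy documents",
]
response = "Your leave request is approved starting 2025-03-01."

scores, overall = score_rubric(criteria, response, stub_judge)
print(overall)  # 0.5: one of the two criteria passes
```

Because each criterion contributes either 0 or 1, the overall score can be read as the fraction of criteria that passed.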
Defining custom criteria
Each criterion should clearly describe what you’re evaluating. Examples include:
- Policy adherence: “Did the agent follow proper security and eligibility procedures?”
- Source citations: “Response must include references to official policy documents”
- Date format: “All dates must be in ISO format (YYYY-MM-DD)”
- Completeness: “Must cover all required topics including eligibility, procedures, and timelines”
Example configuration
To use rubric evaluations, add RubricEvaluation to your metrics list and define your custom criteria in the operator_configs section:
config.yaml
custom_criteria should clearly describe the evaluation rule. The LLM judge will assess whether the agent’s behavior meets each criterion.
Understanding results
Results from rubric evaluations include:
- Individual criterion scores: Binary pass/fail for each defined criterion
- Reasoning: Explanation for each criterion’s score
- Overall score: Average of all criterion scores
- Summary: High-level assessment of the agent’s performance
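One way to picture the four result components is as a small Python data structure. The sketch below is an assumption for illustration only: the class and field names (CriterionResult, RubricResult, passed, reasoning, summary) are hypothetical and do not reflect the actual output schema, and the sample values are hand-written, not real evaluation output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CriterionResult:
    criterion: str  # the criterion text as defined in the rubric
    passed: bool    # binary pass/fail
    reasoning: str  # the judge's explanation for the score


@dataclass
class RubricResult:
    criteria: List[CriterionResult]
    summary: str  # high-level assessment of the agent's performance

    @property
    def overall_score(self) -> float:
        """Average of the binary criterion scores (pass = 1, fail = 0)."""
        if not self.criteria:
            raise ValueError("a result needs at least one criterion")
        return sum(r.passed for r in self.criteria) / len(self.criteria)

    def failures(self) -> List[CriterionResult]:
        """Return only the criteria that failed, in their original order."""
        return [r for r in self.criteria if not r.passed]


# Hand-written illustrative values, not output from a real run.
result = RubricResult(
    criteria=[
        CriterionResult(
            "All dates must be in ISO format (YYYY-MM-DD)",
            True,
            "The only date in the response is written as 2025-03-01.",
        ),
        CriterionResult(
            "Response must include references to official policy documents",
            False,
            "The response does not cite any policy document.",
        ),
    ],
    summary="Dates are formatted correctly, but no sources are cited.",
)

for failed in result.failures():
    print(f"FAIL: {failed.criterion} ({failed.reasoning})")
```

Reading the failed criteria together with their reasoning is usually more actionable than the overall score alone, since the average hides which rule was violated.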

