Before you begin
To analyze results, you must first evaluate your agent. For more information, see Evaluating agents and tools.
Analyzing
The analyze command provides a detailed breakdown of your agent evaluation results, highlighting where the agent succeeded, where it failed, and why.
The analyze command generates an overview analysis for each dataset result in the specified directory. It helps you quickly identify:
- Which tool calls were expected and made
- Which tool calls were irrelevant or incorrect
- Missed tool calls
- Any parameter mismatches
- A high-level summary of the agent’s performance
The analysis output includes the following sections:
- Analysis Summary: Displays the overall evaluation type (e.g., Multi-run), the total number of runs, the number of runs with problems, and the overall status. This provides a quick, high-level view of the evaluation results.
- Test Case Summary: Presents key counts for each test run, including expected vs. actual tool calls, correct tool calls, text match results, and journey success status.
- Conversation History: Step-by-step breakdown of every message exchanged, providing insight into where things went right or wrong.
- Analysis Results: Details the specific mistakes, along with the reasoning for each error (e.g., irrelevant tool calls).
Flags
- Directory where your evaluation results are saved.
- --tools-path (new in 1.11.0): Directory containing tool definitions.
- Path to the .env file that overrides the default environment.
- The analysis mode, either default or enhanced. The enhanced mode optionally provides docstring enrichments for tools.
By default, docstring recommendations are limited to tools with minimal descriptions. To enable more recommendations, set export GATE_TOOL_ENRICHMENTS=false. Before using a recommendation, make a copy of the tool, try the new recommended docstring, and validate the performance again.
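As an illustration of that copy-and-validate step, the following minimal sketch assumes a hypothetical Python tool named get_weather; the copy (get_weather_v2) carries the recommended docstring while the original stays untouched. Both names and docstrings are placeholders, not part of the product.

```python
# Hypothetical tool module, used only to illustrate the copy-and-validate workflow.

def get_weather(city: str) -> str:
    """Gets weather."""  # Original, minimal docstring left unchanged.
    raise NotImplementedError


def get_weather_v2(city: str) -> str:
    """Return the current weather conditions for the given city.

    Args:
        city: Name of the city to look up, for example "Austin".

    Returns:
        A short, human-readable summary of the current conditions.
    """
    # Copy of the tool with the recommended docstring applied.
    # Re-run the evaluation against this copy before replacing the original.
    raise NotImplementedError
```

Keeping the original tool unchanged makes it straightforward to compare evaluation results before and after adopting the recommended docstring.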
Analyzing tools
New in 1.11.0
When you run the analyze command with the --tools-path flag, the analyzer will do the following (a sample tools file is sketched after the note below):
- Inspect the Python source file containing your tool definitions.
- Evaluate the quality of each tool’s description (docstring).
- Display:
  - A warning if the description is missing or classified as poor.
  - An OK message if the description meets quality standards.
Note:
- Description quality analysis only runs for tools that failed during evaluation.
- Currently, only Python tools are supported.
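As a rough sketch of what the analyzer inspects, consider a hypothetical Python tools file passed via --tools-path. The function names and signatures below are invented for illustration; the point is the docstrings: the first tool would be flagged with a missing or poor description warning, while the second would typically be reported as OK (assuming both tools failed during evaluation, since only failed tools are analyzed).

```python
# Hypothetical tools file; function names and behavior are illustrative only.

def lookup_order(order_id: str) -> dict:
    # Missing docstring: the analyzer would warn that the description is missing or poor.
    raise NotImplementedError


def cancel_order(order_id: str, reason: str) -> bool:
    """Cancel an existing order and record the reason for the cancellation.

    Args:
        order_id: Identifier of the order to cancel.
        reason: Free-text explanation provided by the customer.

    Returns:
        True if the order was cancelled successfully, False otherwise.
    """
    # A descriptive docstring like this one would typically be reported as OK.
    raise NotImplementedError
```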
Example Output
Running analyze on the evaluation results of a dataset, such as examples/evaluations/analysis/multi_run_example, produces output like the following for both runs of data_commplex.json:

Example of enhanced mode

