IMPORTANT NOTICE: This guide focuses on helping you understand how to measure and optimize agent performance in your wxO solutions. Performance will vary significantly based on your specific workload, configuration, and system load. Always measure performance in your own environment.
Note: “Flow” in this document refers to a wxO Agentic Workflow.

Agent Performance Overview

What is an Agent in wxO?

An agent in watsonx Orchestrate is an intelligent component that:
  • Reasons about problems and makes decisions
  • Orchestrates tools and flows to accomplish tasks
  • Interacts with users through natural language
  • Adapts its approach based on context and results

Agent Execution Model (ReAct Loop)

Agents use a ReAct (Reasoning and Acting) loop to solve problems.
ReAct Loop Steps:
  1. Thought: Analyze the situation (LLM call)
  2. Action: Select tool or provide answer (LLM call)
  3. Observation: Execute tool and observe result
  4. Repeat or Finish: Continue loop or provide answer
Key Characteristics:
  • Each loop iteration involves multiple LLM calls
  • Agents can call tools (Python, Langflow, APIs, Flows)
  • Agents can invoke other agents
  • Loop continues until agent determines task is complete
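The loop above can be sketched as a minimal Python program. Everything here is a stand-in, not a wxO API: `llm` is a stubbed model and `calculator` a stubbed tool, so the sketch runs on its own.

```python
# Minimal sketch of a ReAct loop; `llm` and `tools` are stand-ins for
# the real model and tool registry, stubbed so the sketch is runnable.

def llm(prompt: str) -> str:
    # Stub: a real agent would call the model here. It "reasons" that
    # once an observation is present, the task is done.
    return "FINISH: 42" if "observation" in prompt else "CALL: calculator"

tools = {"calculator": lambda: "42"}

def react_loop(query: str, max_iters: int = 5) -> str:
    context = query
    for _ in range(max_iters):
        decision = llm(context)                # Thought + Action (LLM call)
        if decision.startswith("FINISH:"):     # agent decides it is done
            return decision.removeprefix("FINISH:").strip()
        tool_name = decision.removeprefix("CALL:").strip()
        observation = tools[tool_name]()       # Observation: execute the tool
        context += f"\nobservation: {observation}"
    return "max iterations reached"

print(react_loop("what is 6 * 7?"))
```

Note how each iteration spends at least one LLM call before any tool runs — this is the reasoning overhead the rest of this guide tries to minimize.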
Performance Impact of Nested Agents
  • When an agent calls another agent, the nested agent also runs its own ReAct loop
  • Each nested agent adds its own reasoning overhead (LLM calls for thinking and action selection)
  • Deep agent hierarchies multiply LLM inference time, roughly in proportion to nesting depth
  • Example: Agent A → Agent B → Agent C means 3x the reasoning overhead
  • Recommendation: Limit agent nesting depth to 2-3 levels maximum for optimal performance
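As a rough illustration of why depth matters, here is a back-of-the-envelope model. The per-loop call counts are assumptions for illustration, not measured wxO figures.

```python
# Back-of-the-envelope model of reasoning overhead for nested agents.
# All numbers are illustrative assumptions, not measured wxO values.

def estimated_llm_calls(depth: int, loops_per_agent: int = 2) -> int:
    """Each agent level runs its own ReAct loop; assume every loop
    iteration makes roughly two LLM calls (thought + action selection)."""
    return depth * loops_per_agent * 2

for depth in (1, 2, 4):
    print(f"nesting depth {depth}: ~{estimated_llm_calls(depth)} LLM calls")
```

Even with these modest assumptions, a four-level hierarchy quadruples the reasoning cost of a single agent.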

Agent Performance Components

Total agent response time consists of:
Total Agent Response Time =
  Guidelines Processing Time (if applicable) +
  Reasoning Loop Time +
  Tool Selection Time +
  Tool Execution Time +
  LLM Inference Time +
  Context Processing Time +
  Plugin Logic Time (if applicable)
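A toy breakdown can show how the components add up and which ones dominate. All timings below are invented for illustration; measure your own.

```python
# Hypothetical breakdown (seconds) of one agent run, mirroring the
# components above; the figures are made up for illustration only.
components = {
    "guidelines_processing": 0.4,
    "reasoning_loop": 1.2,
    "tool_selection": 0.3,
    "tool_execution": 0.8,
    "llm_inference": 1.5,
    "context_processing": 0.2,
    "plugin_logic": 0.1,
}

total = sum(components.values())
print(f"total agent response time: {total:.1f}s")
# Rank components by share of total time to find optimization targets.
for name, t in sorted(components.items(), key=lambda kv: -kv[1]):
    print(f"  {name}: {t / total:.0%}")
```

In this made-up profile, LLM inference and the reasoning loop account for well over half the total — which is why most of the strategies below target them.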
What Affects Performance:
  • Guidelines processing: LLM call before agent loop to process guidelines (if configured)
  • Number of reasoning loops: More loops = longer response time
  • Tool selection complexity: More available tools = longer selection time
  • Tool execution time: Slow tools slow down the entire agent
  • LLM model size: Larger models are slower but more capable
  • Context size: Larger context = longer processing time
  • Prompt complexity: Complex prompts require more reasoning
  • Plugin logic: Custom plugin code execution (if using plugins)

How to Test Agent Performance

Testing Methodology

For comprehensive testing methodology, see the main Performance Guide which covers:
  • Baseline Testing: Establish performance benchmarks under minimal load
  • Load Testing: Validate behavior under expected production load
  • Stress Testing: Determine system breaking points (less relevant for SaaS)

Testing Agents via API

API Documentation: Orchestrate Runs API
How to Test:
  1. Execute agent runs and retrieve run information using the Orchestrate Assistant Runs API
  2. Get detailed traces using the searchTraces API and Get Spans for Trace API
For Agents that Use Flows:
  • User requests (human-in-the-loop interactions) must be obtained through the Messages API
  • This API retrieves messages from a thread, including user responses to flow requests
For Automated Testing:
  • Use the APIs programmatically to run multiple test queries
  • Collect performance metrics across runs
  • Calculate statistics (average, median, percentiles)
  • Compare performance across different agent configurations
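A minimal harness for this pattern might look like the following. The agent call is stubbed here, since the real Runs API endpoint and client are deployment-specific — replace `run_agent_query` with an actual API call in your environment.

```python
import statistics
import time

def run_agent_query(query: str) -> float:
    """Placeholder for a call to the Orchestrate Runs API; stubbed so the
    harness runs standalone. Returns the observed latency in seconds."""
    start = time.perf_counter()
    _ = query.upper()  # stand-in for the actual agent invocation
    return time.perf_counter() - start

# Hypothetical test queries; repeat each to get stable statistics.
queries = ["order status for #123", "cancel my subscription", "reset password"]
latencies = [run_agent_query(q) for q in queries for _ in range(5)]

print(f"runs:   {len(latencies)}")
print(f"avg:    {statistics.mean(latencies):.4f}s")
print(f"median: {statistics.median(latencies):.4f}s")
```

Store the per-run latencies rather than only the aggregates, so you can recompute percentiles and compare agent configurations later.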

Testing Agents via Channel-Specific Interfaces

Available Testing Channels:
  • Web chat UI
  • Voice interfaces
  • Slack
  • Other integrated channels
Recommended Testing Approach:
  1. Test via API first - Ensure the agent works correctly through the API before running UI automation tests
  2. Then test via channels - Once API testing is successful, validate the agent through channel-specific interfaces
  3. Use appropriate tools - wxO has no specific tool recommendations; choose tools that fit your testing needs
Why Test API First
  • Isolates agent logic from channel-specific issues
  • Easier to debug and identify root causes
  • Faster test execution and iteration
  • Provides baseline performance metrics
  • Channel tests can then focus on UI/UX-specific concerns

Key Metrics to Track

Speed Metrics

Total Response Time:
  • Total time from query to response
  • Track: Average, Median, 95th percentile, 99th percentile
  • Target: Define based on your SLA requirements
Reasoning Loop Count:
  • Number of thought-action-observation cycles
  • Fewer loops generally mean faster responses
  • Track: Average loops per query type
Tool Call Count:
  • Number of tools invoked per agent run
  • More tool calls = longer execution time
  • Track: Average tool calls per query type
LLM Call Count:
  • Number of LLM inference calls
  • Each call adds latency
  • Track: Total calls per run
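Computing these statistics from collected latencies is straightforward with the standard library. The sample data below is illustrative.

```python
import statistics

# Sample end-to-end latencies in seconds (illustrative data).
latencies = [1.2, 1.4, 1.3, 1.8, 1.5, 2.9, 1.4, 1.6, 1.3, 4.1,
             1.5, 1.7, 1.4, 1.6, 1.2, 1.9, 1.3, 1.5, 2.2, 1.4]

cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
print(f"average: {statistics.mean(latencies):.2f}s")
print(f"median:  {statistics.median(latencies):.2f}s")
print(f"p95:     {cuts[94]:.2f}s")
print(f"p99:     {cuts[98]:.2f}s")
```

Note how the average sits well above the median here: a few slow outliers (the 2.9s and 4.1s runs) pull it up, which is exactly why the tail percentiles matter for SLAs.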

Quality Metrics

Accuracy:
  • Percentage of correct responses
  • Critical: Speed without accuracy is useless
  • Track: Accuracy rate per query type
Relevance:
  • How well the response addresses the query
  • Track: User satisfaction scores
Completeness:
  • Whether the response fully answers the question
  • Track: Follow-up question rate
Consistency:
  • Similar queries should get similar responses
  • Track: Response similarity for equivalent queries
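One simple, non-semantic way to track response similarity is a character-level ratio. This is a rough proxy, not a substitute for semantic evaluation; the example responses are hypothetical.

```python
from difflib import SequenceMatcher

def response_similarity(a: str, b: str) -> float:
    """Rough textual similarity in [0, 1]; a cheap proxy for consistency
    checks, not a semantic comparison."""
    return SequenceMatcher(None, a, b).ratio()

# Two hypothetical agent responses to equivalent queries.
r1 = "Your order #123 shipped on Monday and arrives Thursday."
r2 = "Your order #123 shipped Monday; it arrives on Thursday."
print(f"similarity: {response_similarity(r1, r2):.2f}")
```

A sudden drop in this score for equivalent queries after a configuration change is a cheap early-warning signal that behavior has shifted and deserves a full quality evaluation.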

Cost Metrics

Token Usage:
  • Total tokens consumed per run
  • Directly impacts LLM costs
  • Track: Average tokens per query type
Tool Execution Costs:
  • External API calls and their costs
  • Track: API call count and associated costs
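A quick cost model ties token usage to spend. The price and token counts below are placeholders, not real watsonx rates — substitute your own.

```python
# Rough cost model: tokens per run x price per 1K tokens.
# The rate and token counts are placeholder assumptions, not real prices.
PRICE_PER_1K_TOKENS = 0.002  # assumed, in USD

def run_cost(prompt_tokens: int, completion_tokens: int) -> float:
    return (prompt_tokens + completion_tokens) / 1000 * PRICE_PER_1K_TOKENS

cost = run_cost(prompt_tokens=1_800, completion_tokens=400)
print(f"cost per run:     ${cost:.4f}")
print(f"cost per 10k runs: ${cost * 10_000:.2f}")
```

Multiplying out to realistic run volumes, as above, makes it clear why trimming context size and reasoning loops pays off at scale.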

Agent Optimization Strategies

1. Choose the Right Approach

Different Agent Styles:
Use Case                  Recommended Approach   Why
Simple text generation    Default Agent          Faster, simpler call
Straightforward Q&A       Default Agent          No reasoning loop needed
Complex reasoning         ReAct Agent            Needs multi-step thinking
Tool orchestration        ReAct Agent            Requires tool selection
Multi-step workflows      ReAct Agent            Benefits from reasoning loop

2. Optimize Prompts

Make Prompts Clear and Specific

  • ❌ Vague: “Help the user”
  • ✅ Clear: “Answer customer questions about order status using the order_lookup tool”

Provide Examples

  • Include few-shot examples in the prompt
  • Shows the agent the expected behavior
  • Reduces reasoning loops

Structure Output

  • Request structured responses (JSON, specific format)
  • Reduces generation time
  • Easier to parse
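For example, a format instruction plus a parse step. The schema and response below are hypothetical, to show why structured output is easier to handle downstream.

```python
import json

# A hypothetical structured-output instruction appended to the agent prompt.
FORMAT_INSTRUCTION = (
    "Respond only with JSON matching: "
    '{"order_id": string, "status": string, "eta_days": number}'
)

# What a well-behaved structured response looks like, and how cheaply it
# parses compared to scraping fields out of free-form prose.
raw_response = '{"order_id": "A-123", "status": "shipped", "eta_days": 2}'
parsed = json.loads(raw_response)
print(parsed["status"], parsed["eta_days"])
```

In practice, also handle the failure path (`json.JSONDecodeError`), since even well-prompted models occasionally emit malformed output.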

3. Optimize Guidelines

What are Guidelines: Agent Guidelines are instructions that help shape agent behavior.
Performance Impact
  • Guidelines trigger an LLM call before the agent loop starts
  • This adds latency to every agent invocation
  • The LLM processes guidelines to understand behavioral constraints
Optimization Strategies:
  • Keep guidelines concise: Shorter guidelines = faster processing
  • Use guidelines only when needed: Remove if not providing value
  • Test with and without: Measure impact on response time
  • Balance guidance vs speed: Guidelines improve quality but add latency
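“Test with and without” can be as simple as comparing mean latencies from two runs of the same query set. The numbers below are invented for illustration.

```python
import statistics

# Illustrative latencies (seconds) collected for the same query set with
# and without guidelines enabled; the figures are made up, not measured.
with_guidelines = [2.1, 2.3, 2.0, 2.4, 2.2]
without_guidelines = [1.6, 1.8, 1.7, 1.9, 1.7]

overhead = statistics.mean(with_guidelines) - statistics.mean(without_guidelines)
print(f"mean guidelines overhead: {overhead:.2f}s per invocation")
```

If the measured overhead is large relative to your latency budget, that is the signal to trim the guidelines or drop them for that agent.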

4. Limit Available Tools

Problem: Too many tools slow down tool selection
Solution:
  • Provide only relevant tools for the agent’s purpose
  • Group related tools
  • Use clear, descriptive tool names
Example:
  • ❌ Agent with 25 generic tools
  • ✅ Agent with 5-10 focused, relevant tools

4a. Avoid Deep Agent Hierarchies

Problem: Nested agents multiply LLM inference overhead
Why It Matters:
  • Each nested agent runs its own ReAct loop with multiple LLM calls
  • Deep hierarchies spend more time on reasoning than actual work
  • Agent A → Agent B → Agent C means 3x the reasoning overhead
  • LLM inference time compounds at each level
Solution:
  • Limit nesting depth to 2-3 levels maximum
  • Consider using Flows instead of nested agents for deterministic orchestration
  • Use tools (Python, API) for simple operations instead of wrapping them in agents
  • Flatten agent hierarchies where possible
Example:
  • ❌ Agent → Agent → Agent → Agent (4 levels of reasoning overhead)
  • ✅ Agent → Flow → Tools (reasoning only at top level, deterministic execution below)
  • ✅ Agent → Agent → Tools (2 levels, acceptable for complex scenarios)

5. Choose Appropriate Model

Model Selection Trade-offs:
Model Size   Speed      Capability   Use Case
Small        Fastest    Basic        Simple queries, high volume
Medium       Moderate   Good         General purpose
Large        Slower     Best         Complex reasoning, high accuracy needs
Recommendation: Start with medium models, adjust based on accuracy requirements.
Note on Caching: Caching agent responses can be implemented through pre-agent plugins if needed for your use case.

Agent Quality and Evaluation

Why Quality Matters

Speed vs Quality Trade-off
  • Fast but inaccurate agents provide poor user experience
  • Slow but accurate agents frustrate users
  • Goal: Optimize for both speed AND quality

Agent Evaluation

Evaluation Framework: Quick Evaluation of Agents and Tools
What You Can Evaluate:
  • Agent accuracy on test queries
  • Response quality and relevance
  • Consistency across similar queries
  • Comparison between different agent configurations
Key Point: Always validate agent quality after making performance optimizations to ensure accuracy hasn’t degraded.

Best Practices Summary

Do’s

Measure before optimizing

  • Establish baseline performance
  • Identify actual bottlenecks
  • Track both speed and quality metrics

Use the right approach

  • Generative Prompts for simple tasks
  • Agents for complex reasoning
  • Don’t over-engineer

Optimize prompts

  • Clear, specific instructions
  • Include examples
  • Request structured output

Optimize guidelines

  • Keep guidelines concise
  • Use only when needed
  • Test impact on performance

Limit tool count

  • Provide only relevant tools
  • Use descriptive names
  • Group related functionality

Avoid deep agent hierarchies

  • Limit agent nesting to 2-3 levels maximum
  • Use Flows for deterministic orchestration
  • Use tools instead of agents for simple operations
  • Flatten hierarchies to reduce reasoning overhead

Optimize plugin logic

  • Write efficient plugin code
  • Cache plugin results when appropriate
  • Profile plugin performance

Validate quality

  • Use evaluation framework to validate accuracy
  • Ensure optimizations don’t degrade quality
  • Iterate based on real metrics

Don’ts

Common Pitfalls to Avoid
Don’t provide too many tools
  • Slows down tool selection
  • Increases reasoning complexity
Don’t use verbose guidelines
  • Adds unnecessary LLM processing time
  • Keep guidelines concise and focused
Don’t create deep agent hierarchies
  • Each nested agent adds reasoning overhead
  • Multiplies LLM inference time with each nesting level
  • Limit nesting to 2-3 levels maximum
Don’t ignore quality metrics
  • Speed without accuracy is useless
  • Always validate accuracy after optimization
Don’t optimize without measuring
  • Measure first, optimize second
  • Track impact of each change
Don’t write inefficient plugin logic
  • Plugins run in agent execution path
  • Optimize plugin code for performance

Related Guides

  • Main Performance Guide: comprehensive performance guide overview
  • Flow Performance: flow-specific performance optimization
  • Tool Performance: tool execution performance guide
  • API References