Version 2.0
Last Updated: 2026-04-10
IMPORTANT NOTICE ABOUT PERFORMANCE NUMBERS
This guide focuses on helping you understand how to measure and optimize performance in your wxO solutions. Performance will vary significantly based on your specific workload, configuration, system load, and other factors. Always measure performance in your own environment. The guidance provided helps you understand relative differences between approaches and identify optimization opportunities.
IMPORTANT: Large-Scale Performance Load Testing & Deployment Planning
While wxO is designed to scale elastically, we maintain optimal utilization ratios to operate the system effectively. If you plan to conduct large-scale performance load testing, please notify wxO support at least one week in advance. This allows us to:
  • Ensure the system is properly prepared for your test
  • Monitor system behavior together with you during the test
  • Provide guidance and support throughout the testing process
For successful large-scale deployments, please share with IBM your:
  • Intended rollout schedule
  • Expected user growth projections
  • Anticipated usage patterns
This enables us to proactively monitor the system and ensure optimal performance as your deployment scales.
Contact IBM Support to coordinate your performance testing activities and deployment planning.

About This Guide

This is the restructured version of the wxO Performance Guide, organized by the correct architecture hierarchy:
  • Agents (highest level) - Orchestrate and reason
  • Flow (wxO Agentic Workflow) - Sequences Agents, Tools, and People
  • Tools (execution level) - Flow, Python, Langflow, API, MCP, and Knowledge Tools

Document Structure

This guide is organized into modular sections for easier maintenance and updates:
  1. Introduction and Overview (this document)
  2. Understanding Performance Testing (this document)
  3. Agent Performance - Dedicated document
  4. Flow Performance - Dedicated document
  5. Tool Performance - Dedicated document
  6. Knowledge Performance - Dedicated document
  7. Accessing Performance Data (this document)
  8. Quick Reference (this document)

Introduction and Overview

Purpose of This Guide

This guide helps you understand, measure, and optimize the performance of your agents, flows, and tools in watsonx Orchestrate (wxO). Whether you’re building approval workflows, data processing pipelines, or interactive agents, understanding performance characteristics at each level is essential for delivering excellent user experiences.

Balancing Performance with Quality

⚖️ Speed is Not Everything
In agentic solutions, optimal performance is about finding the right balance between:
  • Response Time: How quickly the system responds
  • Information Relevancy: How well the response addresses the user’s needs
  • Accuracy: How correct and reliable the information is
  • Engagement Experience: How natural and helpful the interaction feels
A faster response that provides irrelevant or inaccurate information creates a poor user experience. Similarly, a highly accurate response that takes too long may frustrate users. This guide helps you optimize response time while maintaining the quality and accuracy that users expect from agentic solutions.

Who Should Use This Guide

This guide is designed for:
  • Developers building agents, flows, and tools
  • Architects designing wxO solutions
  • Technical leads responsible for performance and SLAs
  • DevOps engineers monitoring production systems

Understanding the wxO Architecture

wxO follows a hierarchical architecture where each level has distinct performance characteristics.
[Figure: wxO Architecture]
Key Concepts:
  • Agents are the highest level - they orchestrate workflows and make decisions by using tools
  • Agents can call other Agents: An agent can use another agent as a tool, which in turn can call other agents and tools, creating nested agent hierarchies
  • Tools are the execution layer that agents can use:
    • Flow (wxO Agentic Workflow): A stateful workflow tool that supports long-running tasks and can deterministically sequence Agents, Tools, and People
    • Python Tools: Custom Python functions
    • Langflow Tools: Langflow-based flows
    • API Tools: External service integrations
    • MCP Tools: Model Context Protocol tools
    • Knowledge Tools: Knowledge base and retrieval tools
  • Flow is a Tool: Flow is a special type of tool that provides orchestration capabilities
  • Performance at each level affects the others - e.g., slow tools make flows slow, slow flows make agents slow, and nested agent calls compound performance impacts

What You Can Control and Measure

As a wxO SaaS customer, you have control over application-level performance.
✅ What You Control:
  • Agent configuration and prompts
  • Flow design and structure
  • Tool implementation and efficiency
  • Data mapping strategies
  • User interaction design
  • External API integrations
✅ What You Can Measure:
  • Agent response time and quality
  • Flow execution time and breakdown
  • Tool execution times
  • Task-level performance
  • LLM interaction duration
  • Error rates and success metrics
  • User response times
🔒 What IBM Manages:
  • Infrastructure scaling and availability
  • Platform-level performance
  • Orchestration engine optimization
  • Database performance
  • Network infrastructure

Shared Responsibility Model

In wxO SaaS, performance optimization is a shared responsibility:
Responsibility | You (Customer) | IBM (Platform)
Agent prompt engineering | ✅ |
Flow design efficiency | ✅ |
Tool code optimization | ✅ |
Data mapping strategy | ✅ |
External API performance | ✅ |
Orchestration engine | | ✅
Infrastructure scaling | | ✅
Platform availability | | ✅
Database optimization | | ✅

When to Conduct Performance Testing

Consider performance testing in these situations:
  • Before deployment: Establish baselines and validate SLAs
  • After major changes: Verify performance impact of new features
  • During optimization: Measure improvement from changes
  • Investigating issues: Diagnose slow or failing agents/flows/tools
  • Capacity planning: Understand limits and scaling needs
  • Significant time has elapsed since last baseline: Re-establish an accurate baseline for comparisons

How to Use This Guide

For Understanding Performance:
  1. Start with the architecture overview (above)
  2. Read the relevant section for what you’re optimizing: Agent Performance, Flow Performance, Tool Performance, or Knowledge Performance
For Optimization:
  1. Identify the bottleneck level (agent, flow, or tool)
  2. Read the relevant performance guide (agent, flow, or tool)
  3. Apply level-specific optimizations from that guide
  4. Measure improvement
For Troubleshooting:
  1. Review the relevant performance guide (agent, flow, or tool)
  2. Check the performance testing methodology
  3. Follow the diagnosis and optimization steps
For Quick Reference:
  • Use the Quick Reference section below
  • Check performance baselines
  • Review optimization checklists
This guide focuses on SaaS deployments. Infrastructure is managed by IBM, so we emphasize application-level optimization that you control.

Understanding Performance Testing

Overview

Not all performance testing is the same. Choosing the right type of test helps you answer specific questions about your agents, flows, and tools. This section helps you identify which tests are most relevant for your needs.
CRITICAL: Simulating Realistic User Interactions
Understanding the Stateless Nature of LLM Inference
LLM inference is stateless - each request is independent. When testing agents, it’s crucial to understand how conversation history affects performance.
How Conversation History Works:
  • When a user asks an agent multiple questions in a session, each new question includes the entire conversation history
  • The LLM processes the full context (all previous Q&A pairs + the new question) for every interaction
  • As the conversation grows, both response time and token consumption increase cumulatively
Why This Matters for Performance Testing:
  • Artificially driving high load with unrealistic interaction patterns will produce misleading results
  • Testing with excessive interactions per session creates artificially large contexts that don’t reflect real usage
  • This leads to inflated response times and token consumption that won’t occur in production
Realistic Testing Parameters:
  1. Interactions per Agent Session:
    • Recommended: 3-5 interactions per agent session
    • Why: Real users typically ask a few related questions before starting a new conversation
    • Adjust based on your use case: Analyze your actual user behavior patterns
    • Avoid: Long conversation chains (10+ interactions) unless your use case specifically requires them
  2. Realistic Think Time:
    • Recommended: 10-30 seconds between interactions (depending on complexity)
    • Why: Real users need time to read, understand, and formulate follow-up questions
    • Example: If a user asks an agent for a list of doctors, they will take time to review the results before asking the next question
    • Avoid: Rapid-fire questions every 5 seconds - this doesn’t reflect real user behavior
Impact of Unrealistic Testing:
❌ Unrealistic: 20 interactions per session, 5s think time
   - Creates 20x context accumulation
   - Artificially slow response times
   - Inflated token costs
   - Does not represent production behavior

✅ Realistic: 3-5 interactions per session, 15-20s think time
   - Matches actual user behavior
   - Accurate performance metrics
   - Realistic token consumption
   - Valid basis for capacity planning
Best Practice: Before conducting performance tests, analyze your actual user interaction patterns (if available) or use industry-standard assumptions for your use case type. This ensures your test results are meaningful and actionable.
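The context accumulation described above can be made concrete with a little arithmetic. The sketch below is a simplified model, not a measurement: it assumes a fixed system prompt and fixed question and answer sizes (all token counts are illustrative assumptions, not wxO values) and computes the prompt size sent to the LLM at each turn.

```python
def prompt_tokens_per_turn(turns, question_tokens=50, answer_tokens=300, system_tokens=500):
    """Return the prompt size (in tokens) sent to the LLM at each turn.

    Each turn resends the system prompt, every earlier question/answer pair,
    and the new question. All token counts are illustrative assumptions.
    """
    sizes = []
    history_tokens = 0
    for _ in range(turns):
        sizes.append(system_tokens + history_tokens + question_tokens)
        # The completed question/answer pair becomes history for the next turn.
        history_tokens += question_tokens + answer_tokens
    return sizes


realistic = prompt_tokens_per_turn(5)
unrealistic = prompt_tokens_per_turn(20)

print("Last prompt, 5-turn session: ", realistic[-1])
print("Last prompt, 20-turn session:", unrealistic[-1])
print("Total prompt tokens, 5 turns: ", sum(realistic))
print("Total prompt tokens, 20 turns:", sum(unrealistic))
```

Under these assumed sizes, the final prompt of a 20-turn session is 7,200 tokens compared with 1,950 for a 5-turn session, and the total prompt tokens processed grow from 6,250 to 77,500. Replace the defaults with token counts observed in your own traces to estimate the effect for your agent.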

Tests Relevant for SaaS Customers

1. Baseline Testing

Purpose: Establish inherent performance characteristics under minimal load
When to Use:
  • Initial development and deployment
  • After significant code changes
  • Creating performance benchmarks
  • Comparing different implementation approaches
Configuration:
  • Users: 1-10 concurrent users
  • Duration: 15-30 minutes
  • Ramp-up: Immediate
  • Think Time: 10-30 seconds (realistic user behavior)
  • Interactions per Session: 3-5 interactions per agent session
What You’ll Learn:
  • Minimum execution time for your agent/flow/tool
  • Individual component performance
  • Baseline for comparison
Success Criteria:
- Components complete successfully
- Response times are consistent
- No errors or timeouts
- Metrics match expectations
Example Scenarios:
  • “I’ve built an agent with 3 tools. I want to know how fast it runs with a single user.”
  • “I’ve created a new Python tool. I want to measure its baseline execution time.”
  • “I’ve designed a flow with 5 tasks. I want to establish performance benchmarks.”
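A baseline session with realistic think time might be driven by a small script like the sketch below. It is a minimal illustration under stated assumptions: `send_message` is a hypothetical callable that wraps however you invoke your agent (it is not a wxO API), and the `sleep` parameter exists only so the pause can be replaced in dry runs.

```python
import random
import statistics
import time


def run_session(send_message, questions, think_time_range=(10, 30), sleep=time.sleep):
    """Send each question in order, pausing a random think time between them.

    send_message is a hypothetical callable (question -> response) that you
    supply; it is an assumption of this sketch, not part of wxO.
    Returns the per-interaction latencies in seconds.
    """
    latencies = []
    for index, question in enumerate(questions):
        start = time.perf_counter()
        send_message(question)
        latencies.append(time.perf_counter() - start)
        # Think time applies between interactions, not after the last one.
        if index < len(questions) - 1:
            sleep(random.uniform(*think_time_range))
    return latencies


def summarize(latencies):
    """Basic baseline statistics for one or more sessions."""
    return {
        "count": len(latencies),
        "min": min(latencies),
        "median": statistics.median(latencies),
        "max": max(latencies),
    }


# Dry run with a stand-in agent and no real waiting.
demo_latencies = run_session(lambda question: "ok", ["q1", "q2", "q3"], sleep=lambda seconds: None)
print(summarize(demo_latencies))
```

Keeping each session to 3-5 questions and leaving the default 10-30 second think time in place matches the baseline configuration above.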

2. Load Testing

Purpose: Validate system behavior under expected production load
When to Use:
  • Before production deployment
  • Validating SLA compliance
  • Testing with realistic user volumes
  • Verifying concurrent execution handling
Configuration:
  • Users: Expected peak concurrent users
  • Duration: 30-60 minutes
  • Ramp-up: 10-15 minutes
  • Think Time: 10-30 seconds (realistic user behavior - adjust based on use case complexity)
  • Interactions per Session: 3-5 interactions per agent session
What You’ll Learn:
  • Performance under realistic load
  • Concurrent execution behavior
  • Error rates at scale
  • Whether you meet SLA requirements
Success Criteria:
- Meets SLA requirements (e.g., 95th percentile < target)
- Error rate < 1%
- No timeout failures
- Consistent performance across duration
Example Scenario:
“I expect 50 concurrent users during peak hours. I want to verify my agent can handle this load while maintaining acceptable response times.”
Important Considerations for SaaS:
  • IBM manages infrastructure scaling automatically
  • Focus on application-level performance
  • Contact IBM support if hitting platform limits

Test Case Selection Criteria

Choose test cases based on these criteria:

Transaction-Based Selection

Focus on high-frequency operations:
  • Search and query operations
  • Status checks and monitoring
  • Real-time data retrieval
  • Frequently accessed workflows
Example Test Cases:
✓ Document search flow
✓ Status query tools
✓ Notification processing
✓ Data synchronization flows

Business-Critical Operations

Focus on operations with business impact:
  • Revenue-generating workflows (payments, orders)
  • Customer-facing operations (support, onboarding)
  • SLA-bound processes (response time commitments)
  • Compliance-required operations (audit, security)
Example Test Cases:
✓ Purchase approval agents
✓ Customer onboarding flows
✓ Support ticket routing
✓ Contract generation
✓ Compliance reporting

Resource-Intensive Operations

Focus on operations that consume significant resources:
  • Complex agent interactions with multiple reasoning steps
  • Document processing and extraction
  • Large data transformations
  • Multiple external API calls
  • Long-running workflows
Example Test Cases:
✓ Document analysis agents
✓ Multi-step reasoning agents
✓ Batch data processing flows
✓ Complex approval chains
✓ Integration-heavy workflows

What About Stress and Endurance Testing?

Stress Testing (finding breaking points) and Endurance Testing (long-running stability) are not the responsibility of customers in wxO SaaS:
  • IBM conducts these tests as part of ongoing performance validation of the wxO platform
  • IBM manages infrastructure capacity and scaling
  • Platform automatically handles load distribution
Customer Focus:
  • Focus your efforts on application-level optimization (baseline and load testing)
  • If you suspect platform-level issues, contact IBM support to investigate infrastructure performance

Triggering Agents for Performance Testing

Overview

Before you can access performance data, you need to trigger your agents to generate test data. wxO provides two primary methods for executing agents:

Method 1: Chat UI

Best for: Interactive testing and manual validation
The wxO Chat UI provides a user-friendly interface for:
  • Testing agent responses interactively
  • Validating agent behavior with real conversations
  • Exploring different user inputs and scenarios
Access: Available through the wxO web interface

Method 2: Orchestrate Runs API

Best for: Baseline testing and load testing
The Orchestrate Runs API allows you to:
  • Execute agents programmatically
  • Automate performance testing
  • Run baseline tests (1-10 concurrent users)
  • Run load tests with multiple concurrent requests
  • Integrate agent testing into CI/CD pipelines
API Documentation: Orchestrate Runs API
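For programmatic load tests, concurrent users are usually started gradually over the ramp-up window rather than all at once. The sketch below shows one way to do that with a thread pool. `run_user_session` is a hypothetical callable that would wrap your own calls to the Orchestrate Runs API; its name and signature are assumptions, and no request format is implied here.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def start_offsets(users, ramp_up_s):
    """Evenly spread user start times (in seconds) across the ramp-up window."""
    if users <= 1 or ramp_up_s <= 0:
        return [0.0] * users
    step = ramp_up_s / (users - 1)
    return [index * step for index in range(users)]


def run_load(run_user_session, users, ramp_up_s, sleep=time.sleep):
    """Run `users` concurrent sessions, staggered over the ramp-up window.

    run_user_session is a hypothetical callable (user_index -> list of
    latencies) that you supply. Returns all latencies as one flat list.
    """
    offsets = start_offsets(users, ramp_up_s)

    def delayed_session(index):
        sleep(offsets[index])  # wait for this user's slot in the ramp-up
        return run_user_session(index)

    with ThreadPoolExecutor(max_workers=users) as pool:
        per_user = list(pool.map(delayed_session, range(users)))
    return [latency for latencies in per_user for latency in latencies]


# Five users ramped up over 10 minutes start 150 seconds apart.
print(start_offsets(5, 600))
```

A baseline test corresponds to `ramp_up_s=0` with 1-10 users; a load test uses your expected peak user count with a 10-15 minute ramp-up.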

Detailed Testing Guidance

For comprehensive information on agent testing methodology, test scenarios, and best practices, see the Agent Performance guide.

Accessing Performance Data

Overview

wxO provides multiple methods for accessing performance data:
  1. Built-in Monitoring UI - Visual interface for exploring traces
  2. Programmatic API Access - Retrieve traces via REST API for automation
  3. Trace CLI - Command-line interface for trace analysis (wxO ADK v2.4.0+)
All methods provide access to agent, flow, and tool performance data.

Method 1: Built-in Monitoring UI

Documentation: IBM watsonx Orchestrate - Monitoring Agents

Method 2: Programmatic API Access

There is no direct correlation between agent runs and trace IDs. To retrieve traces:
  1. Execute your agent using the Orchestrate Runs API or another method
  2. Search for traces using the searchTraces API by time range and agent ID to find the trace_id
  3. Retrieve trace spans using the Get Spans for Trace API
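The three steps above can be sketched as a small helper. This is an outline only: `search_traces` and `get_spans` are hypothetical wrappers that you would write around the searchTraces and Get Spans for Trace APIs, and their signatures, the `trace_id` key, and the padding value are all assumptions rather than documented request or response shapes.

```python
from datetime import datetime, timedelta, timezone


def find_trace_spans(search_traces, get_spans, agent_id, run_started, run_finished,
                     padding=timedelta(seconds=30)):
    """Locate traces for a run by time window and agent ID, then fetch spans.

    search_traces(agent_id, start, end) -> list of dicts with a "trace_id" key
    get_spans(trace_id) -> list of span dicts
    Both callables and their shapes are assumptions of this sketch.
    """
    # Widen the window slightly to tolerate clock differences.
    window_start = run_started - padding
    window_end = run_finished + padding
    traces = search_traces(agent_id, window_start, window_end)
    return {trace["trace_id"]: get_spans(trace["trace_id"]) for trace in traces}


# Dry run with stand-in functions in place of real API calls.
started = datetime(2026, 4, 10, 12, 0, tzinfo=timezone.utc)
finished = started + timedelta(seconds=20)
spans_by_trace = find_trace_spans(
    search_traces=lambda agent_id, start, end: [{"trace_id": "example-trace"}],
    get_spans=lambda trace_id: [{"name": "example-span"}],
    agent_id="example-agent",
    run_started=started,
    run_finished=finished,
)
print(spans_by_trace)
```

Recording the start and finish time of every test run makes the time-range search in step 2 much narrower, which matters when many runs share the same agent ID.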

Method 3: Trace CLI

Documentation: Traces with CLI
Available in: wxO ADK v2.4.0 and later
For detailed usage and examples, see the Trace CLI documentation.

Quick Reference

Performance Characteristics by Level

Agent Performance

What to Measure:
  • Total response time
  • Number of reasoning loops
  • Tool calls made
  • Success rate and accuracy
Optimization Focus:
  • Prompt engineering
  • Tool selection strategy
  • Context management
  • Model selection

Flow Performance

What to Measure:
  • Total execution time
  • Per-task execution time
  • Orchestration overhead
  • User interaction time
Optimization Focus:
  • Task count reduction
  • Data mapping strategy
  • Parallel execution
  • User interaction design

Tool Performance

Characteristics:
  • Python tools: Typically sub-second for simple operations
  • Langflow tools: At least 2 seconds due to Langflow initialization
  • API tools: Varies based on external service
  • Knowledge tools: Varies based on repository type and search strategy
  • Timeout limit: 2 minutes maximum
What to Measure:
  • Individual tool execution time
  • External API latency
  • Knowledge retrieval time
Optimization Focus:
  • Algorithm efficiency
  • API call batching
  • Tool type selection
  • Knowledge search strategy

Common Pitfalls

Agent Level:
  • ❌ Using Agents for simple text generation → ✅ Use Generative Prompts in Flow
  • ❌ Too many tools available → ✅ Limit to relevant tools only
  • ❌ Vague prompts → ✅ Use specific, clear prompts
  • ❌ Ignoring accuracy metrics → ✅ Balance speed and quality
  • ❌ Extensive guidelines → ✅ Keep guidelines concise and focused
  • ❌ Inefficient pre/post plugins → ✅ Optimize plugin code for performance
Flow Level:
  • ❌ Using auto-mapping for well-defined schemas → ✅ Use explicit data mapping for known, stable schemas (auto-mapping provides robust tool/agent response handling but adds latency)
  • ❌ Too many small tasks → ✅ Combine related operations
  • ❌ Not pre-filling forms → ✅ Use context to populate fields
  • ❌ Sequential when parallel possible → ✅ Run independent tasks concurrently
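To estimate how much running independent tasks concurrently could save, compare the sum of task durations with the longest dependency chain. The sketch below is a simplified model that ignores orchestration overhead; the task names and durations are invented for illustration.

```python
def sequential_time(durations):
    """Total time when every task runs one after another."""
    return sum(durations.values())


def critical_path_time(durations, depends_on):
    """Total time when tasks start as soon as their dependencies finish.

    durations: {task: seconds}
    depends_on: {task: [tasks that must finish first]}
    """
    finish = {}

    def finish_time(task):
        if task not in finish:
            earliest_start = max(
                (finish_time(dep) for dep in depends_on.get(task, [])), default=0.0
            )
            finish[task] = earliest_start + durations[task]
        return finish[task]

    return max((finish_time(task) for task in durations), default=0.0)


# Hypothetical flow: three independent lookups feed one summary task.
durations = {"fetch_customer": 2.0, "fetch_orders": 3.0, "check_credit": 1.5, "summarize": 4.0}
depends_on = {"summarize": ["fetch_customer", "fetch_orders", "check_credit"]}

print("Sequential:", sequential_time(durations))
print("Parallel:  ", critical_path_time(durations, depends_on))
```

In this example the sequential design takes 10.5 seconds, while running the three lookups concurrently brings the estimate down to 7.0 seconds (the slowest lookup plus the summary). Substitute per-task times from your own traces to see whether restructuring is worth the effort.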
Tool Level:
  • ❌ Using Langflow for simple deterministic logic → ✅ Use Python tools or a code block in a Flow
  • ❌ Multiple sequential API calls → ✅ Batch into single call
  • ❌ Trying to persist state → ✅ Use external state management
  • ❌ Ignoring 2-minute timeout → ✅ Design for timeout constraints
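One way to design for the 2-minute timeout without persisting state inside the tool is to work against a time budget and hand a resume position back to the caller. The sketch below illustrates the idea in plain Python; the safety margin and the function and parameter names are assumptions, and the wxO tool registration itself is omitted.

```python
import time

TOOL_TIMEOUT_S = 120   # 2-minute limit stated in this guide
SAFETY_MARGIN_S = 15   # assumed headroom for returning the response


def process_with_budget(items, handle, start_index=0,
                        budget_s=TOOL_TIMEOUT_S - SAFETY_MARGIN_S,
                        clock=time.monotonic):
    """Process items until the time budget is spent.

    Returns (results, next_index). next_index is None when everything was
    processed; otherwise the caller can invoke the tool again starting
    from next_index, so no state has to live inside the tool.
    """
    deadline = clock() + budget_s
    results = []
    for index in range(start_index, len(items)):
        if clock() >= deadline:
            return results, index
        results.append(handle(items[index]))
    return results, None


results, next_index = process_with_budget([1, 2, 3], lambda item: item * 2)
print(results, next_index)
```

Because the resume position travels in the tool's input and output, the calling flow or agent decides whether to continue, which keeps the tool itself stateless.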

Key Performance Factors

Agent Performance Factors:
  • Prompt complexity and clarity
  • Number of available tools
  • Reasoning loop iterations
  • LLM model selection
  • Context size
  • Guidelines (extensive guidelines can add latency)
Flow Performance Factors:
  • Number of tasks
  • Data mapping strategy
  • Task dependencies
  • User interaction design
  • Agent context retrieval
Tool Performance Factors:
  • Tool type
  • Algorithm efficiency
  • External API latency
  • Data volume
  • Custom caching strategy (tool developers can implement caching in external services)
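A custom caching strategy for tools is commonly built as a cache-aside lookup against an external store. The sketch below shows the pattern with a time-to-live. `DictStore` is only an in-memory stand-in so the example runs on its own; in a real tool it would be replaced by a client for an external cache service, since tools should not rely on state persisting between calls. All names here are illustrative assumptions.

```python
import json
import time


class DictStore:
    """In-memory stand-in for an external cache service (illustration only)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def cached_call(store, key, fetch, ttl_s=300, clock=time.time):
    """Return a cached value if it is still fresh, otherwise fetch and store it.

    store needs get(key) and set(key, value) methods; fetch is a zero-argument
    callable that performs the slow operation (for example, an external API call).
    """
    raw = store.get(key)
    if raw is not None:
        entry = json.loads(raw)
        if clock() - entry["stored_at"] < ttl_s:
            return entry["value"]
    value = fetch()
    store.set(key, json.dumps({"value": value, "stored_at": clock()}))
    return value


store = DictStore()
first = cached_call(store, "rates:EUR", lambda: {"rate": 1.1})
second = cached_call(store, "rates:EUR", lambda: {"rate": 9.9})  # served from cache
print(first, second)
```

Choose the time-to-live based on how stale the data may be: a long value cuts external API latency for repeated lookups but risks returning outdated results.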


Next Steps

  1. Understand your architecture: Review the hierarchy above
  2. Identify your bottleneck: Agent, Flow, Tool, or Knowledge level?
  3. Read the relevant guide: Agent Performance, Flow Performance, Tool Performance, or Knowledge Performance
  4. Apply optimizations: Use strategies from the relevant performance guide
  5. Monitor and iterate: Continuous improvement