Use the ADK to configure voice features for your agents.
Note: To use voice in watsonx Orchestrate Developer Edition, enable the voice feature by adding the --with-voice flag to the orchestrate server start command. For more information, see Installing watsonx Orchestrate Developer Edition: watsonx Orchestrate server.

Creating voice configurations

To create a voice configuration, first create a YAML file that defines your voice settings. This file includes the voice name, speech-to-text and text-to-speech provider settings, the primary language, and optional configurations for advanced settings.

Supported Providers

Speech-to-Text (STT) Providers:
  • watson_stt: IBM Watson Speech to Text
  • deepgram_stt: Deepgram Speech to Text
Text-to-Speech (TTS) Providers:
  • watson_tts: IBM Watson Text to Speech
  • deepgram_tts: Deepgram Text to Speech
  • elevenlabs_tts: ElevenLabs Text to Speech
Note: Refer to each provider's official documentation for more information on models and parameters.

Configuration Examples

Watson STT/TTS Configuration

Basic Example:
YAML
name: watson_voice_config
speech_to_text:
  provider: watson_stt
  watson_stt_config:
    api_url: "watson_stt_url"
    api_key: "your_watson_stt_api_key"
    model: "en-US_Telephony"
text_to_speech:
  provider: watson_tts
  watson_tts_config:
    api_url: "watson_tts_url"
    api_key: "your_watson_tts_api_key"
    voice: "en-US_AllisonV3Voice"
language: "en-US"
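Before importing, it can help to sanity-check a configuration like the one above. The following is a minimal, illustrative Python sketch (not part of the ADK) that verifies the provider name matches its config block and that the keys used in this document's examples are present; the required-key sets are inferred from those examples, not from an official schema:

```python
# Minimal sanity check for a voice configuration, with required keys
# inferred from the examples in this document. Illustrative only.
REQUIRED_KEYS = {
    "watson_stt": {"api_url", "api_key", "model"},
    "watson_tts": {"api_url", "api_key", "voice"},
    "deepgram_stt": {"api_url", "api_key", "model"},
    "deepgram_tts": {"api_key"},
    "elevenlabs_tts": {"api_key", "model_id", "voice_id"},
}

def check_section(section: dict) -> list[str]:
    """Return a list of problems in one speech_to_text/text_to_speech section."""
    provider = section.get("provider")
    if provider not in REQUIRED_KEYS:
        return [f"unknown provider: {provider!r}"]
    config = section.get(f"{provider}_config")
    if config is None:
        return [f"missing {provider}_config block"]
    return [f"{provider}_config missing key: {key}"
            for key in REQUIRED_KEYS[provider] - config.keys()]

# Example: the basic Watson configuration above, as a Python dict.
config = {
    "name": "watson_voice_config",
    "speech_to_text": {
        "provider": "watson_stt",
        "watson_stt_config": {
            "api_url": "watson_stt_url",
            "api_key": "your_watson_stt_api_key",
            "model": "en-US_Telephony",
        },
    },
    "text_to_speech": {
        "provider": "watson_tts",
        "watson_tts_config": {
            "api_url": "watson_tts_url",
            "api_key": "your_watson_tts_api_key",
            "voice": "en-US_AllisonV3Voice",
        },
    },
    "language": "en-US",
}
assert check_section(config["speech_to_text"]) == []
assert check_section(config["text_to_speech"]) == []
```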
Advanced Example:
YAML
name: watson_voice_config
llm_aggregation_timeout_seconds: 0.8
speech_to_text:
  provider: watson_stt
  watson_stt_config:
    api_url: "https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/your-instance-id"
    api_key: "your_watson_stt_api_key"
    model: "en-US_Telephony"
    background_audio_suppression: 0.5
    language_customization_id: null
    inactivity_timeout: 30
    profanity_filter: true
    smart_formatting: true
    speaker_labels: false
    redaction: false
    low_latency: true
    learning_opt_out: false
    watson_metadata: null
    smart_formatting_version: null
    customization_weight: null
    character_insertion_bias: null
    end_of_phrase_silence_time: 0.8
text_to_speech:
  provider: watson_tts
  watson_tts_config:
    api_url: "https://api.us-south.text-to-speech.watson.cloud.ibm.com/instances/your-instance-id"
    api_key: "your_watson_tts_api_key"
    voice: "en-US_AllisonV3Voice"
    language: "en-US"
    rate_percentage: 0
    pitch_percentage: 0
    customization_id: null
    meta_id: null
    learning_opt_out: false
language: "en-US"

Deepgram STT/TTS Configuration

Basic Example:
YAML
name: deepgram_voice_config
speech_to_text:
  provider: deepgram_stt
  deepgram_stt_config:
    api_url: "wss://api.deepgram.com/v1/listen"
    api_key: "your_deepgram_api_key"
    model: "nova-2"
text_to_speech:
  provider: deepgram_tts
  deepgram_tts_config:
    api_key: "your_deepgram_api_key"
Advanced Example (nova-3 with keyterm):
YAML
name: deepgram_voice_config
llm_aggregation_timeout_seconds: 0.8
speech_to_text:
  provider: deepgram_stt
  deepgram_stt_config:
    api_url: "wss://api.deepgram.com/v1/listen"
    api_key: "your_deepgram_api_key"
    model: "nova-3"
    keyterm: ["help", "search", "Mr. Smith"]  # Only supported for nova-3 models
    mip_opt_out: false
    channels: 1
    diarize: true
    dictation: false
    endpointing: 10
    extra: null
    interim_results: true
    keywords: null  # Only supported for nova-2 models
    language: "en-US"
    multichannel: false
    numerals: true
    profanity_filter: false
    punctuate: true
    redact: null
    replace: ["apple:orange", "orange:apple"]
    search: ["action item", "follow up"]
    smart_format: true
    tag: ["customer_service", "sales_call"]
    utterance_end_ms: 1000
    vad_events: true
    version: "latest"
    # v2/listen endpoint parameters
    eager_eot_threshold: null
    eot_threshold: null
    eot_timeout_ms: null
text_to_speech:
  provider: deepgram_tts
  deepgram_tts_config:
    api_key: "your_deepgram_api_key"
    api_url: "https://api.deepgram.com/v1/speak"
    language: "en"
    model: "aura-2-thalia-en"
    mip_opt_out: false
language: "en-US"

ElevenLabs TTS Configuration

Basic Example:
YAML
name: elevenlabs_tts_config
speech_to_text:
  provider: watson_stt
  watson_stt_config:
    api_url: "https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/your-instance-id"
    api_key: "your_watson_stt_api_key"
    model: "en-US_Telephony"
text_to_speech:
  provider: elevenlabs_tts
  elevenlabs_tts_config:
    api_key: "your_elevenlabs_api_key"
    model_id: "eleven_turbo_v2_5"
    voice_id: "CwhRBWXzGAHq8TQ4Fs17"
Advanced Example:
YAML
name: elevenlabs_tts_config
llm_aggregation_timeout_seconds: 0.8
speech_to_text:
  provider: watson_stt
  watson_stt_config:
    api_url: "example-stt-url"
    api_key: "example-stt-key"
    model: "example-model"
text_to_speech:
  provider: elevenlabs_tts
  elevenlabs_tts_config:
    api_url: "https://api.elevenlabs.io/v1/text-to-speech"
    api_key: "your_elevenlabs_api_key"
    model_id: "eleven_turbo_v2_5"
    voice_id: "CwhRBWXzGAHq8TQ4Fs17"
    language_code: "en"
    apply_text_normalization: "auto"
    optimize_streaming_latency: 3
    apply_language_text_normalization: false
    pronunciation_dictionary_locators: null
    seed: 42
    voice_settings:
      speed: 1.0
      stability: 0.5
      style: 0.0
      similarity_boost: 0.75
      use_speaker_boost: true
language: "en-US"

Advanced Voice Settings

Voice configurations support advanced features for enhanced call handling:

Voice Activity Detection (VAD)

Automatically detect when the user is speaking:
YAML
vad:
  enabled: true
  provider: "silero_vad"
  silero_vad_config:
    confidence: 0.7          # Speech detection confidence threshold (0.0-1.0)
    start_seconds: 0.2       # Time before transitioning to SPEAKING state
    stop_seconds: 0.8        # Time before transitioning to QUIET state
    min_volume: 0.6          # Minimum audio volume threshold (0.0-1.0)

DTMF Input (Dual-Tone Multi-Frequency)

Enable keypad input during voice calls:
YAML
dtmf_input:
  inter_digit_timeout_ms: 2500  # Time to wait for next digit (ms)
  termination_key: "#"          # Key to signal end of input
  maximum_count: 10             # Maximum number of digits
  ignore_speech: true           # Disable speech recognition during DTMF
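The three collection settings interact: digit entry ends when the caller presses the termination key, when the maximum count is reached, or when the gap between key presses exceeds the inter-digit timeout. A small illustrative Python sketch of that collection loop, assuming the settings behave as described above (this is not the ADK's implementation):

```python
def collect_dtmf(events, inter_digit_timeout_ms=2500,
                 termination_key="#", maximum_count=10):
    """Collect digits from (timestamp_ms, key) events, mirroring the
    dtmf_input settings above. Illustrative sketch, not the ADK's code."""
    digits = []
    last_ts = None
    for ts, key in events:
        if last_ts is not None and ts - last_ts > inter_digit_timeout_ms:
            break  # inter-digit timeout expired: stop collecting
        last_ts = ts
        if key == termination_key:
            break  # caller signalled end of input
        digits.append(key)
        if len(digits) >= maximum_count:
            break  # maximum_count digits reached
    return "".join(digits)
```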

User Idle Handler

Handle situations when the user stops responding:
YAML
user_idle_handler:
  enabled: true
  idle_timeout: 7                                      # Timeout in seconds
  idle_max_reprompts: 2                                # Max reprompt attempts
  idle_timeout_message: "Are you still there?"         # Message to play on idle
  idle_hangup_message: ""                              # Message to play before hanging up
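Read together, these settings mean the reprompt message plays once per idle_timeout expiry, up to idle_max_reprompts times, after which the call ends (preceded by idle_hangup_message if one is set). A minimal Python sketch of that message sequence, with the semantics assumed from the field names above:

```python
def idle_messages(idle_max_reprompts=2,
                  idle_timeout_message="Are you still there?",
                  idle_hangup_message=""):
    """Messages played when the user never responds: one reprompt per
    idle_timeout expiry, then an optional hangup message before the call
    ends. Semantics assumed from the field names; illustrative only."""
    messages = [idle_timeout_message] * idle_max_reprompts
    if idle_hangup_message:  # an empty string means hang up silently
        messages.append(idle_hangup_message)
    return messages
```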

Agent Idle Handler

Provide feedback when the agent is processing:
YAML
agent_idle_handler:
  typing_enabled: true                                 # Enable typing sound
  typing_duration_seconds: 5                           # Typing sound duration
  audio_clip_id: "guitar_1"                            # Hold audio clip
  hold_audio_seconds: 15                               # Hold audio duration
  pre_hold_message: "Please wait a moment..."          # Message before hold audio
  hold_message: "Still processing your request..."     # Periodic hold message

Complete Advanced Example

YAML
name: advanced_voice_config
speech_to_text:
  provider: deepgram_stt
  deepgram_stt_config:
    api_url: "wss://api.deepgram.com/v1/listen"
    api_key: "your_deepgram_api_key"
    model: "nova-2"
    language: "en-US"
    numerals: true
    keywords: ["urgent", "important", "priority"]  # nova-2 specific
    mip_opt_out: false
text_to_speech:
  provider: deepgram_tts
  deepgram_tts_config:
    api_key: "your_deepgram_api_key"
    api_url: null
    language: "en"
    model: "aura-2-thalia-en"
    mip_opt_out: false
language: "en-US"

# Advanced Settings
vad:
  enabled: true
  provider: "silero_vad"
  silero_vad_config:
    confidence: 0.7
    start_seconds: 0.2
    stop_seconds: 0.8
    min_volume: 0.6

dtmf_input:
  inter_digit_timeout_ms: 2500
  termination_key: "#"
  maximum_count: 10
  ignore_speech: true

user_idle_handler:
  enabled: true
  idle_timeout: 7
  idle_max_reprompts: 2
  idle_timeout_message: "Are you still there?"
  idle_hangup_message: ""

agent_idle_handler:
  typing_enabled: true
  typing_duration_seconds: 5
  audio_clip_id: "guitar_1"
  hold_audio_seconds: 15
  pre_hold_message: "Please wait a moment..."
  hold_message: "Still processing your request..."

Importing voice configurations

After creating your YAML file, import it using the following command:
BASH
orchestrate voice-configs import --file <path-to-your-voice-file>
Note: The import command creates a new voice configuration or updates an existing one based on the name. If a voice configuration with the same name exists, the command updates it using the new configuration.

Listing voice configurations

To list all available voice configurations:
BASH
orchestrate voice-configs list

Getting voice configuration details

To retrieve details of a specific voice configuration:
BASH
orchestrate voice-configs get --name <config_name>

Exporting voice configurations

To export a voice configuration to a YAML file:
BASH
orchestrate voice-configs export --name <config_name> --output <output_path>

Removing voice configurations

To remove a voice configuration by name or ID:
BASH
orchestrate voice-configs remove --name <config_name>
# OR
orchestrate voice-configs remove --id <config_id>
Note: If both --id and --name are provided, the ID takes precedence.
Important: When you remove a voice configuration, agents using this configuration will no longer have voice capabilities. Make sure you want to remove the configuration before you proceed.