Use the ADK to configure voice features for your agents.
Note: To use voice in watsonx Orchestrate Developer Edition, enable the voice feature by adding the --with-voice flag to the orchestrate server start command. For more information, see Installing watsonx Orchestrate Developer Edition: watsonx Orchestrate server.
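As the note describes, the Developer Edition exposes voice features only when the server is started with the --with-voice flag. A minimal sketch of the startup command (taken from the note; any other startup options your environment requires are unchanged):

BASH
orchestrate server start --with-voice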

Creating voice configurations

To create a voice configuration, first create a YAML file that defines your voice settings. This file includes the voice name, speech-to-text and text-to-speech provider settings, the primary language, and optional configurations for additional languages and advanced settings.
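Before looking at provider-specific examples, it can help to see the overall shape of the file. The sketch below shows only the top-level keys described above; the placeholder provider names are filled in by the examples that follow:

YAML
name: my_voice_config            # Unique name, used on import and update
speech_to_text:
  provider: <stt_provider>       # e.g. watson_stt or deepgram_stt
  # <provider>_config block with API URL, key, and model
text_to_speech:
  provider: <tts_provider>       # e.g. watson_tts, deepgram_tts, or elevenlabs_tts
  # <provider>_config block with API key and voice
language: "en-US"                # Primary language
# Optional sections: additional_languages, vad, dtmf_input,
# user_idle_handler, agent_idle_handler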

Supported Providers

Speech-to-Text (STT) Providers:
  • watson_stt: IBM Watson Speech to Text
  • deepgram_stt: Deepgram Speech Recognition
Text-to-Speech (TTS) Providers:
  • watson_tts: IBM Watson Text to Speech
  • deepgram_tts: Deepgram Text to Speech
  • elevenlabs_tts: ElevenLabs Text to Speech

Basic Configuration Examples

Watson STT/TTS Configuration

YAML
name: watson_voice_config
speech_to_text:
  provider: watson_stt
  watson_stt_config:
    api_url: "https://api.us-south.speech-to-text.watson.cloud.ibm.com"
    api_key: "your_watson_stt_api_key"
    model: "en-US_Telephony"
text_to_speech:
  provider: watson_tts
  watson_tts_config:
    api_url: "https://api.us-south.text-to-speech.watson.cloud.ibm.com"
    api_key: "your_watson_tts_api_key"
    voice: "en-US_AllisonV3Voice"
    rate_percentage: 0
    pitch_percentage: 0
language: "en-US"

Deepgram STT/TTS Configuration

YAML
name: deepgram_voice_config
speech_to_text:
  provider: deepgram_stt
  deepgram_stt_config:
    api_url: "wss://api.deepgram.com/v1/listen"
    api_key: "your_deepgram_api_key"
    model: "nova-2"
    language: "en-US"
    numerals: true
    mip_opt_out: false
text_to_speech:
  provider: deepgram_tts
  deepgram_tts_config:
    api_key: "your_deepgram_api_key"
    language: "en"
    voice: "aura-asteria-en"
    mip_opt_out: false
language: "en-US"

ElevenLabs TTS Configuration

YAML
name: elevenlabs_voice_config
speech_to_text:
  provider: watson_stt
  watson_stt_config:
    api_url: "https://api.us-south.speech-to-text.watson.cloud.ibm.com"
    api_key: "your_watson_stt_api_key"
    model: "en-US_Telephony"
text_to_speech:
  provider: elevenlabs_tts
  elevenlabs_tts_config:
    api_key: "your_elevenlabs_api_key"
    model_id: "eleven_turbo_v2_5"
    voice_id: "21m00Tcm4TlvDq8ikWAM"
    language_code: "en"
    apply_text_normalization: "auto"
    voice_settings:
      speed: 1.0
      stability: 0.5
      style: 0.0
      similarity_boost: 0.75
      use_speaker_boost: true
language: "en-US"

Advanced Voice Settings

Voice configurations support advanced features for enhanced call handling:

Voice Activity Detection (VAD)

Automatically detect when the user is speaking:
YAML
vad:
  enabled: true
  provider: "silero_vad"
  silero_vad_config:
    confidence: 0.7          # Speech detection confidence threshold (0.0-1.0)
    start_seconds: 0.2       # Time before transitioning to SPEAKING state
    stop_seconds: 0.8        # Time before transitioning to QUIET state
    min_volume: 0.6          # Minimum audio volume threshold (0.0-1.0)

DTMF Input (Dual-Tone Multi-Frequency)

Enable keypad input during voice calls:
YAML
dtmf_input:
  inter_digit_timeout_ms: 2500  # Time to wait for next digit (ms)
  termination_key: "#"          # Key to signal end of input
  maximum_count: 10             # Maximum number of digits
  ignore_speech: true           # Disable speech recognition during DTMF

User Idle Handler

Handle situations when the user stops responding:
YAML
user_idle_handler:
  enabled: true
  idle_timeout: 7                                      # Timeout in seconds
  idle_max_reprompts: 2                                # Max reprompt attempts
  idle_timeout_message: "Are you still there?"         # Message to play

Agent Idle Handler

Provide feedback when the agent is processing:
YAML
agent_idle_handler:
  typing_enabled: true                                 # Enable typing sound
  typing_duration_seconds: 5                           # Typing sound duration
  audio_clip_id: "guitar_1"                            # Hold audio clip ID
  hold_audio_seconds: 15                               # Hold audio duration
  pre_hold_message: "Please wait a moment..."          # Message before hold audio
  hold_message: "Still processing your request..."     # Periodic hold message

Complete Advanced Example

YAML
name: advanced_voice_config
speech_to_text:
  provider: deepgram_stt
  deepgram_stt_config:
    api_url: "wss://api.deepgram.com/v1/listen"
    api_key: "your_deepgram_api_key"
    model: "nova-2"
    language: "en-US"
    numerals: true
    mip_opt_out: false
text_to_speech:
  provider: deepgram_tts
  deepgram_tts_config:
    api_key: "your_deepgram_api_key"
    language: "en"
    voice: "aura-asteria-en"
    mip_opt_out: false
language: "en-US"
additional_languages:
  es:
    text_to_speech:
      provider: watson_tts
      watson_tts_config: 
        api_url: "example-tts.url"
        api_key: "example tts key"
        voice: "example voice"
    speech_to_text:
      provider: watson_stt
      watson_stt_config: 
        api_url: "example-stt.url"
        api_key: "example stt key"
        model: "example model"

# Advanced Settings
vad:
  enabled: true
  provider: "silero_vad"
  silero_vad_config:
    confidence: 0.7
    start_seconds: 0.2
    stop_seconds: 0.8
    min_volume: 0.6

dtmf_input:
  inter_digit_timeout_ms: 2500
  termination_key: "#"
  maximum_count: 10
  ignore_speech: true

user_idle_handler:
  enabled: true
  idle_timeout: 7
  idle_max_reprompts: 2
  idle_timeout_message: "Are you still there?"

agent_idle_handler:
  typing_enabled: true
  typing_duration_seconds: 5
  audio_clip_id: "guitar_1"
  hold_audio_seconds: 15
  pre_hold_message: "Please wait a moment..."
  hold_message: "Still processing your request..."

Importing voice configurations

After creating your YAML file, import it using the following command:
BASH
orchestrate voice-configs import --file <path-to-your-voice-file>
Note: The import command creates a new voice configuration or updates an existing one based on the name. If a voice configuration with the same name exists, the command updates it using the new configuration.
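A quick way to confirm that an import succeeded is to follow it with the list command. For example (the file name here is a placeholder):

BASH
orchestrate voice-configs import --file watson_voice_config.yaml
orchestrate voice-configs list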

Listing voice configurations

To list all available voice configurations:
BASH
orchestrate voice-configs list

Getting voice configuration details

To retrieve details of a specific voice configuration:
BASH
orchestrate voice-configs get --name <config_name>

Exporting voice configurations

To export a voice configuration to a YAML file:
BASH
orchestrate voice-configs export --name <config_name> --output <output_path>

Removing voice configurations

To remove a voice configuration:
BASH
orchestrate voice-configs remove --name <config_name>
Important: When you remove a voice configuration, agents that use this configuration lose their voice capabilities. Verify that the configuration is no longer needed before you proceed.