Use the ADK to configure voice features for your agents.
Creating voice configurations
To create a voice configuration, first create a YAML file that defines your voice settings. This file includes the voice name, speech-to-text and text-to-speech provider settings, the primary language, and optional configurations for additional languages and advanced settings.
Supported Providers
Speech-to-Text (STT) Providers:
watson_stt: IBM Watson Speech to Text (see the model documentation)
deepgram_stt: Deepgram Speech Recognition
Text-to-Speech (TTS) Providers:
watson_tts: IBM Watson Text to Speech (see the voice documentation)
deepgram_tts: Deepgram Text to Speech
elevenlabs_tts: ElevenLabs Text to Speech
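The provider identifiers listed above can be checked against a configuration before it is imported. The following is a minimal Python sketch, not part of the ADK: the function name `check_providers` is hypothetical, the input is assumed to be a voice configuration already parsed from YAML into a `dict`, and the rule that each provider needs a matching `<provider>_config` block is inferred from the examples on this page.

```python
# Provider identifiers taken from the lists above.
SUPPORTED_STT = {"watson_stt", "deepgram_stt"}
SUPPORTED_TTS = {"watson_tts", "deepgram_tts", "elevenlabs_tts"}


def check_providers(config: dict) -> list[str]:
    """Return a list of problems found in the provider settings (empty if none)."""
    problems = []
    for section, supported in (("speech_to_text", SUPPORTED_STT),
                               ("text_to_speech", SUPPORTED_TTS)):
        provider = config.get(section, {}).get("provider")
        if provider is None:
            problems.append(f"{section}: missing provider")
        elif provider not in supported:
            problems.append(f"{section}: unsupported provider {provider!r}")
        elif f"{provider}_config" not in config[section]:
            # Assumption: every provider is paired with a "<provider>_config" block.
            problems.append(f"{section}: missing {provider}_config block")
    return problems
```

An empty list means both sections name a supported provider and carry the matching settings block.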
Basic Configuration Examples
Watson STT/TTS Configuration
name: watson_voice_config
speech_to_text:
  provider: watson_stt
  watson_stt_config:
    api_url: "https://api.us-south.speech-to-text.watson.cloud.ibm.com"
    api_key: "your_watson_stt_api_key"
    model: "en-US_Telephony"
text_to_speech:
  provider: watson_tts
  watson_tts_config:
    api_url: "https://api.us-south.text-to-speech.watson.cloud.ibm.com"
    api_key: "your_watson_tts_api_key"
    voice: "en-US_AllisonV3Voice"
    rate_percentage: 0
    pitch_percentage: 0
language: "en-US"
Watson STT Configuration Parameters:
api_url: Watson Speech to Text service URL
api_key: Watson Speech to Text API key
model: Speech recognition model (e.g., "en-US_Telephony")
Watson TTS Configuration Parameters:
api_url: Watson Text to Speech service URL
api_key: Watson Text to Speech API key
voice: Voice model (e.g., "en-US_AllisonV3Voice")
rate_percentage: Speech rate adjustment percentage
pitch_percentage: Speech pitch adjustment percentage
language: Language code for the voice
Deepgram STT/TTS Configuration
name: deepgram_voice_config
speech_to_text:
  provider: deepgram_stt
  deepgram_stt_config:
    api_url: "wss://api.deepgram.com/v1/listen"
    api_key: "your_deepgram_api_key"
    model: "nova-2"
    language: "en-US"
    numerals: true
    mip_opt_out: false
text_to_speech:
  provider: deepgram_tts
  deepgram_tts_config:
    api_key: "your_deepgram_api_key"
    language: "en"
    voice: "aura-asteria-en"
    mip_opt_out: false
language: "en-US"
Deepgram STT Configuration Parameters:
api_url: Deepgram WebSocket API URL (e.g., "wss://api.deepgram.com/v1/listen")
model: Speech recognition model (e.g., "nova-2")
language: Language code (e.g., "en-US")
numerals: Convert spoken numbers to numerals
mip_opt_out: Opt out of the model improvement program
Deepgram TTS Configuration Parameters:
language: Language code (e.g., "en")
voice: Voice model (e.g., "aura-asteria-en")
mip_opt_out: Opt out of the model improvement program
ElevenLabs TTS Configuration
name: elevenlabs_voice_config
speech_to_text:
  provider: watson_stt
  watson_stt_config:
    api_url: "https://api.us-south.speech-to-text.watson.cloud.ibm.com"
    api_key: "your_watson_stt_api_key"
    model: "en-US_Telephony"
text_to_speech:
  provider: elevenlabs_tts
  elevenlabs_tts_config:
    api_key: "your_elevenlabs_api_key"
    model_id: "eleven_turbo_v2_5"
    voice_id: "21m00Tcm4TlvDq8ikWAM"
    language_code: "en"
    apply_text_normalization: "auto"
    voice_settings:
      speed: 1.0
      stability: 0.5
      style: 0.0
      similarity_boost: 0.75
      use_speaker_boost: true
language: "en-US"
ElevenLabs TTS Configuration Parameters:
model_id: TTS model ID (e.g., "eleven_turbo_v2_5")
language_code: Language code (e.g., "en")
apply_text_normalization: Text normalization mode ("auto", "on", "off")
voice_settings: Advanced voice customization settings
voice_settings.speed: Speech speed (default: 1.0)
voice_settings.stability: Voice stability (0.0-1.0, default: 0.5)
voice_settings.style: Voice style exaggeration (0.0-1.0, default: 0.0)
voice_settings.similarity_boost: Voice similarity enhancement (0.0-1.0, default: 0.75)
voice_settings.use_speaker_boost: Enable speaker boost (default: true)
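Because several `voice_settings` values are documented with a 0.0-1.0 range, a small pre-import check can catch typos early. This is an illustrative Python sketch only: the helper name `out_of_range_settings` is hypothetical, and it assumes the settings have already been loaded into a `dict`.

```python
# Fields documented above with a 0.0-1.0 range.
# `speed` has no documented range on this page, so it is deliberately not checked.
UNIT_RANGE_FIELDS = ("stability", "style", "similarity_boost")


def out_of_range_settings(voice_settings: dict) -> list[str]:
    """Return the names of voice_settings fields that fall outside 0.0-1.0."""
    return [
        name for name in UNIT_RANGE_FIELDS
        if name in voice_settings and not 0.0 <= voice_settings[name] <= 1.0
    ]
```

Fields that are absent are skipped, on the assumption that the documented defaults apply.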
Advanced Voice Settings
Voice configurations support advanced features for enhanced call handling:
Voice Activity Detection (VAD)
Automatically detect when the user is speaking:
vad:
  enabled: true
  provider: "silero_vad"
  silero_vad_config:
    confidence: 0.7      # Speech detection confidence threshold (0.0-1.0)
    start_seconds: 0.2   # Time before transitioning to SPEAKING state
    stop_seconds: 0.8    # Time before transitioning to QUIET state
    min_volume: 0.6      # Minimum audio volume threshold (0.0-1.0)
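To make the interaction of these four settings concrete, the sketch below models them as a two-state machine in Python. It is not the Silero VAD implementation and not ADK code: the function `vad_states`, the fixed 100 ms frame size, and the rule that a frame counts as speech only when it meets both the confidence and the volume threshold are all assumptions made for illustration.

```python
def vad_states(frames, frame_ms=100, confidence=0.7,
               start_seconds=0.2, stop_seconds=0.8, min_volume=0.6):
    """Yield "SPEAKING" or "QUIET" for each (confidence, volume) frame."""
    # Work in whole milliseconds to avoid floating-point accumulation errors.
    start_ms = round(start_seconds * 1000)
    stop_ms = round(stop_seconds * 1000)
    state = "QUIET"
    run_ms = 0  # duration of consecutive frames that contradict the current state
    for frame_confidence, frame_volume in frames:
        # Assumption: both thresholds must be met for a frame to count as speech.
        is_speech = frame_confidence >= confidence and frame_volume >= min_volume
        contradicts = is_speech if state == "QUIET" else not is_speech
        run_ms = run_ms + frame_ms if contradicts else 0
        if run_ms >= (start_ms if state == "QUIET" else stop_ms):
            state = "SPEAKING" if state == "QUIET" else "QUIET"
            run_ms = 0
        yield state
```

In this model a short `start_seconds` makes the detector react quickly to speech, while a longer `stop_seconds` keeps brief pauses inside a sentence from ending the turn.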
DTMF Input
Enable keypad (DTMF) input during voice calls:
dtmf_input:
  inter_digit_timeout_ms: 2500   # Time to wait for next digit (ms)
  termination_key: "#"           # Key to signal end of input
  maximum_count: 10              # Maximum number of digits
  ignore_speech: true            # Disable speech recognition during DTMF
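The three ways a keypad entry can end (termination key, digit limit, or a pause longer than the inter-digit timeout) can be sketched as follows. This is a simplified Python model rather than the platform's implementation: `collect_dtmf` is a hypothetical name, key presses are assumed to arrive as `(timestamp_ms, key)` pairs, and dropping the termination key from the result is an assumption.

```python
def collect_dtmf(key_presses, inter_digit_timeout_ms=2500,
                 termination_key="#", maximum_count=10):
    """Collect digits from (timestamp_ms, key) pairs and return them as a string."""
    digits = []
    last_ms = None
    for timestamp_ms, key in key_presses:
        if last_ms is not None and timestamp_ms - last_ms > inter_digit_timeout_ms:
            break  # the caller paused too long: input is treated as complete
        if key == termination_key:
            break  # explicit end of input; the termination key itself is not kept
        digits.append(key)
        last_ms = timestamp_ms
        if len(digits) >= maximum_count:
            break  # digit limit reached
    return "".join(digits)
```

For example, a caller who keys `1`, `2`, `3`, `#` in quick succession yields `"123"`, and anything pressed after `#` is ignored.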
User Idle Handler
Handle situations when the user stops responding:
user_idle_handler:
  enabled: true
  idle_timeout: 7                                # Timeout in seconds
  idle_max_reprompts: 2                          # Max reprompt attempts
  idle_timeout_message: "Are you still there?"   # Message to play
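As a rough illustration of how the timeout and the reprompt limit combine, the Python sketch below lists what would happen during a stretch of caller silence. It is a simplified model under stated assumptions: `idle_actions` is a hypothetical function, the timer is assumed to restart after each reprompt, and what the platform actually does once the reprompt limit is exceeded is not specified on this page, so it is represented by a placeholder.

```python
def idle_actions(silence_seconds, idle_timeout=7, idle_max_reprompts=2,
                 idle_timeout_message="Are you still there?"):
    """Return the actions taken while the caller stays silent for silence_seconds."""
    actions = []
    # Assumption: the idle timer restarts after every reprompt.
    timeouts = int(silence_seconds // idle_timeout)
    for attempt in range(1, timeouts + 1):
        if attempt <= idle_max_reprompts:
            actions.append(f"say: {idle_timeout_message}")
        else:
            # Placeholder: the behavior after the last reprompt is not documented here.
            actions.append("stop reprompting")
            break
    return actions
```

With the values shown above, 15 seconds of silence produces two reprompts, and a third timeout exhausts the limit.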
Agent Idle Handler
Provide feedback when the agent is processing:
agent_idle_handler:
  typing_enabled: true                               # Enable typing sound
  typing_duration_seconds: 5                         # Typing sound duration
  audio_clip_id: "guitar_1"                          # Hold audio clip
  hold_audio_seconds: 15                             # Hold audio duration
  pre_hold_message: "Please wait a moment..."        # Message before hold audio
  hold_message: "Still processing your request..."   # Periodic hold message
Complete Advanced Example
name: advanced_voice_config
speech_to_text:
  provider: deepgram_stt
  deepgram_stt_config:
    api_url: "wss://api.deepgram.com/v1/listen"
    api_key: "your_deepgram_api_key"
    model: "nova-2"
    language: "en-US"
    numerals: true
    mip_opt_out: false
text_to_speech:
  provider: deepgram_tts
  deepgram_tts_config:
    api_key: "your_deepgram_api_key"
    language: "en"
    voice: "aura-asteria-en"
    mip_opt_out: false
language: "en-US"
additional_languages:
  es:
    text_to_speech:
      provider: watson_tts
      watson_tts_config:
        api_url: "example-tts.url"
        api_key: "example tts key"
        voice: "example voice"
    speech_to_text:
      provider: watson_stt
      watson_stt_config:
        api_url: "example-stt.url"
        api_key: "example stt key"
        model: "example model"
# Advanced Settings
vad:
  enabled: true
  provider: "silero_vad"
  silero_vad_config:
    confidence: 0.7
    start_seconds: 0.2
    stop_seconds: 0.8
    min_volume: 0.6
dtmf_input:
  inter_digit_timeout_ms: 2500
  termination_key: "#"
  maximum_count: 10
  ignore_speech: true
user_idle_handler:
  enabled: true
  idle_timeout: 7
  idle_max_reprompts: 2
  idle_timeout_message: "Are you still there?"
agent_idle_handler:
  typing_enabled: true
  typing_duration_seconds: 5
  audio_clip_id: "guitar_1"
  hold_audio_seconds: 15
  pre_hold_message: "Please wait a moment..."
  hold_message: "Still processing your request..."
Importing voice configurations
After creating your YAML file, import it using the following command:
orchestrate voice-configs import --file <path-to-your-voice-file>
--file: Path to the YAML file with the voice configuration.
Note:
The import command matches voice configurations by name. If a voice configuration with the same name already exists, the command updates it with the new configuration; otherwise it creates a new one.
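The create-or-update behavior described in the note can be pictured as an upsert keyed by `name`. The Python sketch below is a conceptual model only, not the ADK's code: `import_voice_config` and the in-memory `store` are hypothetical, and treating an update as a full replacement of the stored configuration (rather than a field-by-field merge) is an assumption.

```python
def import_voice_config(store: dict, config: dict) -> str:
    """Insert or replace `config` in `store`, keyed by its name."""
    name = config["name"]
    action = "updated" if name in store else "created"
    # Assumption: an update replaces the stored configuration as a whole.
    store[name] = config
    return action
```

Importing two files that share the same `name` therefore leaves a single configuration in place, holding the values from the second file.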
Listing voice configurations
To list all available voice configurations:
orchestrate voice-configs list
Show full details of all voice configurations in JSON format.
Getting voice configuration details
To retrieve details of a specific voice configuration:
orchestrate voice-configs get --name <config_name>
Identify the voice configuration to retrieve by its ID or by its name (--name). One of the two is required.
Exporting voice configurations
To export a voice configuration to a YAML file:
orchestrate voice-configs export --name <config_name> --output <output_path>
Identify the voice configuration to export by its ID or by its name (--name). One of the two is required.
--output: Path where the YAML file should be saved.
Removing voice configurations
To remove a voice configuration:
orchestrate voice-configs remove --name <config_name>
--name: Name of the voice configuration to remove.
Important:
When you remove a voice configuration, agents using this configuration will no longer have voice capabilities. Make sure you want to remove the configuration before you proceed.