Building Applications Frequently Asked Questions

This FAQ contains answers to questions about building VoiceXML applications using Tellme Studio.

  1. How do I prompt users or deliver information to callers?
  2. What formats do you support for recorded audio?
  3. What are the optimal settings for recorded audio?
  4. What happens if I use non-optimal audio file formats?
  5. What does Tellme use for producing text-to-speech?
  6. How good is the text-to-speech engine?
  7. What "voices" are available for text-to-speech?
  8. In what ways can I alter the TTS output?
  9. Does the TTS engine support languages other than English?

Q: What type of language is VoiceXML?
A: VoiceXML is a declarative, XML-based language composed of elements that describe the human-machine interaction provided by a voice response system. This includes:
  • Output of audio files and synthesized speech (text-to-speech).
  • Recognition of spoken and DTMF input.
  • Control of telephony features such as call transfer and disconnect.
  • Direction of the call flow based on user input.
VoiceXML may also embed meta-information, references to other VoiceXML files, and JavaScript code that implements client-side logic.
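
As a rough illustration of the four capabilities listed above, here is a minimal sketch of a VoiceXML document. It uses element names from the W3C VoiceXML 2.0 specification and has not been checked against the Tellme Platform. The form ids, the field name, and the telephone number are hypothetical placeholders, and support for the built-in boolean field type is assumed.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="main">
    <!-- Recognition: the built-in boolean type (assumed to be supported)
         accepts spoken yes/no and the DTMF keys 1 (yes) and 2 (no). -->
    <field name="wants_agent" type="boolean">
      <!-- Output: this prompt text is rendered as synthesized speech. -->
      <prompt>Would you like to speak with an agent?</prompt>
      <filled>
        <!-- Call flow: branch on the caller's answer. -->
        <if cond="wants_agent">
          <goto next="#connect"/>
        <else/>
          <!-- Telephony control: hang up. -->
          <disconnect/>
        </if>
      </filled>
    </field>
  </form>

  <form id="connect">
    <!-- Telephony control: transfer the call to a placeholder number. -->
    <transfer name="agent_call" dest="tel:+18005550100" bridge="true">
      <prompt>Please hold while I connect you.</prompt>
    </transfer>
  </form>
</vxml>
```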

Q: How do you handle speakers with foreign accents?
A: Tellme's work on 1-800-555-TELL, which is designed to be usable by the widest possible range of speakers, has shown that the Microsoft speech recognition engine employed by the Tellme Platform does a very good job with callers who have strong foreign accents.

Q: How do I prompt users or deliver information to callers?
A: Audio output is used to prompt callers for input and to deliver information to them. The audio may be specified as a pre-recorded audio file, or as text that is converted to speech on the fly using the Tellme Voice Application Network's text-to-speech (TTS) engine.

Similar to how images are specified in HTML, the audio file is specified by a URL. The Tellme Platform retrieves the audio file before playing it to the user. The application can also specify text to be played to the user (using TTS) in the event that the audio file cannot be retrieved.
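
The following sketch shows one way this might look, using the audio element as defined in the W3C VoiceXML 2.0 specification. The URL, the form id, and the wording of the fallback text are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="welcome">
    <block>
      <prompt>
        <!-- The platform fetches the WAV file from the URL. If the file
             cannot be retrieved, the enclosed text is spoken by the TTS
             engine instead. -->
        <audio src="http://www.example.com/audio/welcome.wav">
          Welcome to the example service.
        </audio>
      </prompt>
    </block>
  </form>
</vxml>
```

Keeping the fallback text identical to the recorded wording means callers hear the same message either way, only in a synthesized voice.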

Pre-recorded audio is almost always preferred in production applications, as it provides the most natural-sounding interface. Text-to-speech is useful for rapid application prototyping and for generating output from data whose content cannot be known in advance, such as an application that reads a user's e-mail over the phone.

Q: What formats do you support for recorded audio?
A: WAV files using PCM, μ-law, or A-law encoding with the following parameters:

  Encoding  Bit length   Channels        Sample rates (kHz)
  PCM       8 or 16 bit  mono or stereo  8, 11.025, 16, 22.05, 44.1
  μ-law     8 bit        mono or stereo  8
  A-law     8 bit        mono or stereo  8

Based on business need, we may support other formats in the future.

Q: What are the optimal settings for recorded audio?
A: For optimal playback efficiency, audio should be recorded in 8 kHz, 8-bit, mono μ-law (G.711) format. Although the Tellme Platform supports higher-fidelity formats, the current phone network supports audio only at these settings.

Q: What happens if I use non-optimal audio file formats?
A: An application that uses a non-optimal audio file format will operate correctly but will waste network resources. Non-optimal audio files are converted on the fly to 8 kHz, 8-bit, mono μ-law as they are played to the user. If a great many non-optimal audio files require conversion at once, the audio servers can become inundated with conversion requests, resulting in poor audio playback. It also takes longer to retrieve the larger, high-resolution audio files over the Internet, wasting bandwidth for both Tellme and the application owner.

Q: What does Tellme use for producing text-to-speech?
A: Tellme uses the AT&T Natural Voices TTS engine.

Q: How good is the text-to-speech engine?
A: The TTS engine can convert most English sentences to understandable speech. It also handles certain special cases, such as text containing dollar amounts. For example, it will read "$3.75" as "three dollars and seventy-five cents."

Though text-to-speech technology has come a long way since the talking computer in "WarGames", the output still has a computer-generated quality. Pre-recorded audio, though time-consuming to create, is the only way to produce a natural, human-sounding voice application interface. Widespread use of TTS within an application is not recommended.

Q: What "voices" are available for text-to-speech?
A: The Tellme TTS engine supports only one voice today, an adult female. Tellme may make additional voices available in the future, depending on business need and customer feedback.

Q: In what ways can I alter the TTS output?
A: None of the TTS parameters (talking speed, pitch, voice, etc.) may be modified by voice applications running on today's Tellme Voice Application Network. These parameters may be exposed in the future based on business need and customer/developer feedback.

Q: Does the TTS engine support languages other than English?
A: No. The TTS engine is designed to produce correct output for English words only. If the engine encounters a word it does not know, it makes its best guess at how the word should be pronounced. This produces poor pronunciations for most foreign languages.
