Building Applications Frequently Asked Questions
This FAQ contains answers to common questions about building VoiceXML applications on the Tellme Platform.
- What's a voice application?
- IVR has been around for years. What's so special about these new voice applications?
- What technologies are used to build voice applications?
- Is VoiceXML powerful enough to write full-fledged voice applications?
- What options are there for implementing a voice application?
- What is the architecture of a voice application on the Tellme Platform?
- Why is the Tellme VoiceXML interpreter called a "client"? Doesn't Tellme provide a server-based platform?
- Where is the business logic of the application implemented?
- Does Tellme support static VoiceXML pages as well as dynamically generated pages?
- What technologies do customers use to generate the VoiceXML?
- Does Tellme host a customer's voice applications?
- How can I integrate backend data into my voice application?
- Can I personalize my phone application based on user characteristics?
- Does Tellme automatically convert my application into voice by screen scraping my Web site?
- Can I re-implement my DTMF-based IVR application using VoiceXML on Tellme?
- How does an application maintain state across VoiceXML pages?
- Does the voice-browser accept cookies?
- What's the purpose of VoiceXML?
- What's the history of VoiceXML?
- Are VoiceXML and VXML the same thing?
- Is VoiceXML an open industry standard?
- What version of VoiceXML does Tellme support?
- What is Tellme's role in evolving VoiceXML as a standard?
- What is Tellme's philosophy regarding open standards?
- Where can I get additional information on VoiceXML?
- How do I tell the application what spoken input to expect?
- Does Tellme provide any "pre-built" grammars?
- What is grammar "tuning"?
- Does Tellme support recognition of languages other than English?
- What are the limits on grammar size?
- How do I prompt users or deliver information to callers?
- What formats do you support for recorded audio?
- What are the optimal settings for recorded audio?
- What happens if I use non-optimal audio file formats?
- What does Tellme use for producing text-to-speech?
- How good is the text-to-speech engine?
- What "voices" are available for text-to-speech?
- In what ways can I alter the TTS output?
- Does the TTS engine support languages other than English?
Q: What's a voice application?
A: A voice application is a specific type of "phone application". A phone application is an application that
interacts with a person over the telephone in an automated fashion. A bank's automated system that answers a call
and asks the caller to "press one for your account balance, two to transfer funds between accounts, or three to
speak to a customer service representative" is a perfect example of a phone application. Phone applications using
the telephone keypad for input, referred to in the industry as Interactive Voice Response (IVR) applications,
have been commonplace for many years, especially among large companies such as banks and insurance companies.
Recent advances in speech recognition technology have allowed the creation of a new type of application where the
user interacts with the application by speaking to it rather than entering information through the telephone
keypad. This type of application is called a "voice application".
Q: IVR has been around for years. What's so special about these new voice applications?
A: First, they provide a natural interface for human-computer interaction over the phone. Callers find
speaking into the telephone more intuitive than pressing keys on a telephone keypad.
Second, they transcend the limitations of the keypad-based interface. The expressiveness of a voice-driven
interface enables much more complex interactions than those supported with the keypad only. For example,
selecting one item from a large list of items, such as a particular stock whose quote you'd like to hear, is
difficult and awkward when using only the keypad. Though schemes exist that allow the "spelling" of the stock
symbol through the keypad, none are as intuitive as just being able to speak the name of the stock.
Third, they implement a hands-free interface perfect for mobile users. IVR applications requiring button presses
pull callers' attention away from other activities. Voice applications allow callers to focus on multiple things
at once. This is especially important if the caller is driving or juggling luggage while running through an airport.
Q: What technologies are used to build voice applications?
A: There are three layers of technology required to implement a voice application: the telephony layer, the
voice platform layer and the integration layer.
The telephony layer answers incoming calls, performs call management, and connects the caller with a running
instance of an application. This involves the installation and management of carrier connections, switches, call
distributors, and the software necessary to keep them up and running.
The voice platform layer provides the environment in which the voice application is run. It is responsible for
providing the following functionality:
- Speech recognition. Interprets callers' spoken input.
- Streaming audio. Plays audio files for prompting callers and providing information.
- Text-to-speech. Automatically generates speech when pre-recorded audio isn't available.
- Voice application interpreter. Coordinates the playing of prompts, invocation of the speech recognizer, and implementation of application logic according to callers' responses.
The integration layer links the voice application with computing infrastructure external to the application. This includes resources such as databases, call-center management systems, transaction processing systems, and legacy applications. The specific technologies used to do this vary based on the systems to be integrated.
Q: What options are there for implementing a voice application?
A: A business has a continuum of solutions for building and deploying a voice application. These solutions
range from pure do-it-yourself solutions to pure-outsourced solutions.
With a do-it-yourself solution, a business hand crafts each technology layer of the voice application. First, it
negotiates a carrier contract, then acquires and deploys the telephony hardware and software in an appropriate
facility. Next, it evaluates and purchases the various components of the voice platform: the speech recognition
engine, text-to-speech engine, voice application interpreter and so on. Finally, it builds the application plus
the interfaces that integrate it with databases and other applications.
A step up from the pure do-it-yourself solution is to outsource the telephony layer while building the voice
platform and application integration layers. In this solution, the telephony-outsourcing vendor manages the
telephony infrastructure, and provides an environment in which a business runs the other application layers. The
business is still responsible for researching, evaluating and assembling their own voice platform in addition to
building the application itself.
The next solution, and the one implemented by Tellme, is one step beyond the "hosted telephony" solution. The
Tellme solution is to allow businesses to outsource both the telephony and voice platform layers, so that they
need only build the application and integrate it with their back-end systems. This allows them to concentrate on
building the logic of their applications rather than worrying about the complexities of deploying a robust,
scalable voice platform and telephony infrastructure.
The final voice application solution is to outsource all technology layers. With this solution, a business allows
the outsourcing provider to manage the telephony infrastructure, provide the voice platform and also specify the
integration mechanisms. This solution lets businesses concentrate purely on building their application, but
limits the back-end systems to which they can integrate, to those specifically supported by the outsourcing
vendor. Since many businesses have proprietary back-end systems and custom integration requirements, these
solutions are appropriate only for the simplest of applications.
Q: What is the architecture of a voice application on the Tellme Platform?
A: The Tellme voice application architecture integrates powerful voice-recognition technology with the
familiar application model of the Web, providing a mechanism to quickly and easily augment existing Web
applications with voice-driven interfaces. It is easiest to explain the Tellme voice application architecture by
comparing it with the architecture of a Web application.
A Web application is implemented as a series of HTML pages retrieved from a Web server by a browser. The
browser's job is to retrieve pages from the Web server over HTTP, and to visually render these pages for the
user. User input, collected through HTML forms, is passed to the server via HTTP requests for processing, and the
server generates a response back to the browser. This processing typically involves back-end business logic,
legacy system integration and database access.
Tellme has harnessed the simplicity of this well-understood architecture, and applied it to the development of
voice applications. The architecture of a voice application is very similar to that of a Web application, with the following differences:
- The user interface of the application is a voice-driven call-flow instead of a visual Web page.
- The interface is represented as a sequence of VoiceXML pages, instead of HTML pages.
- The Tellme VoiceXML interpreter plays the role of the Web browser, fetching VoiceXML pages from the Web server and rendering them over the phone as voice and DTMF-driven call flows.
In all other respects, the architectures are the same. They implement a stateless request/response application model. The client interacts with the user to collect input and sends HTTP GET or POST requests to pass the input to the server and present the next "page" of the user interface. Application logic and legacy system integration is implemented on the server through common Web back-end technologies such as CGI, NSAPI, ASP or JSP. In fact, voice and Web applications often use the same back-end infrastructure components. The only difference is that the Web application presents data visually using HTML whereas the voice application presents data audibly using VoiceXML. Finally, even the client-side logic capabilities are the same, with VoiceXML supporting embedded ECMAScript (JavaScript) just as HTML does.
Q: Why is the Tellme VoiceXML interpreter called a "client"? Doesn't Tellme provide a server-based platform?
A: Technically, yes. The Tellme Platform is composed of a large number of servers managing phone calls and running
VoiceXML applications. However, architecturally speaking, the Tellme Platform plays the role of a client speaking
to a Web server over the Internet.
One way of looking at this is to think of the Tellme Platform as a mechanism that converts a normal telephone
into a type of Web browser that interacts with the user via speech rather than visually. Both the voice browser
and Web browser are Web server clients.
Q: Where is the business logic of the application implemented?
A: Just as with a Web application, the business logic of a voice application may be implemented on the
client-side (the Tellme Platform), the server-side, or both. Most applications split the application logic across the two.
Client-side logic runs on the Tellme Platform as part of interpreting the VoiceXML page. It is implemented using ECMAScript (JavaScript) embedded in the VoiceXML page. Because it runs within the interpreter,
it has the ability to directly alter the behavior of the user interface. This makes it ideal for performing
UI-related tasks such as validating data, randomizing user prompts (to give the interface a more human feel), and
modifying call parameters and behavior on-the-fly (such as timeout durations).
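As an illustration of client-side prompt randomization, here is a minimal VoiceXML sketch. The element names follow the VoiceXML 2.0 specification; the greeting phrases are invented for this example:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Pick one of several greeting variants at random to give the
       interface a more human feel. -->
  <script>
    var greetings = ["Welcome back!", "Good to hear from you!", "Hello again!"];
    var greeting = greetings[Math.floor(Math.random() * greetings.length)];
  </script>
  <form id="main">
    <block>
      <prompt><value expr="greeting"/></prompt>
    </block>
  </form>
</vxml>
```

Because the script runs in the interpreter, no round-trip to the Web server is needed to vary the prompt.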
Server-side logic, on the other hand, runs at the customer site as part of the process of dynamically generating
VoiceXML pages based on requests from the client portion of the application. It is implemented using the same
technologies used with Web applications. While different applications often use very different server-side
implementation technologies, it doesn't matter to the Tellme Platform as long as valid VoiceXML is generated as a result.
Since the server-side logic runs at the customer site, it is ideally situated to access customer databases and
integrate with other computer systems. This also provides the opportunity to re-use the existing Web application
infrastructure and logic. For example, code modules responsible for accessing databases, enforcing security
policies and implementing business rules may often be shared across Web and voice applications. This sharing can
dramatically simplify voice application development by focusing efforts on designing and building the voice
interface instead of re-inventing the back-end integration mechanisms.
Q: Does Tellme support static VoiceXML pages as well as dynamically generated pages?
A: Yes. The platform makes a standard HTTP request to retrieve the VoiceXML pages to execute. It makes no
difference to the platform whether the page is stored statically on the Web server, or whether it is dynamically
generated. In fact, most applications employ a combination of the two.
Q: What technologies do customers use to generate the VoiceXML?
A: Any technology used to generate Web pages may be used to generate VoiceXML pages.
Q: Does Tellme host a customer's voice applications?
A: No. The VoiceXML files are completely managed by the customer. The Tellme VoiceXML interpreter retrieves
them across the Internet from the customer's Web site.
Q: How can I integrate backend data into my voice application?
A: You can integrate backend data into your voice application by generating VoiceXML using a server-side framework such as CGI, ASP, or JSP. Using whatever database API is supported by these server-side frameworks (DBI, ODBC, OLE DB), access your backend database, generate VoiceXML on the fly containing that data, and return it to the Tellme VoiceXML interpreter.
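For example, a server-side script that looks up an account balance might emit VoiceXML like the following. The form name and dollar amount are invented for illustration; any framework that can print text to an HTTP response can produce this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="balance">
    <block>
      <!-- The dollar amount below would be substituted from a
           database query at page-generation time. -->
      <prompt>Your current account balance is $1,234.56.</prompt>
    </block>
  </form>
</vxml>
```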
Q: Can I personalize my phone application based on user characteristics?
A: Yes. By using the same techniques as in a Web application, a voice application may access user profile
data stored in a database to generate personalized VoiceXML pages.
Q: Does Tellme automatically convert my application into voice by screen scraping my Web site?
A: No. Some companies have technology that can "read" a Web page and convert it on the fly into a voice
application. Tellme doesn't do this because there are fundamental differences between well-designed Web and
speech user interfaces. Programs automatically converting a Web interface to speech produce rudimentary
applications which fail to exploit speech's unique strengths and which don't take into account the particular context in which the application will be used.
For example, imagine two different Web pages to be converted to speech. One contains a list of step-by-step
driving directions, the other a list of stocks and quotes for a portfolio. Callers asking for driving directions
are likely to be calling from their car, and will want the directions read back to them one at a time, as they
complete each step. The application should pause between each step and wait for the user's command to proceed. On
the other hand, the list of stocks and their prices should be read back at a more rapid pace, automatically
proceeding to the next quote after a short pause. Users would find it tedious and annoying if they were forced to
indicate each time they are ready to hear the next quote. As you can see, list navigation can be very context
sensitive. A program automatically converting a Web page to speech would have no ability to discern the
difference between these applications and would produce applications poorly suited to their intended purposes.
Q: Can I re-implement my DTMF-based IVR application using VoiceXML on Tellme?
A: Yes. The Tellme Voice Application Network provides robust support for DTMF and voice applications as part
of the VoiceXML standard. Well-designed voice applications can provide support for both simultaneously.
Q: How does an application maintain state across VoiceXML pages?
A: An application can maintain state with HTTP cookies or with VoiceXML variables, both of which persist across the execution of VoiceXML pages.
Just like a Web browser, the Tellme Platform allows a Web site to set and retrieve cookies associated with a
user's session. A session begins when Tellme answers a call, and ends when the call is finished. Cookies on the
Tellme Platform follow all of the same rules as Web cookies with regards to content, expiration periods and
security. By default, all cookies are session cookies, and will disappear at the end of the user's call.
Persistent cookies may also be created, but require that callers first be identified using their Tellme sign-in
numbers and passwords. Please contact Tellme for further information on this capability.
VoiceXML also supports variables for holding state within a call. Values stored in these variables may be accessed anywhere within the context of the same VoiceXML application.
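A sketch of state held in a VoiceXML variable follows. The variable and field names are invented; the scoping behavior is as described in the VoiceXML 2.0 specification:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- A variable declared at document level in the application root
       document is visible to every page of the application. -->
  <var name="callerZipCode"/>
  <form id="getzip">
    <field name="zip" type="digits">
      <prompt>Please say your five digit zip code.</prompt>
      <filled>
        <!-- Store the recognized value for use on later pages. -->
        <assign name="callerZipCode" expr="zip"/>
      </filled>
    </field>
  </form>
</vxml>
```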
Q: Does the voice-browser accept cookies?
A: Yes. See the previous question on maintaining application state.
Q: Is VoiceXML powerful enough to write full-fledged voice applications?
A: Yes, absolutely! Every one of the applications running on the Tellme Voice Application Network is written in VoiceXML, and together they form one of the most sophisticated sets of voice applications ever built.
Another way to think about it is to consider that a voice application is nothing but a Web application with a
speech-driven interface. Just like Web applications, voice applications can perform hard-core data processing,
integration of disparate data sources and legacy systems, and complex user interactions.
Q: What's the purpose of VoiceXML?
A: From the VoiceXML 1.0 Specification:
VoiceXML's main goal is to bring the full power of Web development and content delivery to voice
response applications, and to free the authors of such applications from low-level programming and resource
management. It enables integration of voice services with data services using the familiar client-server paradigm.
In other words, VoiceXML provides a way to build a voice application that doesn't rely upon proprietary
techniques, but instead leverages the same well-known infrastructure used to build Web sites. This opens up the
world of voice applications to a wider audience of developers, and simplifies the integration of phone-based
interfaces with existing applications and Web infrastructures.
Q: What type of language is VoiceXML?
A: VoiceXML is a declarative, XML-based language comprised of elements that describe the human-machine
interaction provided by a voice response system. This includes:
- Output of audio files and synthesized speech (text-to-speech).
- Recognition of spoken and DTMF input.
- Control of telephony features such as call transfer and disconnect.
- Direction of the call flow based on user input.
- Embedded ECMAScript to implement client-side logic.
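A minimal VoiceXML document illustrating several of these elements; the audio URL, form name, and grammar file are placeholders invented for this sketch:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="main_menu">
    <field name="choice">
      <!-- Play a pre-recorded prompt, falling back to synthesized
           speech if the audio file cannot be retrieved. -->
      <prompt>
        <audio src="http://example.com/audio/menu.wav">
          Would you like sports or weather?
        </audio>
      </prompt>
      <!-- Spoken and DTMF input is constrained by a grammar. -->
      <grammar type="application/srgs+xml" src="menu.grxml"/>
      <filled>
        <!-- Direct the call flow based on the caller's answer. -->
        <if cond="choice == 'sports'">
          <goto next="sports.vxml"/>
          <else/>
          <goto next="weather.vxml"/>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```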
Q: What's the history of VoiceXML?
A: The VoiceXML language is a product of the VoiceXML Forum, an industry consortium led by AT&T, IBM,
Lucent and Motorola. The Forum was created to develop and promote the standards necessary to jump-start the
growth of speech-enabled Internet applications. In August 1999, the forum released version 0.9 of the VoiceXML
specification. The final 1.0 specification was released in March 2000 after industry review and comment.
Though the VoiceXML forum was led by industry heavyweights, it needed to make VoiceXML an official standard to
guarantee the widespread industry support that would make VoiceXML a success. In May 2000, the Forum submitted
the VoiceXML 1.0 specification to the W3C, where it was accepted by the Voice Browser working group as the basis
for developing a standard language for interactive voice response applications.
VoiceXML 2.0 became a Full W3C Recommendation in March of 2005.
VoiceXML 2.1 became a W3C Candidate Recommendation in June of 2005.
At the time of this writing, VoiceXML 2.1 is only two small steps away from becoming a Full W3C Recommendation.
Q: Are VoiceXML and VXML the same thing?
A: Yes. "VoiceXML" is the official name of the specification. "VXML" is a commonly used abbreviation.
Q: Is VoiceXML an open industry standard?
A: The VoiceXML 1.0 specification has not yet been officially ordained by the W3C as a "standard", but it's
the only such language that is both undergoing standardization and that has widespread industry support. In
addition to the four founding partners, the VoiceXML Forum and the VoiceXML specification are supported by almost
200 companies including 3Com, Ericsson, France Telecom, General Magic, Hewlett-Packard, Interactive Telesis,
Nortel, Oracle, Siemens, SpeechWorks International and, of course, Tellme.
In fact, multiple companies have already developed, or are committed to developing, VoiceXML-based platforms.
These include Microsoft Tellme, Nuance, Speechworks, IBM, Motorola and Lucent.
Q: What version of VoiceXML does Tellme support?
A: The Tellme Voice Browser supports both the VoiceXML 2.0 and VoiceXML 2.1 specifications.
Q: What is Tellme's role in evolving VoiceXML as a standard?
A: The W3C has accepted VoiceXML 1.0 as a note, and is using it as the basis for defining a Speech Dialog
Markup Language. Tellme has been accepted as a member of the W3C's Voice Browser working group, and has several
people actively working in several sub-groups, including the Speech Dialog Markup Language, Grammar and Speech
Synthesis sub-groups. Tellme participates in the weekly Voice Browser conference call and may soon take on
specification-drafting/editing roles as well.
Tellme was recently selected to be one of the six Editors co-authoring the VoiceXML specification in the W3C.
Tellme is also an active, supporting member of the VoiceXML Forum and participated in the last VoiceXML Forum
meeting held during Spring Internet World in April 2000.
Tellme is committed to implementing VoiceXML, while helping to drive its evolution based on implementation and
real-world experience. By extending the language to meet the needs of the development community and working with
the W3C and VoiceXML Forum to have those extensions adopted in a vendor-independent fashion, Tellme will advance
the state-of-the-art in voice application development while maintaining cross-vendor compatibility.
Q: What is Tellme's philosophy regarding open standards?
A: Tellme believes that customers are tired of proprietary solutions and is committed to providing
best-of-breed technology through open standard protocols and interfaces. This includes not just VoiceXML, but other open Internet standards as well.
Not only is Tellme building products that comply with appropriate industry standards, but Tellme is also a leader
in the voice application space and is driving these standards based on its experience building and running the
first commercial service based on VoiceXML.
Q: How do I tell the application what spoken input to expect?
A: The voice application uses a grammar to define what utterances are legal for the caller to say at a
particular point in the application. (An utterance is speech input before it has been recognized by the voice
recognizer as a specific response.) A grammar represents the set of accepted inputs via a list of regular
expressions. A grammar representing the possible answers to the question "How do you travel from home to work?"
might specify the possible utterances like this in SRGS/GRXML format:
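A sketch of such a grammar follows. The rule name and the specific phrases are invented for illustration; the element names follow the W3C SRGS 1.0 specification:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" root="commute">
  <rule id="commute" scope="public">
    <!-- Exactly one of these answers is accepted. -->
    <one-of>
      <item>I drive</item>
      <item>car</item>
      <item>bus</item>
      <item>train</item>
      <item>I walk</item>
    </one-of>
  </rule>
</grammar>
```

A real production grammar would typically add many more phrasings of each answer, which is part of the grammar tuning process discussed below.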
The VoiceXML language lets the developer specify which grammars are in force at any point in the application via
a grammar scoping capability. Grammars are included into a VoiceXML file either in-line, or through references to
external grammar files.
The Tellme Platform currently supports SRGS/GRXML grammar format, with legacy support for the GSL grammar format.
Q: Does Tellme provide any "pre-built" grammars?
A: Yes, the Tellme Platform provides access to many pre-built grammars. These grammars are either commonly
used, difficult to create, or require constant maintenance. The list below contains the full set of pre-built grammars:
- General: Yes/No
- Credit Cards: Expiration date, Expiration month, Expiration Year, Credit Card Number
- Date/Time: Day of month, Day of year, Month, Year, Date, AM/PM, TimeDuration (in minutes or days), Hour
- Financial: US Dollars (no cents), US Money (dollars and cents)
- Locations: City/State
- Numbers: Digits, Natural numbers, Percentages, Social-Security Numbers
- Telephone: US Phone number, Phone extension, Area Code, 7-digit phone, 10-digit phone
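In standard VoiceXML 2.0, several of these common grammars are available as built-in field types. Whether Tellme's pre-built grammars are exposed this way or as external grammar references is platform-specific, so treat this as a sketch:

```xml
<form id="collect_card">
  <!-- The built-in "digits" type accepts a spoken or keyed digit string. -->
  <field name="card_number" type="digits">
    <prompt>Please say or enter your credit card number.</prompt>
  </field>
  <!-- The built-in "date" type accepts spoken dates. -->
  <field name="expiration" type="date">
    <prompt>What is the expiration date?</prompt>
  </field>
</form>
```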
Q: What is grammar "tuning"?
A: A voice application, like any user-centric application, is prone to certain problems that may only be
discovered through formal usability testing, or observation of the application in use. Poor speech recognition
accuracy is one type of problem common to voice applications, and a problem most often caused by poor grammar
implementation. When users mispronounce words or say things unexpected by the grammar designer, the recognizer
cannot match their input against the grammar. Poorly designed grammars containing many difficult-to-distinguish
entries will also result in many misrecognized inputs.
Grammar tuning is the process of improving recognition accuracy by modifying a grammar based on an analysis of
its performance. Tuning is often performed during an iterative process of usability testing and application
improvement and may involve amending the grammar with commonly spoken phrases, removing highly confusable words,
and adding additional ways that callers may pronounce a word.
Q: How do you handle speakers with foreign accents?
A: Tellme's work on its consumer voice service, which naturally is designed to be usable by the widest range of speakers, has shown that the Microsoft speech recognition engine employed by the Tellme Platform does a very good job with callers who have strong foreign accents.
Q: Does Tellme support recognition of languages other than English?
A: Not currently. For now, Tellme is focused on maximizing reliability of English recognition over the
phone. Adding a new language is a non-trivial task. Even if a model exists which lets a recognizer understand a
native speaker under ideal acoustic conditions, making that same model work reliably when confronted with the
noisy audio environment of the phone is very difficult.
Q: What are the limits on grammar size?
A: Though the speech recognizer doesn't place a specific limit on grammar size, several practical
considerations serve to effectively limit the maximum grammar size to tens of thousands of entries. First, as the
grammar grows in size, the recognizer must test an utterance against a larger number of possibilities, slowing
the recognizer considerably. Second, the larger the grammar, the greater the possibility that it contains words
that are easily confusable. This lowers the accuracy of the grammar. However, with careful planning, it is
possible to create large grammars that are still effective. Tellme has production grammars with up to thirty thousand entries.
Q: How do I prompt users or deliver information to callers?
A: Audio output is used to prompt callers for input and to deliver information to them. The audio may be
specified as a pre-recorded audio file, or as text that is converted to speech on the fly using the Tellme Voice
Application Network's text-to-speech (TTS) engine.
Similar to how images are specified in HTML, an audio file is specified by a URL. The Tellme Platform will
retrieve the audio file before it is played to the user. The program can also specify text to be played to the
user (using TTS) in the event that the audio file cannot be retrieved.
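The audio-with-TTS-fallback mechanism can be sketched as follows; the URL is a placeholder:

```xml
<prompt>
  <!-- If greeting.wav cannot be retrieved, the inline text is
       rendered with text-to-speech instead. -->
  <audio src="http://example.com/audio/greeting.wav">
    Welcome to the example service.
  </audio>
</prompt>
```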
Pre-recorded audio is almost always preferred in production applications, as it provides the most natural
sounding interface. Text-to-speech is useful for rapid application prototyping and for generating output from
data whose content cannot be known in advance. For example, an application that reads a user's e-mail over the phone must rely on TTS.
Q: What formats do you support for recorded audio?
A: WAV files using either PCM or µ-law encoding with the following parameters:
- Sample size: 8- or 16-bit
- Sample rate (kHz): 8, 11.025, 16, 22.05, 44.1
Based on business need, we may support other formats in the future.
Q: What are the optimal settings for recorded audio?
A: For optimal playback efficiency, audio should be recorded in 8 kHz, 8-bit µ-law (G.711), mono format. Although the Tellme Platform supports higher fidelity formats, the current phone network only supports audio at this fidelity.
Q: What happens if I use non-optimal audio file formats?
A: An application using non-optimal audio file formats will operate correctly, but will waste network
resources. Non-optimal audio files are converted on the fly to 8 kHz, 8-bit, mono µ-law as they are played to the
user. If a great many non-optimal audio files require conversion at once, the audio servers can become inundated
with conversion requests, resulting in poor audio playback. It will also take longer to retrieve the
larger, high-resolution audio files over the Internet, wasting Internet bandwidth for both Tellme and the customer.
Q: How good is the text-to-speech engine?
A: The TTS engine can convert most English sentences to understandable speech. It also handles certain
special cases, such as text containing dollar amounts. For example, it will read "$3.75" as "three dollars and seventy-five cents."
Though text-to-speech technology has come a long way since the talking computer in "War Games", the output still
has a computer-generated quality. Pre-recorded audio, though time consuming to create, is the only way to
generate a natural, human-sounding voice application interface. The widespread use of TTS within an application
is not recommended.
Q: What "voices" are available for text-to-speech?
A: The Tellme TTS engine supports only one voice today, an adult female. Tellme may make additional voices
available in the future, depending on business need and customer feedback.
Q: In what ways can I alter the TTS output?
A: None of the TTS parameters (talking speed, pitch, voice, etc.) may be modified by voice applications
running on today's Tellme Voice Application Network. These parameters may be exposed in the future based on
business need and customer/developer feedback.
Q: Does the TTS engine support languages other than English?
A: No. The TTS engine is designed to produce correct output for English words only. If the engine encounters
a word it does not know, it will make the best guess on how it should be pronounced. This will produce poor
pronunciations for most foreign languages.