Building Applications Frequently Asked Questions

This FAQ contains answers to questions about building VoiceXML applications using Tellme Studio.

Voice Applications
  1. What's a voice application?
  2. IVR has been around for years. What's so special about these new voice applications?
  3. What technologies are used to build voice applications?
  4. Is VoiceXML powerful enough to write full-fledged voice applications?

Application Architecture
  1. What options are there for implementing a voice application?
  2. What is the architecture of a voice application on the Tellme Platform?
  3. Why is the Tellme VoiceXML interpreter called a "client"? Doesn't Tellme provide a server-based solution?
  4. Where is the business logic of the application implemented?
  5. Does Tellme support static VoiceXML pages as well as dynamically generated pages?
  6. What technologies do customers use to generate the VoiceXML?
  7. Does Tellme host a customer's voice applications?
  8. How can I integrate backend data into my voice application?
  9. Can I personalize my phone application based on user characteristics?
  10. Does Tellme automatically convert my application into voice by screen scraping my Web site?
  11. Can I re-implement my DTMF-based IVR application using VoiceXML on Tellme?
  12. How does an application maintain state across VoiceXML pages?
  13. Does the voice browser accept cookies?

VoiceXML
  1. What's the purpose of VoiceXML?
  2. What type of language is VoiceXML?
  3. What's the history of VoiceXML?
  4. Are VoiceXML and VXML the same thing?
  5. Is VoiceXML an open industry standard?
  6. What version of VoiceXML does Tellme support?
  7. What is Tellme's role in evolving VoiceXML as a standard?
  8. What is Tellme's philosophy regarding open standards?
  9. Where can I get additional information on VoiceXML?

Grammars and Speech Recognition
  1. How do I tell the application what spoken input to expect?
  2. Does Tellme provide any "pre-built" grammars?
  3. What is grammar "tuning"?
  4. How do you handle speakers with foreign accents?
  5. Does Tellme support recognition of languages other than English?
  6. What are the limits on grammar size?

Audio and Text-to-Speech
  1. How do I prompt users or deliver information to callers?
  2. What formats do you support for recorded audio?
  3. What are the optimal settings for recorded audio?
  4. What happens if I use non-optimal audio file formats?
  5. What does Tellme use for producing text-to-speech?
  6. How good is the text-to-speech engine?
  7. What "voices" are available for text-to-speech?
  8. In what ways can I alter the TTS output?
  9. Does the TTS engine support languages other than English?
Q: What's a voice application?
A: A voice application is a specific type of "phone application". A phone application is an application that interacts with a person over the telephone in an automated fashion. A bank's automated system that answers a call and asks the caller to "press one for your account balance, two to transfer funds between accounts, or three to speak to a customer service representative" is a perfect example of a phone application. Phone applications using the telephone keypad for input, referred to in the industry as Interactive Voice Response (IVR) applications, have been commonplace for many years, especially among large companies such as banks, insurance companies, and airlines.

Recent advances in speech recognition technology have allowed the creation of a new type of application where the user interacts with the application by speaking to it rather than entering information through the telephone keypad. This type of application is called a "voice application".

Q: IVR has been around for years. What's so special about these new voice applications?
A: First, they provide a natural interface for human-computer interaction over the phone. Callers find speaking into the telephone more intuitive than pressing keys on a telephone keypad.

Second, they transcend the limitations of the keypad-based interface. The expressiveness of a voice-driven interface enables much more complex interactions than those supported with the keypad only. For example, selecting one item from a large list of items, such as a particular stock whose quote you'd like to hear, is difficult and awkward when using only the keypad. Though schemes exist that allow the "spelling" of the stock symbol through the keypad, none are as intuitive as just being able to speak the name of the stock.

Third, they implement a hands-free interface perfect for mobile users. IVR applications requiring button presses pull callers' attention away from other activities. Voice applications allow callers to focus on multiple things at once. This is especially important if the caller is driving or juggling luggage while running through an airport.

Q: What technologies are used to build voice applications?
A: There are three layers of technology required to implement a voice application: the telephony layer, the voice platform layer and the integration layer.

The telephony layer answers incoming calls, performs call management, and connects the caller with a running instance of an application. This involves the installation and management of carrier connections, switches, call distributors, and the software necessary to keep them up and running.

The voice platform layer provides the environment in which the voice application is run. It is responsible for providing the following functionality:
  • Speech recognition. Interprets callers' spoken input.
  • Streaming audio. Plays audio files for prompting callers and providing information.
  • Text-to-speech. Automatically generates speech when pre-recorded audio isn't available.
  • Voice application interpreter. Coordinates the playing of prompts, the invocation of the speech recognizer, and the execution of application logic according to callers' responses.
The integration layer links the voice application with computing infrastructure external to the application. This includes resources such as databases, call-center management systems, transaction processing systems, and legacy applications. The specific technologies used to do this vary based on the systems to be integrated.

Q: What options are there for implementing a voice application?
A: A business has a continuum of solutions for building and deploying a voice application, ranging from pure do-it-yourself solutions to purely outsourced solutions.

With a do-it-yourself solution, a business hand-crafts each technology layer of the voice application. First, it negotiates a carrier contract, then acquires and deploys the telephony hardware and software in an appropriate facility. Next, it evaluates and purchases the various components of the voice platform: the speech recognition engine, text-to-speech engine, voice application interpreter, and so on. Finally, it builds the application plus the interfaces that integrate it with databases and other applications.

A step up from the pure do-it-yourself solution is to outsource the telephony layer while building the voice platform and application integration layers. In this solution, the telephony-outsourcing vendor manages the telephony infrastructure, and provides an environment in which a business runs the other application layers. The business is still responsible for researching, evaluating and assembling their own voice platform in addition to building the application itself.

The next solution, and the one implemented by Tellme, is one step beyond the "hosted telephony" solution. The Tellme solution is to allow businesses to outsource both the telephony and voice platform layers, so that they need only build the application and integrate it with their back-end systems. This allows them to concentrate on building the logic of their applications rather than worrying about the complexities of deploying a robust, scalable voice platform and telephony infrastructure.

The final voice application solution is to outsource all technology layers. With this solution, a business allows the outsourcing provider to manage the telephony infrastructure, provide the voice platform and also specify the integration mechanisms. This solution lets businesses concentrate purely on building their application, but limits the back-end systems they can integrate with to those specifically supported by the outsourcing vendor. Since many businesses have proprietary back-end systems and custom integration requirements, these solutions are appropriate only for the simplest of applications.

Q: What is the architecture of a voice application on the Tellme Platform?
A: The Tellme voice application architecture integrates powerful voice-recognition technology with the familiar application model of the Web, providing a mechanism to quickly and easily augment existing Web applications with voice-driven interfaces. It is easiest to explain the Tellme voice application architecture by comparing it with the architecture of a Web application.

A Web application is implemented as a series of HTML pages retrieved from a Web server by a browser. The browser's job is to retrieve pages from the Web server over HTTP and to visually render these pages for the user. User input collected through HTML forms is passed to the server via HTTP requests for processing, and the server generates a response back to the browser. This processing typically involves back-end business logic, legacy system integration, and database access.

Tellme has harnessed the simplicity of this well-understood architecture, and applied it to the development of voice applications. The architecture of a voice application is very similar to that of a Web application with the following differences:
  1. The user interface of the application is a voice-driven call-flow instead of a visual Web page.
  2. The interface is represented as a sequence of VoiceXML pages, instead of HTML pages.
  3. The Tellme VoiceXML interpreter plays the role of the Web browser, fetching VoiceXML pages from the Web server and rendering them over the phone as voice and DTMF-driven call flows.
In all other respects, the architectures are the same. They implement a stateless request/response application model. The client interacts with the user to collect input, then sends HTTP GET or POST requests to pass the input to the server and retrieve the next "page" of the user interface. Application logic and legacy system integration are implemented on the server through common Web back-end technologies such as CGI, NSAPI, ASP or JSP. In fact, voice and Web applications often use the same back-end infrastructure components. The only difference is that the Web application presents data visually using HTML whereas the voice application presents data audibly using VoiceXML. Finally, even the client-side logic capabilities are the same, with VoiceXML supporting embedded JavaScript, just like HTML.
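
To make the parallel concrete, here is a minimal sketch of a VoiceXML 2.0 page (the URL, grammar file, and field name are hypothetical) that collects one spoken input and posts it to the server, which would respond with the next VoiceXML page:
    <?xml version="1.0" encoding="UTF-8"?>
    <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
      <form id="get_city">
        <field name="city">
          <prompt>For which city would you like a weather report?</prompt>
          <grammar src="cities.grxml" type="application/srgs+xml"/>
        </field>
        <block>
          <!-- Pass the recognized value to the server; the HTTP response is the next page. -->
          <submit next="http://www.example.com/weather.cgi" method="post" namelist="city"/>
        </block>
      </form>
    </vxml>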

Q: Why is the Tellme VoiceXML interpreter called a "client"? Doesn't Tellme provide a server-based solution?
A: Technically, yes. The Tellme Platform is composed of many servers that manage phone calls and run VoiceXML applications. Architecturally speaking, however, the Tellme Platform plays the role of a client speaking to a Web server over the Internet.

One way of looking at this is to think of the Tellme Platform as a mechanism that converts a normal telephone into a type of Web browser that interacts with the user via speech rather than visually. Both the voice browser and Web browser are Web server clients.

Q: Where is the business logic of the application implemented?
A: Just as with a Web application, the business logic of a voice application may be implemented on the client-side (the Tellme Platform), the server-side, or both. Most applications split the application logic across both environments.

Client-side logic runs on the Tellme Platform as part of interpreting the VoiceXML page. It is implemented using a combination of VoiceXML directives and embedded JavaScript. Since it is executed while the call is in progress, it can directly alter the behavior of the user interface. This makes it ideal for performing UI-related tasks such as validating data, randomizing user prompts (to give the interface a more human feel), and modifying call parameters and behavior on-the-fly (such as timeout durations).
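
As a small illustration (the greeting strings are invented), a page might use embedded JavaScript to randomize its prompt:
    <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
      <script>
        // Pick one of three greetings at random each time the page runs.
        var greetings = ['Hi there!', 'Welcome back!', 'Hello!'];
        var pick = greetings[Math.floor(Math.random() * greetings.length)];
      </script>
      <form>
        <block>
          <prompt><value expr="pick"/></prompt>
        </block>
      </form>
    </vxml>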

Server-side logic, on the other hand, runs at the customer site as part of the process of dynamically generating VoiceXML pages based on requests from the client portion of the application. It is implemented using the same technologies used with Web applications. While different applications often use very different server-side implementation technologies, it doesn't matter to the Tellme Platform as long as valid VoiceXML is generated as a result.

Since the server-side logic runs at the customer site, it is ideally situated to access customer databases and integrate with other computer systems. This also provides the opportunity to re-use the existing Web application infrastructure and logic. For example, code modules responsible for accessing databases, enforcing security policies and implementing business rules may often be shared across Web and voice applications. This sharing can dramatically simplify voice application development by focusing efforts on designing and building the voice interface instead of re-inventing the back-end integration mechanisms.

Q: Does Tellme support static VoiceXML pages as well as dynamically generated pages?
A: Yes. The platform makes a standard HTTP request to retrieve the VoiceXML pages to execute. It makes no difference to the platform whether the page is stored statically on the Web server, or whether it is dynamically generated. In fact, most applications employ a combination of the two.

Q: What technologies do customers use to generate the VoiceXML?
A: Any technology used to generate Web pages may be used to generate VoiceXML pages.

Q: Does Tellme host a customer's voice applications?
A: No. The VoiceXML files are completely managed by the customer. The Tellme VoiceXML interpreter retrieves them across the Internet from the customer's Web site.

Q: How can I integrate backend data into my voice application?
A: You can integrate backend data into your voice applications by generating VoiceXML using a server-side framework such as CGI, ASP, or JSP. Using whatever database API is supported by these server-side frameworks (DBI, ODBC, OLE DB), access your backend database, generate VoiceXML on the fly containing that data, and return it to the Tellme VoiceXML interpreter.
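
For example, after a database lookup, a server-side script might print a page like this back to the interpreter (the balance and file name are invented):
    <?xml version="1.0" encoding="UTF-8"?>
    <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
      <form>
        <block>
          <!-- The dollar amount below was read from the database and written into the page. -->
          <prompt>Your checking balance is $1,234.56</prompt>
          <goto next="mainmenu.vxml"/>
        </block>
      </form>
    </vxml>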

Q: Can I personalize my phone application based on user characteristics?
A: Yes. By using the same techniques as in a Web application, a voice application may access user profile data stored in a database to generate personalized VoiceXML pages.

Q: Does Tellme automatically convert my application into voice by screen scraping my Web site?
A: No. Some companies have technology that can "read" a Web page and convert it on the fly into a voice application. Tellme doesn't do this because there are fundamental differences between well-designed Web and speech user interfaces. Programs automatically converting a Web interface to speech produce rudimentary applications which fail to exploit speech's unique strengths and which don't take into account particular customer scenarios.

For example, imagine two different Web pages to be converted to speech. One contains a list of step-by-step driving directions, the other a list of stocks and quotes for a portfolio. Callers asking for driving directions are likely to be calling from their car, and will want the directions read back to them one at a time, as they complete each step. The application should pause between each step and wait for the user's command to proceed. On the other hand, the list of stocks and their prices should be read back at a more rapid pace, automatically proceeding to the next quote after a short pause. Users would find it tedious and annoying if they were forced to indicate each time they are ready to hear the next quote. As you can see, list navigation can be very context sensitive. A program automatically converting a Web page to speech would have no ability to discern the difference between these applications and would produce applications poorly suited to their intended purpose.

Q: Can I re-implement my DTMF-based IVR application using VoiceXML on Tellme?
A: Yes. The Tellme Voice Application Network provides robust support for both DTMF and voice input as part of the VoiceXML standard. Well-designed applications can support both simultaneously.

Q: How does an application maintain state across VoiceXML pages?
A: Either HTTP cookies or VoiceXML/JavaScript variables may be used to maintain application state across the execution of VoiceXML pages.

Just like a Web browser, the Tellme Platform allows a Web site to set and retrieve cookies associated with a user's session. A session begins when Tellme answers a call, and ends when the call is finished. Cookies on the Tellme Platform follow all of the same rules as Web cookies with regard to content, expiration periods and security. By default, all cookies are session cookies, and will disappear at the end of the user's call. Persistent cookies may also be created, but require that callers first be identified using their Tellme sign-in numbers and passwords. Please contact Tellme for further information on this capability.

VoiceXML application variables and JavaScript variables may also be used to maintain state across VoiceXML pages. Values stored in these variables may be accessed anywhere within the context of the same VoiceXML application.
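
As a sketch of the second approach (file names are hypothetical), a variable declared in an application root document is visible to every page that names that document as its application:
    <!-- root.vxml: the application root document -->
    <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
      <var name="favorite_city"/>
    </vxml>

    <!-- page.vxml: shares the root document's variables across page transitions -->
    <vxml version="2.0" application="root.vxml" xmlns="http://www.w3.org/2001/vxml">
      <form>
        <block>
          <prompt>You chose <value expr="application.favorite_city"/>.</prompt>
        </block>
      </form>
    </vxml>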

Q: Does the voice browser accept cookies?
A: Yes. See the above question on maintaining application state.

Q: Is VoiceXML powerful enough to write full-fledged voice applications?
A: Yes, absolutely! Every one of the applications running on 1-800-555-TELL was written on the Tellme Platform using nothing but VoiceXML and JavaScript. They represent the most extensive set of voice applications ever built.

Another way to think about it is to consider that a voice application is nothing but a Web application with a speech-driven interface. Just like Web applications, voice applications can perform hard-core data processing, integration of disparate data sources and legacy systems, and complex user interactions.

Q: What's the purpose of VoiceXML?
A: From the VoiceXML 1.0 Specification:
VoiceXML's main goal is to bring the full power of Web development and content delivery to voice response applications, and to free the authors of such applications from low-level programming and resource management. It enables integration of voice services with data services using the familiar client-server paradigm.
In other words, VoiceXML provides a way to build a voice application that doesn't rely upon proprietary techniques, but instead leverages the same well-known infrastructure used to build Web sites. This opens up the world of voice applications to a wider audience of developers, and simplifies the integration of phone-based interfaces with existing applications and Web infrastructures.

Q: What type of language is VoiceXML?
A: VoiceXML is a declarative, XML-based language composed of elements that describe the human-machine interaction provided by a voice response system. This includes:
  • Output of audio files and synthesized speech (text-to-speech).
  • Recognition of spoken and DTMF input.
  • Control of telephony features such as call transfer and disconnect.
  • Direction of the call flow based on user input.
VoiceXML may also embed meta-information, references to other VoiceXML files, and JavaScript code used to implement client-side logic.
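
A compact sketch touching several of these capabilities at once (the audio file name and transfer number are placeholders):
    <form id="menu">
      <field name="choice">
        <prompt>
          <audio src="menu.wav">Say agent, or say goodbye.</audio>
        </prompt>
        <grammar xmlns="http://www.w3.org/2001/06/grammar" root="m"
                 type="application/srgs+xml" version="1.0">
          <rule id="m">
            <one-of>
              <item>agent</item>
              <item>goodbye</item>
            </one-of>
          </rule>
        </grammar>
        <filled>
          <!-- Direct the call flow based on the caller's answer. -->
          <if cond="choice == 'goodbye'">
            <disconnect/>
          </if>
          <goto nextitem="xfer"/>
        </filled>
      </field>
      <!-- Telephony control: hand the call to a human agent. -->
      <transfer name="xfer" dest="tel:+18005551234"/>
    </form>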

Q: What's the history of VoiceXML?
A: The VoiceXML language is a product of the VoiceXML Forum, an industry consortium led by AT&T, IBM, Lucent and Motorola. The Forum was created to develop and promote the standards necessary to jump-start the growth of speech-enabled Internet applications. In August 1999, the Forum released version 0.9 of the VoiceXML specification. The final 1.0 specification was released in March 2000 after industry review and comment.

Though the VoiceXML Forum was led by industry heavyweights, it needed to make VoiceXML an official standard to guarantee the widespread industry support that would make VoiceXML a success. In May 2000, the Forum submitted the VoiceXML 1.0 specification to the W3C, where it was accepted by the Voice Browser working group as the basis for developing a standard language for interactive voice response applications. VoiceXML 2.0 became a Full W3C Recommendation in March of 2005. VoiceXML 2.1 became a W3C Candidate Recommendation in June of 2005. At the time of this writing, VoiceXML 2.1 is only two small steps away from becoming a Full W3C Recommendation.

Q: Are VoiceXML and VXML the same thing?
A: Yes. "VoiceXML" is the official name of the specification. "VXML" is a commonly used abbreviation.

Q: Is VoiceXML an open industry standard?
A: The VoiceXML 1.0 specification has not yet been officially ratified by the W3C as a "standard", but it's the only such language that is both undergoing standardization and has widespread industry support. In addition to the four founding partners, the VoiceXML Forum and the VoiceXML specification are supported by almost 200 companies including 3Com, Ericsson, France Telecom, General Magic, Hewlett-Packard, Interactive Telesis, Nortel, Oracle, Siemens, SpeechWorks International and, of course, Tellme.

In fact, multiple companies have already developed, or are committed to developing, VoiceXML-based platforms. These include Microsoft Tellme, Nuance, Speechworks, IBM, Motorola and Lucent.

Q: What version of VoiceXML does Tellme support?
A: The Tellme Voice Browser supports both the VoiceXML 2.0 and VoiceXML 2.1 specifications.

Q: What is Tellme's role in evolving VoiceXML as a standard?
A: The W3C has accepted VoiceXML 1.0 as a Note, and is using it as the basis for defining a Speech Dialog Markup Language. Tellme has been accepted as a member of the W3C's Voice Browser working group, and has people actively working in several sub-groups, including the Speech Dialog Markup Language, Grammar, and Speech Synthesis sub-groups. Tellme participates in the weekly Voice Browser conference call and may soon take on specification-drafting/editing roles as well.

Tellme was recently selected to be one of the six Editors co-authoring the VoiceXML specification in the W3C.

Tellme is also an active, supporting member of the VoiceXML Forum and participated in the last VoiceXML Forum meeting held during Spring Internet World in April 2000.

Tellme is committed to implementing VoiceXML, while helping to drive its evolution based on implementation and real-world experience. By extending the language to meet the needs of the development community and working with the W3C and VoiceXML Forum to have those extensions adopted in a vendor-independent fashion, Tellme will advance the state-of-the-art in voice application development while maintaining cross-vendor compatibility.

Q: What is Tellme's philosophy regarding open standards?
A: Tellme believes that customers are tired of proprietary solutions and is committed to providing best-of-breed technology through open standard protocols and interfaces. This includes not just VoiceXML, but other standards as well, such as HTTP, SSL, JavaScript, cookies, and audio formats such as WAV.

Not only is Tellme building products that comply with appropriate industry standards, but Tellme is also a leader in the voice application space and is driving these standards based on its experience building and running the first commercial service based on VoiceXML.

Q: Where can I get additional information on VoiceXML?
A: Here are some links for additional information about VoiceXML:
  • The VoiceXML Forum: http://www.voicexml.org/
  • The W3C Voice Browser Activity: http://www.w3.org/Voice/

Q: How do I tell the application what spoken input to expect?
A: The voice application uses a grammar to define what utterances are legal for the caller to say at a particular point in the application. (An utterance is speech input before it has been recognized by the voice recognizer as a specific response.) A grammar represents the set of accepted inputs via a list of regular expressions. A grammar representing the possible answers to the question "How do you travel from home to work?" might specify the possible utterances like this in SRGS/GRXML format:
           <one-of>
             <item>
               <item repeat="0-1">by</item>
               <one-of>
                 <item>car</item>
                 <item>auto</item> 
                 <item>bus</item> 
               </one-of>
             </item>
             <item>
               <item repeat="0-1">in</item>
               <item repeat="0-1">the</item>
               <one-of>
                 <item>subway</item>
                 <item>underground</item> 
                 <item>tube</item> 
               </one-of>
             </item>
           </one-of> 
The VoiceXML language lets the developer specify which grammars are in force at any point in the application via a grammar scoping capability. Grammars are included into a VoiceXML file either in-line, or through references to external grammar files.

The Tellme Platform currently supports the SRGS/GRXML grammar format, with legacy support for the GSL grammar format.
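
For reference, a sketch of both inclusion styles inside a field (the external file name is hypothetical):
    <!-- Reference an external grammar file... -->
    <field name="travel_mode">
      <prompt>How do you travel from home to work?</prompt>
      <grammar src="travel.grxml" type="application/srgs+xml"/>
    </field>

    <!-- ...or define the grammar in-line in the VoiceXML page. -->
    <field name="answer">
      <prompt>Shall I continue?</prompt>
      <grammar xmlns="http://www.w3.org/2001/06/grammar" root="yn"
               type="application/srgs+xml" version="1.0">
        <rule id="yn">
          <one-of>
            <item>yes</item>
            <item>no</item>
          </one-of>
        </rule>
      </grammar>
    </field>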

Q: Does Tellme provide any "pre-built" grammars?
A: Yes. The Tellme Platform provides access to many pre-built grammars for inputs that are commonly used, difficult to create, or in need of constant maintenance. These include:
  • General: Yes/No
  • Credit Cards: Expiration date, Expiration month, Expiration Year, Credit Card Number
  • Date/Time: Day of month, Day of year, Month, Year, Date, AM/PM, TimeDuration (in minutes or days), Hour, Time
  • Financial: US Dollars (no cents), US Money (dollars and cents)
  • Locations: City/State
  • Numbers: Digits, Natural numbers, Percentages, Social-Security Numbers
  • Telephone: US Phone number, Phone extension, Area Code, 7-digit phone, 10-digit phone
The Studio Grammar Library contains the full list of grammars and their descriptions.
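
In standard VoiceXML, the most common of these cases can also be reached through built-in field types; a minimal sketch (how Tellme's library maps onto these built-in types is an assumption; the Studio Grammar Library is the authoritative list):
    <form>
      <!-- "boolean" engages a built-in yes/no grammar; "digits" accepts a spoken digit string. -->
      <field name="confirm" type="boolean">
        <prompt>Would you like to hear your balance?</prompt>
      </field>
      <field name="account" type="digits">
        <prompt>Please say your account number.</prompt>
      </field>
    </form>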

Q: What is grammar "tuning"?
A: A voice application, like any user-centric application, is prone to certain problems that may only be discovered through formal usability testing, or observation of the application in use. Poor speech recognition accuracy is one type of problem common to voice applications, and a problem most often caused by poor grammar implementation. When users mispronounce words or say things unexpected by the grammar designer, the recognizer cannot match their input against the grammar. Poorly designed grammars containing many difficult-to-distinguish entries will also result in many misrecognized inputs.

Grammar tuning is the process of improving recognition accuracy by modifying a grammar based on an analysis of its performance. Tuning is often performed during an iterative process of usability testing and application improvement and may involve amending the grammar with commonly spoken phrases, removing highly confusable words, and adding additional ways that callers may pronounce a word.
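
As an illustration, tuning might extend the commute grammar shown earlier with variants observed in call logs (the added items are invented examples):
    <one-of>
      <item>car</item>
      <item>auto</item>
      <item>bus</item>
      <!-- Added during tuning: phrasings callers actually used. -->
      <item>automobile</item>
      <item>my car</item>
      <item>drive</item>
    </one-of>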

Q: How do you handle speakers with foreign accents?
A: Tellme's work on 1-800-555-TELL, which naturally is designed to be usable by the widest range of speakers, has shown that the Microsoft speech recognition engine employed by the Tellme Platform handles callers with strong foreign accents very well.

Q: Does Tellme support recognition of languages other than English?
A: Not currently. For now, Tellme is focused on maximizing reliability of English recognition over the phone. Adding a new language is a non-trivial task. Even if a model exists which lets a recognizer understand a native speaker under ideal acoustic conditions, making that same model work reliably when confronted with the noisy audio environment of the phone is very difficult.

Q: What are the limits on grammar size?
A: Though the speech recognizer doesn't place a specific limit on grammar size, several practical considerations serve to effectively limit the maximum grammar size to tens of thousands of entries. First, as the grammar grows in size, the recognizer must test an utterance against a larger number of possibilities, slowing the recognizer considerably. Second, the larger the grammar, the greater the possibility that it contains words that are easily confusable. This lowers the accuracy of the grammar. However, with careful planning, it is possible to create large grammars that are still effective. Tellme has production grammars with up to thirty thousand entries.

Q: How do I prompt users or deliver information to callers?
A: Audio output is used to prompt callers for input and to deliver information to them. The audio may be specified as a pre-recorded audio file, or as text that is converted to speech on the fly using the Tellme Voice Application Network's text-to-speech (TTS) engine.

Similar to how images are specified in HTML, the audio file is specified by a URL. The Tellme Platform retrieves the audio file before playing it to the user. The program can also specify text to be played to the user (using TTS) in the event that the audio file cannot be retrieved.

Pre-recorded audio is almost always preferred in production applications, as it provides the most natural-sounding interface. Text-to-speech is useful for rapid application prototyping and for generating output from data whose content cannot be known in advance, such as an application that reads a user's e-mail over the phone.
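
In VoiceXML, this pairing is expressed with the audio element; the text content is rendered via TTS only if the file (URL hypothetical) cannot be fetched:
    <prompt>
      <audio src="http://www.example.com/audio/welcome.wav">
        Welcome to the example travel service.
      </audio>
    </prompt>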

Q: What formats do you support for recorded audio?
A: WAV files using PCM, µ-law, or a-law encoding with the following parameters:
  • PCM: 8- or 16-bit, mono or stereo, at 8, 11.025, 16, 22.05, or 44.1 kHz
  • µ-law: 8-bit, mono or stereo, at 8 kHz
  • a-law: 8-bit, mono or stereo, at 8 kHz
Based on business need, we may support other formats in the future.

Q: What are the optimal settings for recorded audio?
A: For optimal playback efficiency, audio should be recorded in 8 kHz, 8-bit µ-law (G.711), mono format. Although the Tellme Platform supports higher-fidelity formats, the current phone network only supports audio at these settings.

Q: What happens if I use non-optimal audio file formats?
A: An application using non-optimal audio file formats will operate correctly, but will waste network resources. Non-optimal audio files are converted on the fly to 8 kHz, 8-bit, mono µ-law as they are played to the user. If a great many non-optimal audio files require conversion at once, the audio servers can become inundated with conversion requests, resulting in poor audio playback. It will also take longer to retrieve the larger, high-resolution audio files over the Internet, wasting Internet bandwidth for both Tellme and the application owner.

Q: What does Tellme use for producing text-to-speech?
A: Tellme uses the AT&T Natural Voices TTS engine.

Q: How good is the text-to-speech engine?
A: The TTS engine can convert most English sentences to understandable speech. It also handles certain special cases, such as text containing dollar amounts. For example, it will read "$3.75" as "three dollars and seventy-five cents."

Though text-to-speech technology has come a long way since the talking computer in "War Games", the output still has a computer-generated quality. Pre-recorded audio, though time-consuming to create, is the only way to give a voice application a natural, human-sounding interface. The widespread use of TTS within an application is not recommended.

Q: What "voices" are available for text-to-speech?
A: The Tellme TTS engine supports only one voice today, an adult female. Tellme may make additional voices available in the future, depending on business need and customer feedback.

Q: In what ways can I alter the TTS output?
A: None of the TTS parameters (talking speed, pitch, voice, etc.) may be modified by voice applications running on today's Tellme Voice Application Network. These parameters may be exposed in the future based on business need and customer/developer feedback.

Q: Does the TTS engine support languages other than English?
A: No. The TTS engine is designed to produce correct output for English words only. If the engine encounters a word it does not know, it makes a best guess at how the word should be pronounced, which produces poor pronunciations for most foreign languages.
