
Welcome to the future. Say goodbye
to the high-pitched robotic voices of the past.
Today’s new breed of speech recognition and synthesis technologies is
being widely adopted on the market. Integrated with your
information system, these technologies become remarkably powerful
customer service tools.
Function
Speech synthesis
Speech synthesis, today referred to as Text-to-Speech (TTS), converts
normal language text into speech. The text can come from a database, so
the spoken message can be tailored to the person hearing it.
Speech synthesis is now widely used in dynamic voice portals to
communicate a wide range of information, including account balances,
flight times and delivery times.
Speech recognition
Speech recognition, now referred to as Automatic Speech Recognition
(ASR), is a technology that allows a computer to identify the words
that a person speaks into a microphone or telephone. Coupled with
speech synthesis, voice command, voice identification and
comprehension, ASR forms the ideal man-machine interface, comfortably
handling 10 times more information than a keyboard entry system.
Though speech synthesis (TTS) doesn’t necessarily
require speech recognition (ASR), the reverse is not true. In a
man-machine conversation with speech recognition, complementary
synthesis resources are indispensable.
Vocal 2.0 is today’s byword for the outstanding performance provided by this technology.
Key advantages
- Navigation by voice, no need for a keyboard
- Fast: connections are made much more quickly
- Innovative customer service interactions
- Information is easily accessed using natural language
- Calls are routed to the best qualified agents
- Highly efficient: much more information is gathered
- Modern and flexible: intelligent man-machine language for a more natural navigation experience
How does a speech recognition project take shape?
There are 4 main phases in a speech recognition project:
1. The existing voice system (IVR, call center) is audited, with analysis for potential integration with the we
Voice ergonomics, preparation of scenario, detailed specification of service, scenario for transition to DTMF.
This phase represents 15% of the project development time.
2. Advice on voice ergonomics
The various dialogs are described, integrating natural language into the caller/IVR dialog.
In the past, a big mistake consisted in simply replacing a button
press with a keyword, as in « say BALANCE or press button 23 (assuming
you can find it!) to access your account balance ».
This phase represents 35% of the project development time.
3. Development and Prototype
Here, we create the VXML pages and grammars, scenarios and confirmation audio messages.
This phase represents 15% of the project development time.
4. Analysis and Tuning
Real caller conversations with the IVR are listened to in order to
improve the phonetic conversion performed by the speech recognition
engine. During this phase, the speech recognition rate is improved to
achieve a success rate of up to 80%.
This phase represents 20% of the project development time.
The core of a natural language voice application is the grammar.
The grammar is used by the speech recognition engine to account for the
different ways of saying or pronouncing a word.
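As a rough illustration of what a grammar does, the sketch below maps several ways of saying the same thing to a single meaning. It is a simplified Python stand-in, not the grammar format used with VXML pages or by any particular engine; the `GRAMMAR` table and the `normalize` and `interpret` functions are hypothetical names chosen for this example.

```python
import re

# Hypothetical grammar: each meaning lists the phrasings a caller might use.
GRAMMAR = {
    "BALANCE": ["balance", "account balance", "how much do i have"],
    "CHANGE_TICKET": ["modify my ticket", "change my ticket"],
    "YES": ["yes", "yeah", "that's right"],
    "NO": ["no", "nope"],
}


def normalize(utterance: str) -> str:
    """Lowercase the text and keep only letters, apostrophes and single spaces."""
    cleaned = re.sub(r"[^a-z' ]+", " ", utterance.lower())
    return " ".join(cleaned.split())


def interpret(utterance: str):
    """Return the meaning whose phrasing appears in the utterance, or None."""
    text = f" {normalize(utterance)} "
    # Try longer phrasings first so that "modify my ticket" wins over "no".
    candidates = sorted(
        ((phrase, meaning)
         for meaning, phrases in GRAMMAR.items()
         for phrase in phrases),
        key=lambda pair: len(pair[0]),
        reverse=True,
    )
    for phrase, meaning in candidates:
        # Surrounding spaces force whole-word matches ("know" is not "no").
        if f" {phrase} " in text:
            return meaning
    return None
```

With this table, `interpret("No, I would like to modify my ticket please")` resolves to `CHANGE_TICKET` rather than `NO`, because longer phrasings are checked first. A production grammar would be written in the format the recognition engine expects and tuned against real calls.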
Typical Dialog
In a modern Contact Center, a dialog may sound something like this:
The customer calls a travel agency.
- [IVR]: « Hello, welcome to SibiloTour » --> pre-pickup with fixed audio prompt
- [IVR]: « How can I help you? » [TTS] « Eric, are you calling to know if the plane ticket you ordered yesterday was sent? »
- [Customer]: « Yes » --> [ASR processing]
- [TTS] « It was sent in the post yesterday to the following address: 12 rue Victor Hugo, in Paris, 15th arrondissement »
- [TTS] « You will be spending two nights in Paris. You haven’t reserved a hotel. Would you like Mathilde, or another available agent, to make a reservation for you? »
- [Customer]: « No, I would like to modify my ticket please » --> [ASR processing]
- [TTS] « You wish to change your ticket, is that right? »
- [Customer]: « Yes » --> [ASR processing]
- [TTS] « Please hold, I’ll connect you with Mathilde in less than 1 minute. While you’re waiting, would you like to listen to some music, or the news? »
- [Customer]: « Jazz, please » --> [ASR processing]
Technology used: MRCP
Sibilo Voice, App-line’s IVR, seamlessly interfaces with today’s leading voice synthesis and recognition engines.
How it works
Text To Speech (TTS): The voice server sends a text (a list of words or
a phrase) to the speech synthesis server, which in turn sends an audio
stream to the customer’s telephone. With today’s increasingly
sophisticated speech synthesis engines, it is possible to change voice
gender (male or female), intonation, or talking speed. The voice can
also be mixed with music.
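One common way to ask a synthesis engine for a given voice gender or talking speed is SSML, the W3C speech synthesis markup language. The sketch below builds such a request in Python. It is only an illustration: the source does not say which markup Sibilo Voice sends, `build_ssml` is a hypothetical helper, and whether a given engine honors the `gender` and `rate` attributes is an assumption to verify against that engine's documentation.

```python
import xml.etree.ElementTree as ET

# Values defined by the SSML specification for these two attributes.
ALLOWED_GENDERS = {"male", "female", "neutral"}
ALLOWED_RATES = {"x-slow", "slow", "medium", "fast", "x-fast"}


def build_ssml(text: str, gender: str = "female",
               rate: str = "medium", lang: str = "en-US") -> str:
    """Wrap plain text in SSML that requests a voice gender and speaking rate."""
    if gender not in ALLOWED_GENDERS:
        raise ValueError(f"unsupported gender: {gender!r}")
    if rate not in ALLOWED_RATES:
        raise ValueError(f"unsupported rate: {rate!r}")

    speak = ET.Element("speak", {
        "version": "1.0",
        "xmlns": "http://www.w3.org/2001/10/synthesis",
        "xml:lang": lang,
    })
    voice = ET.SubElement(speak, "voice", {"gender": gender})
    prosody = ET.SubElement(voice, "prosody", {"rate": rate})
    prosody.text = text  # ElementTree escapes &, < and > when serializing
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml("Your ticket was sent yesterday.", gender="male", rate="slow"))
```

The text itself would typically come from the information system (an account balance, a flight time), with only the voice settings chosen by the scenario designer.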

Speech recognition (ASR): The principle is reversed. The voice server
sends the caller’s audio to the speech recognition engine, along with a
grammar of words to be recognized. In return, the ASR engine tells the IVR
whether or not the person has pronounced one of the words in the grammar.
It also quantifies the confidence with which a word, for example “Boat”, was recognized.
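A minimal sketch of how an IVR scenario might act on that answer is shown below. The `RecognitionResult` type, the `next_action` function and the two threshold values are assumptions made for this example; they are not taken from Sibilo Voice or from any specific ASR engine, and real thresholds would be set during the tuning phase.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecognitionResult:
    word: Optional[str]   # grammar word that was matched, or None if no match
    confidence: float     # 0.0 (no confidence) to 1.0 (certain)


def next_action(result: RecognitionResult,
                accept_at: float = 0.8,
                confirm_at: float = 0.5) -> str:
    """Decide what the dialog does next, based on the recognition confidence.

    The thresholds are illustrative values only.
    """
    if result.word is None or result.confidence < confirm_at:
        return "reprompt"   # nothing usable was heard: ask again
    if result.confidence < accept_at:
        return "confirm"    # e.g. "You wish to change your ticket, is that right?"
    return "accept"         # confident enough to move on


if __name__ == "__main__":
    print(next_action(RecognitionResult("boat", 0.65)))
```

The middle band is what produces confirmation prompts in a dialog: the engine heard something plausible, but not clearly enough to act on it without checking.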
Not too many years ago, TTS and ASR servers used with IVRs needed to be
combined into a single product. This made speech scenario development
much more complex and time consuming. Today, the IVRs and the TTS and
ASR servers run on separate machines and communicate over the MRCP protocol.
The end result: development time is now divided by more than 10.
Using the MRCP protocol, Sibilo Voice operates with most of the speech
synthesis and recognition engines available today. The 4 industry
leaders are active in Europe and their engines run with App-line’s IVR.
Certain brands have both a speech synthesis and recognition engine,
while others only provide one of the 2 products.
| Compliant with Sibilo Voice | TTS | ASR |
|  | √ |  |
|  | √ | √ |
|  | √ | √ |
|  | √ |  |
|  |  | √ |