Adding Voice Personalization to Self-Service IVR

Personalizing self-service IVR is one of the leading digital transformation strategies for customer experience. Thanks to advances in artificial intelligence (AI), machine learning, natural language understanding and speech recognition, we can understand a caller’s needs and intent more quickly and accurately, and personalize our responses accordingly. As self-service grows and conversational IVR is adopted more widely, use of text-to-speech is also rising. It brings its own challenges, chief among them high consumer expectations: consumers expect human-like, clear and personable communication.

A Good Digital Voice for Effective Communication

Effective communication doesn’t just relay the meaning of words; it conveys the intention behind them. In face-to-face communication, people use facial expressions, body language and voice intonation as cues to detect that intention. In a self-service IVR environment, where facial and body cues are absent, the prosody and tone of the IVR or bot voice become the primary extralinguistic cues for conveying intention and delivering the message clearly and unambiguously. Prosody covers the expressive aspects of human speech, including intonation, stress, cadence and rhythm.

Digital voices are usually evaluated on four criteria: naturalness; intelligibility, or clarity of speech; sound quality; and ease of understanding, which refers to how much effort it takes to understand a sentence from its words. Clarity of speech and ease of understanding are obviously very important. But what makes a voice more natural-sounding and intelligible is prosody.

Listeners cannot really separate the words of a sentence from how it sounds (its prosody). For instance, if you make a statement that sounds like a question, the listener will at best conclude that you are uncertain or that you doubt yourself. Another example is the question, “Did you read him his rights?” If you emphasize “you,” the sentence takes one meaning. If you emphasize “him,” it takes another. If neither word is emphasized, it takes a third.
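The emphasis example can be written down in SSML, the W3C markup that many text-to-speech engines accept. The sketch below is illustrative only: it assumes an engine that honors the standard `<emphasis>` element, and the `emphasize` helper is a hypothetical name, not part of any Speechmorphing or Genesys API.

```python
import re
from xml.sax.saxutils import escape


def emphasize(sentence: str, target: str) -> str:
    """Return SSML with the first whole-word match of `target` emphasized.

    Hypothetical helper for illustration; assumes the TTS engine
    supports the standard SSML <emphasis> element.
    """
    safe = escape(sentence)  # escape &, <, > so the result stays valid XML
    pattern = rf"\b{re.escape(escape(target))}\b"
    marked = re.sub(
        pattern,
        lambda m: f'<emphasis level="strong">{m.group(0)}</emphasis>',
        safe,
        count=1,  # only the first occurrence is emphasized
    )
    return f"<speak>{marked}</speak>"


question = "Did you read him his rights?"
print(emphasize(question, "you"))  # stress on who did the reading
print(emphasize(question, "him"))  # stress on whose rights were read
```

The words are identical in both outputs; only the prosodic markup differs, which is the point of the example.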

In addition, you can’t deliver a sad set of words with a happy tone, or vice versa, because it’ll sound awkward and insincere. Obviously, prosody is extremely important and can even change the meaning of a sentence.

Most current text-to-speech systems don’t explicitly model prosody. As a result, they often fail to communicate concisely and don’t fully engage consumers. In many cases, they produce monotonous-sounding speech that can seem very stilted.

Genesys AppFoundry partner Speechmorphing applies prosodic modeling as an integral part of its speech synthesis technology. Combining prosodic modeling with AI, machine learning, deep neural networks and advanced speech technology, the company has developed a text-to-speech system whose synthesized voices emulate human-to-human interactions.

Make Your Digital Voice Branded and Customizable

Personalized voice comes in two flavors: a custom branded voice that’s unique to your organization and conveys a desired persona; and a voice that’s customizable for personalized IVR prompts and call flows based on caller information with situation-appropriate voice styles and tones. A simple example could be the ability to speak slower in response to people exhibiting certain cues.
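As a rough illustration of the "speak slower" case, the sketch below picks a speaking rate from caller information and wraps the prompt in the standard SSML `<prosody>` element. The `CallerProfile` type and its `needs_slower_speech` flag are assumptions made for this example; how such a cue is detected, and whether a given engine accepts SSML, depends on the platform.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class CallerProfile:
    """Hypothetical caller information available to the call flow."""
    name: str
    needs_slower_speech: bool = False  # assumed cue, e.g. repeated "please repeat" requests


def build_prompt(text: str, caller: CallerProfile) -> str:
    """Wrap a prompt in SSML, slowing the rate when the caller profile asks for it."""
    # "slow" and "medium" are standard SSML values for the prosody rate attribute.
    rate = "slow" if caller.needs_slower_speech else "medium"
    return f'<speak><prosody rate="{rate}">{escape(text)}</prosody></speak>'


print(build_prompt("Your appointment is at 3 PM.", CallerProfile("Ana", True)))
print(build_prompt("Your appointment is at 3 PM.", CallerProfile("Ben")))
```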

But not all digital voices are customizable. In fact, most are not. Text-to-speech systems without prosodic modeling are trained without the intelligence of prosodic details, such as how a certain word should be pronounced with respect to breaks, boundary tones, pitch accents and emphasis. These systems can’t customize produced voices downstream. In contrast, text-to-speech systems that model prosody capture a wide range of prosodic aspects of a voice; they also can control and modify the output with a great level of flexibility later. And that results in highly customizable, natural-sounding voices.

TemplateEditor™ is one of several tools that Speechmorphing has designed to let users control and customize the voice output in detail. The tool starts from one sentence (or one template) with default prosody. The user then customizes it through an intuitive graphical user interface, which includes the ability to define variable fields that are filled in at runtime. This is extremely useful for creating self-service IVR prompts and removes the need for stitching together separately recorded audio segments.
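The idea of a prompt template with variable fields filled at runtime can be sketched in a few lines. This is not TemplateEditor's format or API, which the article does not describe; it is a generic illustration using Python's `string.Template` and standard SSML, and the field names `name` and `balance` are made up for the example.

```python
from string import Template
from xml.sax.saxutils import escape

# One sentence with fixed prosody markup and two variable fields.
# The markup stands in for prosody that would be tuned once, at design time.
BALANCE_PROMPT = Template(
    '<speak>Hello $name, your balance is '
    '<prosody rate="slow">$balance</prosody>.</speak>'
)


def render(template: Template, **fields: object) -> str:
    """Fill the variable fields at runtime, escaping values for XML safety."""
    safe = {key: escape(str(value)) for key, value in fields.items()}
    return template.substitute(safe)  # raises KeyError if a field is missing


print(render(BALANCE_PROMPT, name="Ana", balance="42 dollars"))
```

Because the whole sentence is synthesized in one pass, the variable parts share the intonation of the surrounding words instead of being spliced in as separate recordings.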

With respect to branded voices, very few companies can afford this luxury. Generating a new voice quickly and inexpensively has long been a challenge for the text-to-speech industry. Developing a new voice can take up to six months and demands a large budget: weeks of recording, followed by months of processing the recorded material and training the voice.

The Speechmorphing deep neural network-based voice generation process can produce a new, high-quality voice from only 30 minutes of recorded speech, whether taken from existing IVR recordings or newly recorded. Because it requires less recorded material, the subsequent production work is also reduced. As a result, a new voice can be produced within days, at a fraction of the cost of conventional custom voice services.

Speechmorphing has a unique ability to create high-quality and expressive synthesized voices with a small amount of speech data. This makes custom voices a true option for companies and applications. And it addresses the growing need of companies that want personalized, branded voices for their self-service IVRs and digital agents. The added expressiveness of the voice can align with the desired persona and communication styles of the digital agent.

Visit Speechmorphing at the Genesys AppFoundry Marketplace to see how its solution complements existing Genesys personalized self-service IVR solutions by introducing a new way of speech synthesis, aiming to improve customer engagement and preserve corporate brands.

This post was co-authored by Shing Pan, Vice President of Marketing and Business Development at Speechmorphing. Shing leads marketing and business development at Speechmorphing, a personalized speech technology company aiming to improve human-machine communications. As a serial entrepreneur and an experienced marketer, Shing’s expertise includes developing, positioning and growing new products and businesses. Shing is fascinated by the converging trends of conversational AI, human-computer interaction, and customer experience.