Perfecting the Voice Experience for Conversational Customer Care

Good customer experience requires consistent communication and, at the same time, personalized user interaction that’s relevant to the context and situation. For personalization in today’s customer experience, artificial intelligence (AI) is essential — and text-to-speech plays a crucial part. It should contribute uniquely branded custom voices, personalized call flow with appropriate tones and styles, and natural sentence prosody and intonation for perfect delivery.

But text-to-speech has a tough job. Based on the input text, it does its best to predict both pronunciation and prosody in context. But it is often handed unfamiliar words (proprietary company terms or rare customer names, for example), and use cases vary widely. So, the final results are often perfect for the intended use, but sometimes they’re not: important words might be mispronounced, or the intonation might not convey the desired meaning, style or tone.

Customization Via Markup Languages and Their Limitations

The need to head off such imperfections by giving the TTS engine special instructions explains the popularity of Speech Synthesis Markup Language (SSML) and similar technologies. Tags supplied by these markups (like <emphasis>word</emphasis>) can inform a TTS engine about various aspects of the desired rendering, including pitch, contour, pitch range, speed, duration and volume. However, SSML and similar markup languages are still limited in what they can achieve.
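As a rough illustration of this kind of markup, the sketch below assembles a small SSML fragment with Python's standard library. The helper name build_ssml and the chosen attribute values are assumptions made for this example; the SSML namespace declaration is left out for brevity, and how a given engine renders the tags will vary.

```python
import xml.etree.ElementTree as ET


def build_ssml(before: str, emphasized: str, after: str,
               rate: str = "slow", pitch: str = "high") -> str:
    """Wrap a sentence in <prosody> and stress one word with <emphasis>.

    Hypothetical helper for illustration only; it is not part of any TTS SDK.
    """
    speak = ET.Element("speak", {"version": "1.1", "xml:lang": "en-US"})
    # <prosody> carries sentence-level hints such as speaking rate and pitch.
    prosody = ET.SubElement(speak, "prosody", {"rate": rate, "pitch": pitch})
    prosody.text = before
    # <emphasis> marks the single word that should stand out.
    emphasis = ET.SubElement(prosody, "emphasis", {"level": "strong"})
    emphasis.text = emphasized
    emphasis.tail = after
    # ElementTree escapes characters such as & and < in the text for us.
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml("Welcome back, ", "Mark", "!"))
```

Note that the tags only say "slow", "high" or "strong". They do not say where the pitch should rise or fall within the phrase, which is the limitation discussed next.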

The first weakness relates to control over prosody and intonation, the expressive aspects of human speech. In most TTS systems driven by SSML and similar languages, prosody control remains inexact: markup tags can’t indicate the specific prosodic elements that linguists use to describe the melody of a phrase or sentence, namely the precise pitch movements affecting words and larger segments and the degrees of pause between them.

The second weakness is inflexibility: markup applies only clumsily to variations of a text message. TTS customers frequently have to render sentences in which some elements are fixed while others are variable, with the variable values supplied at runtime. For example, in “Welcome back, Mark!” the welcoming part is the same for everyone, but the customer’s name has to be inserted while retaining the exclamatory prosody. When such partly variable responses are handled by stitching segments together programmatically, the results often sound jerky and unnatural. And as the variables multiply in longer sentences, the prosody quality decreases while the production hassle and expense increase.

Customization Beyond Markup Languages

Genesys partner Speechmorphing tackles this with its TemplateEditor tool, in which texts to be synthesized can have any number of variable fields: “Welcome back, <name>! Wow, I see that you now have <number> credits. Congrats!” The fields are filled at synthesis time and retain the tone (here, excited) and prosody of a sample synthesis containing default values. The template’s prosody can be customized and fine-tuned for maximum effect thanks to explicit prosodic modeling. By applying prosodic modeling as an integral part of its speech synthesis, Speechmorphing’s text-to-speech system captures a wide range of prosodic aspects of a voice. Voices trained with this prosodic detail can be modified and customized downstream with a great deal of flexibility. In fact, many templates are created purely to enable this fine-tuning, even if no variables are required.
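The text side of such a template can be sketched in a few lines of Python. This is only an illustration of filling named fields at runtime with fallback to default values; the names fill_template, TEMPLATE and DEFAULTS are hypothetical, the default values are invented, and nothing here reflects the actual TemplateEditor interface or how it carries prosody over to the filled values.

```python
import re
from typing import Mapping

# Template text taken from the example above; <name> and <number> are fields.
TEMPLATE = ("Welcome back, <name>! Wow, I see that you now have "
            "<number> credits. Congrats!")

# Assumed default values, standing in for those of the sample synthesis.
DEFAULTS = {"name": "Mark", "number": "100"}

FIELD = re.compile(r"<(\w+)>")


def fill_template(template: str,
                  values: Mapping[str, str],
                  defaults: Mapping[str, str]) -> str:
    """Replace each <field> with its runtime value, else its default."""
    def substitute(match: "re.Match[str]") -> str:
        field = match.group(1)
        if field in values:
            return str(values[field])
        if field in defaults:
            return str(defaults[field])
        raise KeyError(f"no value or default for field <{field}>")

    return FIELD.sub(substitute, template)


if __name__ == "__main__":
    print(fill_template(TEMPLATE, {"name": "Ana", "number": "250"}, DEFAULTS))
    print(fill_template(TEMPLATE, {}, DEFAULTS))
```

Plain substitution like this produces the right words but says nothing about how they should sound, which is why the template approach pairs the fields with a prosodic model rather than with string stitching alone.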

An editing session allows detailed modification of each word in the template. Several aspects of each word can be manipulated: Phonemes, Phrase Break, Boundary Tone, Focus, Tone and more. You can declare any word position as a named variable Field. If multiword fillers are supplied at runtime, the system will automatically adjust the prosody.
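One way to picture the per-word settings is as a small record attached to each word in the template. The sketch below is an assumption-laden illustration: the class WordSpec, its attribute values and the fill_fields rule are all invented for this example and are not Speechmorphing's data model. In particular, the rule that only the last word of a multiword filler keeps the phrase break and boundary tone is a naive stand-in for the automatic prosody adjustment described above.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


@dataclass(frozen=True)
class WordSpec:
    """Hypothetical per-word settings, mirroring the aspects listed above."""
    text: str
    phonemes: Optional[str] = None       # custom pronunciation, if any
    phrase_break: Optional[str] = None   # pause strength after the word
    boundary_tone: Optional[str] = None  # pitch movement at the phrase edge
    focus: bool = False                  # whether the word is highlighted
    tone: Optional[str] = None           # overall tone, e.g. "excited"
    field_name: Optional[str] = None     # set when the word is a variable Field


def fill_fields(words: List[WordSpec], values: Dict[str, str]) -> List[WordSpec]:
    """Replace Field words with runtime fillers (naive placeholder rule)."""
    filled: List[WordSpec] = []
    for word in words:
        parts = values.get(word.field_name, "").split() if word.field_name else []
        if not parts:
            filled.append(word)  # fixed word, or Field left at its default
            continue
        for i, part in enumerate(parts):
            is_last = i == len(parts) - 1
            # Assumption: focus and tone carry over to every filler word,
            # while break and boundary tone stay on the final word only.
            filled.append(replace(
                word,
                text=part,
                phonemes=None,
                phrase_break=word.phrase_break if is_last else None,
                boundary_tone=word.boundary_tone if is_last else None,
            ))
    return filled


template = [
    WordSpec("Welcome", tone="excited"),
    WordSpec("back", tone="excited"),
    WordSpec("Mark", tone="excited", focus=True, phrase_break="major",
             boundary_tone="rise", field_name="name"),
]

if __name__ == "__main__":
    for w in fill_fields(template, {"name": "Mary Ann"}):
        print(w)
```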

Any number of partly fixed and partly variable use cases suggest themselves for this sort of template facility: fast-food ordering, follow-up phone calls for sales or other businesses, customer service in call centers, and many more.

Customization for Branding and Contextual Customer Experience

Bringing voices under increased customer control and adding flexibility are vital, but they aren’t the whole story. Speechmorphing offers two additional ways to personalize and customize synthetic voices:

  • Custom-made voices: Voices that align with an established company brand and tone of voice. These can be created with only minutes of audio materials and only days of turnaround — even for high-resemblance sound-alike voices, or for custom voices carried over to other languages, with or without accents.
  • Expressive variants: These bespoke voices are expressive. Variants can be quickly produced with a wide range of tones and styles to support lively and context-appropriate user interactions.

Speechmorphing is available as a Premium Application of the Genesys Cloud™ platform. It’s also seamlessly integrated with the Genesys Voice Platform and industry-leading conversational AI platforms. Speechmorphing helps transform the customer experience by taking conversational customer care to the next level.

For more information on Speechmorphing, sign up for its upcoming webinar on October 7th and visit its listings, available for the Genesys Cloud™, PureConnect and Genesys Multicloud CX™ products in the AppFoundry Marketplace.

This blog was co-authored by Dr. Mark Seligman, Chief Linguist at Speech Morphing, Inc., a Natural Language Speech Synthesis company aiming to improve human-machine communication. Mark is also Founder of Spoken Translation, Inc. In 1998, he organized the first speech translation system demonstrating broad coverage with acceptable quality. Mark serves on the Advisory Board of TAUS (the Translation Automation Users Society) and on the Steering Committee of IWSLT (the International Workshop for Spoken Language Translation). Mark regularly publishes on speech and cognitive science topics.