Text-to-Speech: Rising Expectations in the Neural Era

HAL 9000 has inspired many since 1968. The idea of a communicating computer, functioning as a member of a spacecraft’s crew, was fascinating. And now HAL’s grandchildren are among us: Siri, Google, Cortana and many more. The grandkids aren’t quite as smart as HAL yet, but they’re real and doing real work. And now it’s up to us to make them smarter and more capable — and to make them earn their pay.

Present-day virtual assistants are composed of multiple components, and no one organization can put them to work single-handed. It’s critical to deliver more human-like, expressive and branded text-to-speech (TTS), aka speech synthesis, the part that turns text into talk. Despite the importance of TTS and its considerable improvement since HAL, more can be done to make synthetic speech more natural and human-like with the goal of improving voice experiences for self-service.

The Impact of Neural Net Training

A virtual assistant has three major components: automatic speech recognition (ASR), which turns the user’s speech into text; natural language understanding (NLU), which analyzes that text, explicitly or implicitly, and determines what the system’s response should be; and text-to-speech, which turns the response text into voice.
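To make the three-stage flow concrete, here is a minimal sketch in Python. It is an illustration only: the `Assistant` class and the `stub_*` functions are hypothetical names invented for this example, and each stub stands in for what would be a large trained model in a real product.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Assistant:
    """Hypothetical wiring of the three stages: ASR -> NLU -> TTS."""
    asr: Callable[[bytes], str]   # audio in  -> recognized text
    nlu: Callable[[str], str]     # user text -> response text
    tts: Callable[[str], bytes]   # response text -> audio out

    def respond(self, audio_in: bytes) -> bytes:
        text = self.asr(audio_in)   # 1. speech to text
        reply = self.nlu(text)      # 2. decide what to say
        return self.tts(reply)      # 3. text to speech


# Stand-in stages (assumptions for illustration): "audio" is just UTF-8 bytes.
def stub_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def stub_nlu(text: str) -> str:
    if "balance" in text.lower():
        return "Your balance is available in the app."
    return "Sorry, I did not catch that."


def stub_tts(text: str) -> bytes:
    return text.encode("utf-8")


assistant = Assistant(asr=stub_asr, nlu=stub_nlu, tts=stub_tts)
print(assistant.respond(b"What is my balance?"))
```

Because each stage is passed in as a plain function, any one of them (the TTS voice, for instance) could be swapped without touching the other two, which is roughly why different vendors can supply different parts of an assistant.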

Over the last two decades, all three components have taken part in a technological tectonic shift. Formerly, human programmers wrote them by hand. Today, while many handmade programs persist in supporting roles, the stars of the show for all three major components of virtual assistant systems are programs that learn from examples. Most of these are now neural: the progress from input to desired output is handled by a network of virtual if-then wires whose transmission strengths are set automatically, then reset again and again until they yield the right results.
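The idea of connection strengths being set and reset until the outputs come out right can be shown with a toy example. The sketch below trains a single artificial neuron with the classic perceptron rule to reproduce a logical AND. It is a deliberately tiny stand-in chosen for illustration: the function names are made up for this example, and production ASR, NLU and TTS models use vastly larger networks and different training procedures.

```python
def train_perceptron(examples, epochs=20, lr=1.0):
    """Learn two connection strengths and a bias from labeled examples."""
    w = [0.0, 0.0]  # "transmission strengths" of the two input wires
    b = 0.0         # bias: shifts the firing threshold
    for _ in range(epochs):
        for (x1, x2), target in examples:
            fired = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            error = target - fired  # 0 when the output is already right
            # Nudge each strength in the direction that reduces the error.
            w[0] += lr * error * x1
            w[1] += lr * error * x2
            b += lr * error
    return w, b


def predict(w, b, x1, x2):
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0


# Training examples: the neuron should fire only when both inputs are on.
AND_EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(AND_EXAMPLES)
for (a, c), _ in AND_EXAMPLES:
    print(a, c, "->", predict(weights, bias, a, c))
```

Nobody writes the rule "fire only if both inputs are on" into the program. The neuron arrives at suitable strengths purely by being corrected on examples, which is the shift from handwritten to learned behavior described above.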

Neural processing is hard to beat. It can learn without explicit instruction; it can learn multi-level abstractions while taking very broad context into account; and, as a bonus, it can deliver lightning-fast outputs. Its use has brought dramatic improvements to the speech recognition and natural language understanding components of virtual assistants. But while neural methods have recently been used in their text-to-speech components as well, until now, the improvements in TTS could be judged as less dramatic than those in ASR and NLU.

Text-to-speech results, after all, are measured mainly in terms of sound quality within a specific use case. That quality reached an acceptable level years ago for one-way spoken systems speaking in a neutral style. But today’s conversational agents raise the bar. It’s no longer enough that a machine can speak. We now expect speech with near-human naturalness and expressiveness.

From Spoken System to Conversational Agent

As virtual assistants are coming into their own, the world of human-computer interaction is moving beyond highly structured and narrowly focused dialogs toward free-flowing and wide-ranging conversations. With that change, the TTS lag is becoming much more apparent and problematic. The importance of human-like synthetic speech is magnified. The neutral speaking style – the even tone that HAL maintained even while undergoing extreme termination – will no longer suffice.

Going forward, to earn their keep, systems must be not only competent and informative but engaging and well-adapted to their particular tasks. And custom, made-to-order voices must also be available quickly – delivered when they’re required, not months later as most still are.

Genesys AppFoundry partner Speechmorphing emphasizes three crucial areas of TTS development.

  1. Custom text-to-speech voices that align with an established company brand and tone of voice, created with only minutes of audio materials and only days of turnaround.
  2. Expressive voices with varied tones and styles to support lively and context-appropriate user interactions.
  3. Dialogue customization and voice tuning capabilities so that, if the out-of-the-box performance of a vocal segment is not yet precisely what’s needed for the conversation at hand, it can be modified (intuitively but in detail, as a director guides an actor) until it’s flawless.

Speechmorphing is available as a PremiumApp of the Genesys Cloud™ platform. It’s also seamlessly integrated with the Genesys Voice Platform and industry-leading conversational artificial intelligence platforms. Speechmorphing helps transform the customer experience by taking conversational customer care to the next level. For more information on Speechmorphing, visit its listings, available on the Genesys Cloud, PureConnect and Genesys Multicloud CX™ products in the AppFoundry Marketplace.

This blog was co-authored by Dr. Mark Seligman, Chief Linguist at Speech Morphing, Inc., a Natural Language Speech Synthesis company aiming to improve human-machine communication. Mark is also Founder of Spoken Translation, Inc. In 1998, he organized the first speech translation system demonstrating broad coverage with acceptable quality. Mark serves on the Advisory Board of TAUS (the Translation Automation Users Society) and on the Steering Committee of IWSLT (the International Workshop for Spoken Language Translation). Mark regularly publishes on speech and cognitive science topics.