Ensure Your Virtual Agents Speak With Clear Intentions

Artificial intelligence (AI) is transforming customer experience. Using Natural Language Processing (NLP), computers understand consumers’ needs and intentions, read their emotional and linguistic cues, and create personalized responses. Intent is an important word in the world of AI. It’s defined as users’ intention based on their inbound inquiries via text or voice. And being able to accurately understand intent is the foundation of creating a correct response for consumers.

But intent is also an important element in outbound responses. In the absence of a human, when a virtual agent responds via synthesized speech, it’s important that the speech intention is modelled appropriately. Without proper nuance, communication breaks down — and that leads to unhappy customers.

Communicate Intention and Modality Beyond Words

Humans use speech for various purposes — to express a feeling, get something done or perhaps report on events. They speak with purpose. When a human communicates with another human face to face, he or she not only listens to the words but also pays attention to extralinguistic cues, such as facial expressions, body language, intonations, tone of voice, etc. With remote communications, in the absence of facial and body cues, intonations (prosody) and tone of a voice become the primary extralinguistic cues to deliver the underlying intention beyond words. Effectiveness relies on communicating both the meaning of the words and the underlying intention.

When creating human speech artificially, you need to decide whether it is delivered with intention or not. Traditional text-to-speech engines have generally taken the simpler approach: read the text without intention, in a monotone, neutral way. The output might be grammatically correct, but it lacks intention. The recipient can understand the semantic meaning of the sentences, but not the pragmatic meaning.

Semantics refers to the obvious, face value of what was said. Pragmatics is the study of the intention behind the words. The way a message is delivered has everything to do with these distinctions. Modality is the linguistic term for the attitude or intention a sentence conveys beyond its literal content. According to the French linguistic school, modality is said to be “the soul of the sentence.”

Digging deeper into pragmatic terminology, illocutionary force describes what a speaker intends an utterance to do, which helps predict its expected effect on the listener. Examples of illocutionary force include asserting, promising, excommunicating, exclaiming in pain, inquiring and ordering. These designations help classify possible triggers and desired responses.

The pragmatics of a sentence give us tools to convey the specific modality we want. Beyond prosody, pragmatics also refers to external parameters such as speech speed (e.g., slow talking might convey additional respect toward elderly or non-native speakers); pitch (a low tone could convey, in specific circumstances, a threat); and volume (high volume may convey anger or urgency). Even small stresses, strategically placed at the right moments, could reverse the meaning. Consider the sentence, “The package will arrive tomorrow.” If an “eeh” sound is added just before the key word tomorrow (“The package will arrive, eeh, tomorrow”), it conveys that the speaker might doubt that arrival time.
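As a rough illustration, the parameters above (speed, pitch, volume and a hesitation before a key word) can be written in SSML, the W3C markup for speech synthesis. The sketch below is only an assumption about how such markup might be built: the helper names prosody_ssml and hesitant_ssml are hypothetical, the bare speak element omits the version and namespace attributes a full SSML document carries, and nothing here is taken from any particular vendor's engine.

```python
import xml.etree.ElementTree as ET


def prosody_ssml(text, rate=None, pitch=None, volume=None):
    """Wrap text in an SSML <prosody> element (hypothetical helper).

    rate, pitch and volume take SSML labels such as "slow", "low" or "loud".
    """
    speak = ET.Element("speak")
    # Only emit the attributes the caller actually set.
    attrs = {
        name: value
        for name, value in (("rate", rate), ("pitch", pitch), ("volume", volume))
        if value is not None
    }
    prosody = ET.SubElement(speak, "prosody", attrs)
    prosody.text = text
    return ET.tostring(speak, encoding="unicode")


def hesitant_ssml(before, key_word, pause="300ms"):
    """Insert a filler sound and a short pause before the key word."""
    speak = ET.Element("speak")
    speak.text = before + ", eeh, "
    pause_element = ET.SubElement(speak, "break", {"time": pause})
    pause_element.tail = " " + key_word + "."
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    # Slower speech, e.g. for a listener who needs more time.
    print(prosody_ssml("Your appointment is confirmed.", rate="slow"))
    # Hesitation that signals doubt about the arrival time.
    print(hesitant_ssml("The package will arrive", "tomorrow"))
```

Whether a given engine honors every attribute varies, so markup like this is best treated as a request to the synthesizer rather than a guarantee.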

Model Speech Intention for Effective Communication

Modern speech production requires more than just being able to reproduce correct word pronunciation — with the right pauses and the proper intonation. It should be able to translate modality and intent into specific modifications to neutral speech, rendering it more effective and meaningful.
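One way to picture this translation is a lookup from an intent label to adjustments applied on top of neutral settings. The sketch below is purely illustrative: the Prosody fields, the intent labels and every numeric value are assumptions made for the example, not parameters of any real speech engine.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Prosody:
    rate: float = 1.0       # relative speaking rate (1.0 = neutral)
    pitch_st: float = 0.0   # pitch shift in semitones
    volume_db: float = 0.0  # gain in decibels


NEUTRAL = Prosody()

# Hypothetical intent labels and illustrative values only.
INTENT_ADJUSTMENTS = {
    "respectful": {"rate": 0.85},                 # slower delivery
    "urgent": {"rate": 1.1, "volume_db": 3.0},    # faster and louder
    "doubtful": {"rate": 0.95, "pitch_st": 1.0},  # slightly slower, raised pitch
}


def apply_intent(intent, base=NEUTRAL):
    """Return the base prosody with the adjustments for the given intent.

    Unknown intents fall back to the base settings unchanged.
    """
    return replace(base, **INTENT_ADJUSTMENTS.get(intent, {}))


if __name__ == "__main__":
    print(apply_intent("urgent"))
    print(apply_intent("small_talk"))  # no entry, so neutral settings
```

A production system would learn such adjustments from data rather than hard-code them; the table only makes the idea of intent-driven modification concrete.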

Genesys AppFoundry partner Speechmorphing applies prosodic modeling as an integral part of its speech synthesis to model speech intention. Its advanced text-to-speech system captures a wide range of prosodic aspects of a voice and is trained on prosodic details, e.g., how a certain word should be pronounced with respect to breaks, boundary tones, pitch accents and emphasis. The produced voices can be modified and customized downstream with a great level of flexibility. This technology lets you achieve greater sophistication in simulated speech, making it sound more natural. And this results in a more effective IVR with happier customers.

The AI, neural network and prosodic modeling-based speech synthesis technology is capable of producing the most natural conversational dialogues between human and computer. Its custom, branded, contextual and fully customizable voices support the desired personas and communication styles of digital agents and conversational IVR. Now offering seamless integrations with Genesys platforms, as well as industry-leading ASR/NLU/dialog managers, Speechmorphing helps transform customer experience by taking conversational customer care to the next level. For more information on Speechmorphing products, visit its listings available on PureCloud, PureConnect and PureEngage in the AppFoundry Marketplace.

This post was co-authored by Ron Hasson. Ron is a philologist by education and conviction, and Chief Linguist at Speechmorphing, where he oversees the front-end development of Speechmorphing's award-winning speech synthesis product. Ron is the developer of several TTS products, including a Hebrew TTS engine that is installed in most Israeli banks. Ron believes that the future of machine speech lies with neural network techniques, which should be carefully validated by traditional linguistics.