Speaking Up for Conversation Design

October 21, 2019

Voice-first user experiences are now ubiquitous. Smart-speaker sales are up, as is their usage. Older adults are benefiting from the use of voice assistants. However, a large percentage of users still find the experience of talking to voice assistants unnatural. So how can we make voice experiences better?

While understanding conversation-design best practices is necessary, it’s not enough to make these conversations feel natural. To take your voice user interface (VUI) to the next level, you must add polish in the form of pacing, sound effects, and diverse phrasings. Let’s take a look at each of these in turn.


Using SSML to Pace Conversations

On the Web, we’ve moved way past the days of plain-text HTML pages with one font and those infamous, underlined blue links. Today, you wouldn’t create a Web page without CSS, and you shouldn’t be creating voice experiences without SSML, or Speech Synthesis Markup Language. SSML lets you add pauses and pacing to conversations.

When we speak with other human beings, a dramatic pause can convey emotion. But Alexa just keeps speaking to me in the same tone and at the same pace until I practically fall asleep. To make your voice experiences sound more realistic, use the <break/> tag to insert a pause in voice responses. You can change the duration of a break by specifying attributes. For example, as shown in Figure 1, <break strength="x-strong"/> creates an extra-long break between words.

Figure 1—Specifying a break in the conversational flow
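As a rough illustration of the break tag described above, here is a minimal Python sketch that assembles an SSML response with a pause between two sentences. The helper names `pause` and `speak_with_pauses` and the sample dialogue are hypothetical, and the exact pause lengths that each strength value produces vary by platform.

```python
from xml.sax.saxutils import escape


def pause(strength=None, time_ms=None):
    """Return an SSML <break/> tag with an explicit duration or a strength."""
    if time_ms is not None:
        return f'<break time="{int(time_ms)}ms"/>'
    if strength is not None:
        return f'<break strength="{strength}"/>'
    return "<break/>"


def speak_with_pauses(*parts):
    """Join text and break tags inside a <speak> root, escaping plain text."""
    body = " ".join(p if p.startswith("<break") else escape(p) for p in parts)
    return f"<speak>{body}</speak>"


# Hypothetical dialogue: an extra-long pause before the reveal.
ssml = speak_with_pauses(
    "The door creaks open.",
    pause(strength="x-strong"),
    "Nobody is there.",
)
print(ssml)
```

When you need precise timing rather than a relative strength, `pause(time_ms=500)` produces a break with the `time` attribute instead.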

To convey excitement, it’s helpful to use the <prosody> tag to change the rate and pitch of the default speaking voice, as shown in Figure 2.

Figure 2—Using the <prosody> tag
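A small Python sketch of the same idea follows. It wraps a phrase in a prosody element with optional `rate` and `pitch` attributes. The helper name `prosody` and the sample line are hypothetical, and each platform supports its own subset of values, so check the platform documentation before relying on a particular setting.

```python
from xml.sax.saxutils import escape, quoteattr


def prosody(text, rate=None, pitch=None):
    """Wrap text in <prosody>, adding only the attributes that were given."""
    attrs = ""
    if rate is not None:
        attrs += f" rate={quoteattr(rate)}"
    if pitch is not None:
        attrs += f" pitch={quoteattr(pitch)}"
    return f"<prosody{attrs}>{escape(text)}</prosody>"


# Hypothetical line: faster and higher to suggest excitement.
excited = "<speak>" + prosody("You found the key!", rate="fast", pitch="high") + "</speak>"
print(excited)
```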

But this is not all that SSML tags can do. You can also use these tags to emphasize words, spell out acronyms, or even pronounce words in a different language. SSML adds nuance to voice responses, making them sound more natural and helping to define a unique persona for your application.
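To make those three uses concrete, here is a hedged Python sketch that emits the emphasis, say-as, and lang elements. The helper names and the sample sentence are hypothetical, and the accepted `interpret-as` values and language codes differ between platforms, so treat the attribute values as assumptions.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def emphasize(text, level="strong"):
    """Stress a word or phrase."""
    return f'<emphasis level="{level}">{escape(text)}</emphasis>'


def spell_out(text):
    """Read text letter by letter, as for an acronym."""
    return f'<say-as interpret-as="characters">{escape(text)}</say-as>'


def in_language(text, locale):
    """Pronounce text using the rules of another language."""
    return f'<lang xml:lang="{locale}">{escape(text)}</lang>'


line = (
    "<speak>Do " + emphasize("not") + " open the " + spell_out("ER")
    + " door. " + in_language("Bonne chance", "fr-FR") + "</speak>"
)
ET.fromstring(line)  # raises ParseError if the markup is not well-formed
print(line)
```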

Implementing Sound Design

Have you ever listened to an old radio show? Before TV, radio relied on using sound effects and music to tell a compelling story. The 1938 broadcast of “The War of the Worlds” is infamous for the public hysteria it created among listeners. We can use what we’ve learned from such radio shows to take our voice experiences to the next level.

Both Google and Amazon have extended SSML to enable voice designers to play sounds from their native library. For example, Google provides tags for mixing dialogue, sound effects, and music. To produce stellar results, you can change sounds’ volume, fade sounds in or out, and control the duration of sounds.

As an example, let’s take a look at a horror game I’m working on as a side project. You’ll notice that I use ominous music and sound effects to build tension.

To group different sound bites, I simply apply the <par> tag. Then I use the begin and end attributes to offset sound bites from one another. In Figure 3, you can see an abbreviated code sample that illustrates how to do this.

Figure 3—Using the <par> tag
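The sketch below shows one way such a layered scene might be assembled in Python, assuming Google's SSML extensions: a par container holding media elements with `begin`, `end`, `soundLevel`, and `fadeOutDur` attributes. The helper names, audio URLs, timings, and dialogue are all hypothetical placeholders, not the code from the game.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Hypothetical audio files; substitute your own hosted sounds.
MUSIC_URL = "https://example.com/audio/ominous-music.ogg"
GROWL_URL = "https://example.com/audio/zombie-growl.ogg"


def media(inner, attrs):
    """Wrap inner SSML in a <media> element with the given attributes."""
    rendered = "".join(f" {name}={quoteattr(value)}" for name, value in attrs.items())
    return f"<media{rendered}>{inner}</media>"


def audio(src):
    return f"<audio src={quoteattr(src)}/>"


def narration(text):
    return f"<speak>{escape(text)}</speak>"


def par(*children):
    """Children of <par> play in parallel, offset by their begin attributes."""
    return "<par>" + "".join(children) + "</par>"


scene = "<speak>" + par(
    # Background music: quieter than speech, fading out at the end.
    media(audio(MUSIC_URL), {"xml:id": "music", "begin": "0s", "end": "12s",
                             "soundLevel": "-6dB", "fadeOutDur": "2s"}),
    # Narration starts two seconds into the music.
    media(narration("Something is scratching at the cellar door."),
          {"xml:id": "line", "begin": "2s"}),
    # The growl is offset from the end of the narration.
    media(audio(GROWL_URL), {"begin": "line.end+0.5s"}),
) + "</speak>"

ET.fromstring(scene)  # raises ParseError if the markup is not well-formed
print(scene)
```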

While not all voice experiences require the immersion of a game, all can benefit from earcons, which are short sound effects that convey information. Earcons are especially useful when a voice response isn’t necessary. For example, smart-home applications use earcons to provide feedback when the user turns a light off. Because the earcon sounds at around the same time that the light goes off, the combined feedback of the earcon and the room going dark is a much more elegant way of providing feedback than the spoken response “Okay, I’m turning off the light.” As a best practice, look for clever ways to use earcons in your voice applications.
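One way to express that choice in code is sketched below: when the result of a command is already perceivable, respond with an earcon alone, and otherwise fall back to a spoken confirmation. The function name `confirm`, the `has_ambient_feedback` flag, and the audio URL are hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical earcon file; substitute your own hosted sound.
EARCON_URL = "https://example.com/audio/light-off-chime.mp3"


def confirm(action_text, has_ambient_feedback):
    """Return an earcon when the user can already perceive the result."""
    if has_ambient_feedback:
        return f"<speak><audio src={quoteattr(EARCON_URL)}/></speak>"
    return f"<speak>{escape(action_text)}</speak>"


# The room going dark is feedback enough, so only the chime plays.
print(confirm("Okay, I'm turning off the light.", has_ambient_feedback=True))
```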

Diversifying Phrasings

Out of the box, both Alexa and Google Assistant provide simple ways to vary voice responses. For each case where a voice response is necessary, you can provide a few different phrases. The system automatically selects which one to use. Even though writing numerous voice responses is better than providing just one, your app still ends up feeling stale to repeat users. With custom logic, you can create a much wider variety of answers for a voice assistant to use.

Depending on your business needs, there are many different ways to build your voice-interface logic. I like to create many sets of responses, then combine them. For example, one type of response could be transactional: phrases that the user needs to hear. A second type of response could give flavor by adding extraneous commentary that makes the response seem more conversational and natural. Additional logic could leave these flavor phrases out once in a while. In my game, I also randomize the zombie sound effects to add even more variety. Figure 4 shows some of the possibilities these capabilities can deliver.
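Here is a minimal Python sketch of that approach, with one set of transactional phrases, one set of flavor phrases, and a probability of skipping the flavor. The phrase lists, the `build_response` function, and the 70 percent default are illustrative assumptions rather than the logic used in the game.

```python
import random

# Hypothetical phrase sets for a horror game.
TRANSACTIONAL = [
    "You have three bullets left.",
    "Three bullets remain.",
]
FLAVOR = [
    "Make them count.",
    "The hallway has gone very quiet.",
    "Your hands are shaking.",
]


def build_response(rng, flavor_chance=0.7):
    """Pick a required phrase, then sometimes append a flavor phrase."""
    parts = [rng.choice(TRANSACTIONAL)]
    if rng.random() < flavor_chance:
        parts.append(rng.choice(FLAVOR))
    return " ".join(parts)


rng = random.Random()
for _ in range(3):
    print(build_response(rng))
```

Passing the random number generator in as a parameter makes it possible to seed it in tests while keeping responses unpredictable in production.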

Depending on your application’s functionality, you’ll need to craft responses in different ways. For example, let’s say you’re creating a weather app. You could create separate flavor response sets that correspond to different ranges of temperature or weather conditions. Because each response set multiplies the number of combinations, a little extra logic greatly increases the number of possible responses. Then, every time users interact with your app, it will feel fresh.
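The weather example could be sketched in Python as follows. The temperature thresholds (assumed to be in degrees Fahrenheit), the phrases, and the function names are all hypothetical. The `count_variants` helper shows how the number of distinct responses is the product of the set sizes, with one extra option for leaving the flavor out.

```python
import random

TRANSACTIONAL = [
    "It's {temp} degrees right now.",
    "The temperature is {temp} degrees.",
    "Right now it's {temp} degrees outside.",
]

# Each entry: (exclusive upper bound in degrees Fahrenheit, flavor phrases).
FLAVOR_BY_UPPER_BOUND = [
    (32, ["Bundle up.", "Watch for ice.", "Hot cocoa weather."]),
    (60, ["A light jacket should do.", "Crisp, but pleasant."]),
    (85, ["Great day for a walk.", "Enjoy it.", "Perfect patio weather."]),
    (float("inf"), ["Stay hydrated.", "Find some shade."]),
]


def flavor_set(temp):
    """Return the flavor phrases for the range that contains temp."""
    for upper, phrases in FLAVOR_BY_UPPER_BOUND:
        if temp < upper:
            return phrases
    return []


def weather_response(temp, rng):
    """Combine a transactional phrase with an optional flavor phrase."""
    base = rng.choice(TRANSACTIONAL).format(temp=temp)
    flavor = rng.choice([""] + flavor_set(temp))  # "" means no flavor
    return f"{base} {flavor}".strip()


def count_variants(temp):
    """Distinct responses: transactional phrases times (flavors plus none)."""
    return len(TRANSACTIONAL) * (len(flavor_set(temp)) + 1)


print(weather_response(20, random.Random()))
print(count_variants(20))
```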

Conclusion

The voice user-interface industry is on the cusp of a boom. We should be focusing on sound and conversation design to make voice experiences more natural and engaging. Brands that offer polished voice experiences will stand out in an endless sea of voice apps. Will yours be one of them?


Clint Miller
