It’s a great time to be a voice user interface (VUI) designer. Voice user interfaces are becoming more and more common in our daily lives. To ensure great user experiences, it’s crucial that designers lead the way in this space.
Many visual designers and interaction designers who are interested in becoming VUI designers are well placed to switch from designing more traditional graphical user interfaces (GUIs) to designing VUIs. Although all UX design disciplines share certain principles, there are some things about VUI design that differ from GUI design for Web or mobile apps. In this article, I’ll cover the main things you should keep in mind when designing VUIs.
Understanding User Intent
Let’s start with one of the biggest differences: understanding the user’s intent.
In a mobile app, you know when the user has tapped a button or selected a menu option or swiped right. In a VUI, what the user intends can be a bit more difficult to discern. To fully understand why this is the case, we need to break down the process of handling voice input into two parts:
automatic speech recognition (ASR)—Also known as voice recognition and speech-to-text, ASR refers to the process of capturing the audio input from a user’s speech and turning it into words. Generally, the system does not attach meaning to utterances at this point, but simply transcribes the words the user has said. Imagine listening to someone speak a language you don’t understand. If you tried to write down what that person was saying, you could capture the sounds of their words on paper, but you wouldn’t have any idea what they meant.
natural language understanding (NLU)—This is how the computer processes the recognized words and extracts the user’s meaning from them. In a GUI, there are typically only a few ways the user can indicate an intended action: a swipe, a tap, a click, a scroll, and so on. While there may be more than one way to take a specific action—perhaps a button on the main screen, as well as a menu option—the set of interactions is constrained. In contrast, with a VUI, the user can mean a particular thing, but say it in many different ways.
Take something as simple as setting an alarm for 8am. Here is just a sample of the different ways in which a user could say this:
Set an alarm for 8am.
Umm… Can you please set my alarm for tomorrow morning, at 8?
Hey… Set a timer for 8am. (While this isn’t technically correct—setting a timer is not the same thing as setting an alarm—it doesn’t matter. It’s what the user intended.)
Alarm 8 o’clock
I need a wakeup, tomorrow morning, 8am.
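To make the many-utterances-to-one-intent mapping concrete, here is a minimal, hypothetical rule-based matcher that resolves all of the variations above to a single set_alarm intent. Real NLU components use trained statistical models rather than keyword rules; the keyword lists and time pattern here are illustrative assumptions, not any platform’s actual grammar.

```python
import re

# Hypothetical rule-based intent matcher: a stand-in for a real NLU model.
# Each intent fires on any of several trigger words, and a time expression
# is extracted as a slot.
INTENT_KEYWORDS = {
    "set_alarm": ["alarm", "timer", "wake up", "wakeup"],
}

# Matches "8am", "8:30 pm", "8 o'clock", or a bare "8".
TIME_PATTERN = re.compile(r"\b(\d{1,2})\s*(?::\d{2})?\s*(am|pm|o'clock)?\b")

def parse(utterance):
    text = utterance.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            match = TIME_PATTERN.search(text)
            time_slot = match.group(0).strip() if match else None
            return {"intent": intent, "time": time_slot}
    return {"intent": "unknown", "time": None}

# Very different utterances resolve to the same intent:
for u in ["Set an alarm for 8am.",
          "Hey... Set a timer for 8am.",
          "Alarm 8 o'clock",
          "I need a wakeup, tomorrow morning, 8am."]:
    print(parse(u))
```

Note that the keyword approach deliberately accepts “timer” for set_alarm, mirroring the point above: what matters is what the user intended, not the technically correct term.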
Because ASR and NLU aren’t perfect, it’s important to spend a lot of time thinking about the various ways in which users might indicate their intent. It’s also vital to think through what the system would do when things go wrong—because they will.
A real-life example is asking the user “How are you today?” A common response to this question is the one-word answer “Fine.” But, it turns out, speech recognition has a harder time with short words, so it often incorrectly recognizes this response as “Find.” Your VUI must have a way of handling such errors.
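One common mitigation is to treat known misrecognitions as synonyms for the expected answer. The sketch below is hypothetical: the confusion pairs are invented for illustration, not drawn from real ASR logs, and a production system would build such a table from its own recognition data.

```python
# Hypothetical sketch: when matching a user's reply against expected
# answers, also accept known ASR confusions ("find" for "fine").
KNOWN_CONFUSIONS = {
    "find": "fine",   # short words are often misrecognized
    "grate": "great",
}

EXPECTED_FEELINGS = {"fine", "good", "great", "okay", "not great"}

def match_feeling(transcript):
    """Return the normalized answer, or None if it's a true no-match."""
    text = transcript.lower().strip(".!? ")
    text = KNOWN_CONFUSIONS.get(text, text)  # undo known misrecognitions
    return text if text in EXPECTED_FEELINGS else None

print(match_feeling("Find."))   # recovered as the intended "fine"
print(match_feeling("Good!"))
```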
In addition, ASR is not always good at understanding context, as Figure 1 demonstrates in a tweet from Paul the Trombonist—who accidentally left voice dictation on while practicing.
Image source: @JazzTrombonist
Nevertheless, technology has made terrific strides in both ASR and NLU. Back in the late 1990s, speech recognition in interactive voice response (IVR) systems, or phone systems, was just so-so, and designers had to create extremely specialized grammars to enable those systems to understand what the user said—including even things like “uh” and “um” and “thanks.” Today, recognition accuracy has reached 90% or greater—even without a constrained topic. While it’s still not perfect—and even a small gap in accuracy can result in user frustration—it’s good enough to allow us to launch great voice user interfaces.
VUI Design Principles
Just as certain design principles enable us to craft the best designs in the visual and interaction design worlds, VUIs have their own set of design principles.
Avoiding Overwhelming the User
When someone is using a GUI, they are free to browse the information on the screen at their own pace. But, with a VUI, users can’t skip ahead or easily return to previous information. Plus, cognitive load is higher for audio-only user interfaces. For example, a user can easily scroll through a list of ten items to select a song, but imagine hearing the names of ten songs read aloud, one at a time, then trying to make a choice.
For any design solution, it is, of course, important not to overwhelm the user, but this is especially important for voice-only systems. Remember, too, that audio is a slower medium. Most people can read much faster than they can listen. Be respectful of your user’s time.
Some designers interpret this to mean VUI interactions should be as short as possible, but this is not always the case. Users can handle a larger number of turns, or back-and-forths, in a conversation, as long as they feel that the system’s questions or instructions are relevant and they’re getting somewhere. It’s better to have a longer, clear conversation than to try to shove all the necessary information into just a couple of steps.
Using Progressive Disclosure
In the GUI design world, progressive disclosure refers to the concept of presenting only the options necessary to complete a task to the user, without overwhelming the user with clutter.
In the VUI world, we often refer to this concept as just in time. Remember, with voice-only systems, users cannot take in as much information at one time. With a GUI, if you’ve got a lot of options, you can provide a scrollable list and users can take their time looking the options over, then select one. With voice, don’t count on users being able to remember more than about four items—plus or minus one.
Ideally, prompts—what a VUI says—should give users an idea of what they can and can’t say, without listing the options. Plus, you can use an instruction rather than a question if that would help. Here are some examples:
“What’s your favorite color?”—No need to list them out. Users know what colors are. Just make sure your set of expected responses can handle a wide variety of colors. One person’s teal is another person’s turquoise.
“Please tell me your main symptom.”—Imagine reading out hundreds of possible symptoms! Again, you’ll need to do testing and monitoring to ensure a system can respond correctly to the very large variety of ways in which people might describe their symptoms. Allow users to give you multiple symptoms at a time, as well.
If you must provide a list of options, think about what type of items you’re presenting. If it’s a list of familiar words such as types of housing, it’s not such a difficult cognitive task. If I say, “Do you live in a house, an apartment, or an RV?” that’s easy. But if I say, “Here’s a list of restaurants: Jack’s Primo Deli, Blue Spoon Cafe, and Sushi Kuu,” and the user has never heard of them, it’s much harder. Imagine how difficult it would be if the items being presented weren’t even related to one another.
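One way to honor that four-items-or-so limit is to read a long list a few items at a time, with a “more” escape hatch. Here is a minimal sketch of that chunking; the chunk size and the last two restaurant names are my own illustrative assumptions.

```python
# Hypothetical sketch of list chunking for voice: instead of reading all
# options at once, present a few at a time and offer "more".
CHUNK_SIZE = 3  # stay well under the ~4-item short-term memory limit

def list_prompts(items, chunk_size=CHUNK_SIZE):
    """Yield the prompts a VUI would speak for a long list."""
    for start in range(0, len(items), chunk_size):
        chunk = items[start:start + chunk_size]
        prompt = "You can choose " + ", ".join(chunk) + "."
        if start + chunk_size < len(items):
            prompt += " Or say 'more' to hear more options."
        yield prompt

restaurants = ["Jack's Primo Deli", "Blue Spoon Cafe", "Sushi Kuu",
               "Taqueria Norte", "Pho Palace"]  # last two are made up
for p in list_prompts(restaurants):
    print(p)
```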
It can be tempting to tell users up front about all the possible options, but this can be overwhelming and confusing. To illustrate this point, here’s an example of a fictitious hotel booking VUI:
Welcome to Pearl’s Hotel Booker! I can help you book a hotel. Be sure and have your credit card handy. If you need help, say ‘Help.’ Also, I can’t currently book any hotels in Canada. Okay, what date are you traveling?
Whew! That’s way too much! And I’m not even thinking about booking a hotel in Canada.
Here’s an improved version:
Pearl’s Hotel Booker: “Welcome to Pearl’s Hotel Booker. How many nights will you be staying?”
Make the name descriptive, and start with the key question.
User: “It’ll be for three nights.”
Pearl’s Hotel Booker: “Three nights. Got it. Now, for what city?”
User: “Vancouver.”
Pearl’s Hotel Booker: “I’m sorry, I didn’t understand. For what city?”
Not bad, except the last part. Suppose we see in our data that a lot of users are asking for cities on the US border, such as Vancouver. Rather than having the VUI act confused, make it more informative.
Pearl’s Hotel Booker: “Sorry, I’m afraid I can only book hotels in the US. Which city would you like?”
A key principle in VUI design—and in life—is that people like what they say to be acknowledged, even if you can’t solve their problem.
Creating Multimodal User Interfaces
Up until now, I’ve been discussing principles for voice-only devices such as smart speakers like the Amazon Echo or Google Home. More devices are coming onto the market that use voice in combination with a visual display. For example, the Echo Show and the soon-to-be-released Google Smart Display are voice-forward devices rather than voice-only devices, meaning voice is their primary user interface, but not the only one.
While having a screen is not always necessary, it can come in handy in a lot of situations that would benefit from more visuals—such as browsing, shopping, watching a cooking video, playing news footage, or any time you need to present a larger set of information.
A key thing to note is that error handling and turn-taking are different for multimodal VUIs. In the current world of VUIs, the user speaks, then the system, then the user, and so on. If the system asks a question and doesn’t hear a response or doesn’t understand the response, it often reprompts with a gentle reminder—for example, “What time did you say you wanted to leave?” In a multimodal VUI, there is usually an accompanying set of visual options—such as buttons or a carousel—with which the user can interact. In such a case, if the user says something the system does not understand, it should let them choose one of the visual options, then move on rather than reprompting.
It’s not necessary to have a corresponding visual element for every question. Consider the earlier example: “Please tell me your main symptom.” This is a great question for voice-only systems. There’s no need for a huge list that would distract users. On the other hand, if the user says something the system doesn’t recognize, backing off to a menu—which would let the user use touch or voice—is a good solution. For long lists like this, remember to include a “None of the above” option. Your users may ask you for things you didn’t anticipate so, if you have another way to help them—such as a human helpline—provide a safety valve.
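The different back-off behaviors for voice-only versus multimodal devices can be sketched as a single no-match handler. This structure is hypothetical, not any platform’s actual API; the point is simply that the same error takes a different recovery path depending on whether a screen is available.

```python
# Hypothetical sketch: after a no-match, a voice-only VUI reprompts,
# while a multimodal VUI backs off to an on-screen menu and lets the
# user touch or speak a choice.
def handle_no_match(question, has_screen, options):
    if not has_screen:
        # Voice-only: gently reprompt with the original question.
        return {"speak": "Sorry, I didn't catch that. " + question}
    # Multimodal: show choices instead of reprompting, and always
    # include a safety valve for unanticipated requests.
    return {"speak": "Here are some options.",
            "display": options + ["None of the above"]}

print(handle_no_match("What's your main symptom?",
                      has_screen=False, options=[]))
print(handle_no_match("What's your main symptom?", has_screen=True,
                      options=["Headache", "Fever", "Cough"]))
```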
Creating VUI Design Deliverables
VUI designs have deliverables, just like GUI designs. To design a great VUI, start with requirements and high-level design details, just as you would with a visual design.
The next step is the VUI equivalent of visual-design mockups. However, rather than creating wireframes of screens or Web pages, VUI designers create sample dialogs, which are the pathways that could occur between the system and the user.
The earlier example, Pearl’s Hotel Booker, is a sample dialog that shows a series of turns between the VUI and the user, with the system’s prompts and a user’s possible responses. It’s best to write multiple sample dialogs, including both blue-sky paths, when things go well, and error paths, when things go off the rails.
For most VUIs, you’ll also need a flow diagram. A flow should focus on one particular use case, such as booking a room at a hotel. Be sure to include some cases where things go wrong! The flow diagram in Figure 2 is for a multimodal hotel-booking experience.
Finally, create a detailed design specification for the VUI. Sample dialogs show only a subset of all possible paths through the VUI; the specification must also cover the edge cases and, especially, the error cases, which are essential.
In usability testing for GUIs, we often ask users to narrate their task out loud as they go. For VUIs, that’s not usually possible because the system is also listening to what the user says. So it’s best to observe users’ reactions as they complete tasks, then ask follow-up questions after each task.
Wizard of Oz testing is a great way to test VUIs before they’ve been built. One quick-and-dirty approach is to ask users to text your system, but have a real human respond on the other end. While this approach is not perfect—people do not speak and text in exactly the same way—it’s a good way to get started.
Gathering user-research data early on and iterating your designs is crucial. Remember, users could respond to prompts in a variety of ways, including ways you didn’t expect. So it’s essential to gather data to build your models, then continually check your logs to see how the system is performing and where you could improve it. It is critical that your user data be anonymous, that you allow users to delete their own data, and that you delete users’ data regularly from the system.
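Checking your logs for failed interpretations can be as simple as tallying the utterances that matched no intent, then fixing the most frequent misses first. The log format below is invented for illustration; real platforms each have their own analytics schema.

```python
# Hypothetical sketch: mine interaction logs for the utterances the
# system failed to understand, so the grammar or model can be improved.
from collections import Counter

# Illustrative (already anonymized) log records, not a real log format.
log = [
    {"utterance": "set an alarm for 8am", "intent": "set_alarm"},
    {"utterance": "wake me at eight", "intent": None},  # no match
    {"utterance": "wake me at eight", "intent": None},
    {"utterance": "book a hotel in vancouver", "intent": None},
]

misses = Counter(r["utterance"] for r in log if r["intent"] is None)
for utterance, count in misses.most_common():
    print(count, utterance)
```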
Voice user interfaces are one of the newest ways in which people are interacting with computers, and this space is growing quickly. Gartner predicts that 75% of households will have smart speakers by 2020. Many developers have jumped on the bandwagon and are creating VUIs. However, for VUIs to be truly successful, they must engage users and help them complete their tasks.
Many of the best practices for designing VUIs are the same as those for creating visual designs or interactive experiences: respect your users, solve their problems in efficient ways, and make their choices clear. But there are some unique design principles for VUIs as well. Remember, we don’t always know for sure what a user’s intent was. Plus, it’s necessary to spend more time on error cases. If you keep the principles I’ve described in this article in mind, you’ll be well on your way to crafting great VUIs.
Previously, Cathy was VP of User Experience at Sensely, whose virtual-nurse avatar, Molly, helps people engage with their healthcare. An expert in voice user interface design, Cathy is the author of the O’Reilly book Designing Voice User Interfaces. She has worked on everything from NASA helicopter-pilot simulators to a conversational iPad app in which Esquire magazine’s style columnist tells users what they should wear on a first date. During her time at Nuance and Microsoft, Cathy designed VUIs (Voice User Interfaces) for banks, airlines, and Ford SYNC. She holds a BS in Cognitive Science from UCSD and an MS in Computer Science from Indiana University.