What will be the voice-technology winner of tomorrow—voice-first or multimodal user interfaces? Those working in the voice user-experience sector are avidly discussing this hot topic—and UX researchers, UX designers, developers, marketers, and entrepreneurs may find it of interest as well.
In this article, I’ll define the terms voice first and multimodal, using current products as examples, explore some use cases and rationales for different types of user interfaces, consider contemporary research, and conceptualize the future of voice user interfaces. Should you keep your product’s visual features? Yes, because, ultimately, voice-enabled, multimodal user interfaces will be the preferred user experience.
What Is a Voice-First User Interface?
In Google patents dating back to 2008, voice first has been the term for user interfaces whose primary or sole functionality is accessible through human speech. Voice-first products that are well known to mainstream consumers include the Amazon Echo Dot and Google Home smart speakers, shown in Figures 1 and 2. Smart speakers use speech recognition and synthesized speech to assist users in completing tasks such as playing music, getting answers to questions, and turning off the lights. While many people associate the idea of voice assistants with Siri, Apple’s voice assistant, Siri’s momentum has dwindled outside the smartphone context.
Since these market-leading voice-first products are readily gaining adoption in the home, competitors such as Microsoft, Samsung, and Alibaba, often described as the Chinese counterpart to Amazon, are lining up with smart speakers and other voice-enabled devices of their own. The price of most smart speakers ranges from $40 to $130, and sales are frequent. While the tech giant Facebook has not yet launched any sort of speech-recognition tool, TechCrunch and other sources have recently cited leaks about a forthcoming Facebook voice-recognition assistant called Aloha.
Voice-first products are becoming more and more ubiquitous in the USA. Following the 2017 winter-holiday season, NPR and Edison Research estimated that one in six adults in the USA had a smart speaker. Current estimates are that 50 million smart speakers are in use in the USA, a number comparable to Colombia’s entire population.
Simple, unassuming voice-first user interfaces blend into their surroundings. Most of these products are neutral in color and appearance for a reason. Some companies go so far as to talk about making the technology’s physical presence vanish. The slogan of the Parisian startup Snips is “Using Voice to Make Technology Disappear.” While a diminished physical presence, or none at all, might seem appealing in smart-home contexts, this approach could backfire. It could increase the discomfort of users who are concerned about privacy and who feel that Big Brother is always listening in, without anyone even knowing the technology is there.
As these products become more ubiquitous, this question arises: will the momentum in the purchase, adoption, and retention of voice-enabled products favor voice-first or multimodal products?
What Is a Multimodal User Interface?
Multimodal user interfaces incorporate images, icons, videos, sounds, and interactive content. Voice-enabled multimodal interfaces include Lenovo’s Smart Display with Google Assistant, which you can see in Figure 3, and Amazon Echo Show, in Figure 4. For clarity, I’ll use the term multimodal throughout the remainder of this article when referring to user interfaces that incorporate both voice and visual elements.
The price points of these products are much higher, ranging from $130 to $250, but the capabilities of these devices are also significantly greater. The Lenovo Smart Display lets users watch YouTube cooking videos and view visualizations of their Google Calendar. Google also has plans to sync information from Smart Displays across platforms, including smartphones. For example, if a user asked for directions to a restaurant, he could immediately see the directions on both his Smart Display and his smartphone.
People can use the Amazon Echo Show in much the same way as the Echo Dot, but it also offers the capability of visualizing products and using Skype to talk to family and friends. These products fill the space between smart speakers and more traditional tablets and notebook and desktop computers.
While the Lenovo Smart Display and Amazon Echo Show are currently among the best-known products, Google has announced that it is partnering with Harman and LG on smart-display products that will be available to customers soon.
While users do not explicitly think of these products as multimodal user interfaces, most are already comfortable with the daily use of the multimodal user interfaces of smartphones and wearables. Through these multimodal smart displays, users can make requests just as they would with a voice-first smart speaker. However, the response they receive will likely be multimodal—comprising both voice and visual responses—for example, showing the user the temperature on the thermostat she is adjusting or the artist performing the song she has requested. Multimodal products are more complex in their design and their content and may well be the future of user interfaces.
Example Use Cases
Today, we have a range of devices that could be voice enabled in the future. Some of these devices might be better suited to voice-first interfaces, with either a simple visual presence or none at all. However, there are many reasons why multimodal user interfaces might eventually lead the market.
Now, let’s consider some example use cases and the rationales behind them. Among these use cases, there is certainly room for voice-first interfaces to shape and diversify product offerings in these sectors. However, there are also many ways to integrate voice into the tools we already have, as well as to conceptualize new tools that would strongly benefit from being voice enabled and fully multimodal. We’ll all be researching, designing, and using these products in the coming decades.
Shopping Use Case
The user wants to buy a pair of running shoes. Most likely, the consumer would prefer a multimodal experience that would let him visualize various models of shoes and compare their prices. However, TechCrunch reports that only around two percent of Amazon Alexa users currently use the product to purchase items. It will be interesting to see whether this trend changes as more and more voice technology becomes available.
Web-Search Use Case
The user wants to know about a local shop’s store hours. For such a simple, succinct question, using a voice-first tool might be the most efficient way to get the information. However, if the user wants to get follow-up information such as directions, visual content might be useful.
Medical-Device Use Case
The user wants to do a routine check on a prescription. For such habitual actions, it makes sense that the user might not need visuals. Again, if the user’s question is brief and asks for only small amounts of information, a voice-first user interface might be the right fit. However, if the task requires long strings of numbers such as insurance information, a visual comparison of prescription tablets, or other detailed information, a multimodal user interface might prove more effective.
Transportation Use Case
The user wants to get driving directions to a new café. Multimodal information might be the best fit, allowing the user to both hear and see the name of the road on which to turn. The Garmin Speak, a GPS voice tool with Amazon Alexa, shows basic icons and numbers to help direct drivers. Seeing the intersection on a GPS is important when driving on unfamiliar roads.
Social-Media Use Case
The user wants to talk to a family member or friend in a different city. While texting and phone calls are still commonplace, the rise of FaceTime, Skype, and Google Hangouts indicates that users want to see the faces of the people they care about. Facebook is also betting on multimodal displays with the recent release of Portal, a larger-screen version of FaceTime that pivots to follow user movement. Although Facebook’s continued catastrophic data breaches might make some users hesitate before bringing this product into their homes, this product release demonstrates that the social-media company is investing heavily in multimodal smart displays.
Gaming Use Case
The user wants to play games and is already a fan of such popular games as Candy Crush and World of Warcraft. Some games provide wonderful voice-first experiences (for example, Adva Levin’s award-winning Kids Court Alexa skill), but all users have working-memory and attention-span limitations. Voice-first games must be well structured and paced to prevent the user from becoming lost. What level is the user on? What other cards, jewels, or tools can he use in this round? All of this information might be too much to keep in his head for long durations. Florian Hollandt’s 2018 Medium article discusses games with voice integration, which seems like a logical direction for voice and gaming. As an augmented-reality (AR), voice-enabled game in the 2013 movie Her illustrates, the incredibly enticing ability to talk to games and the characters in them will be essential as multimodal user interfaces advance.
Education Use Case
The user wants to brush up on her knowledge of Spanish, Python, and chemistry terms and theories. While the presentation of content and quizzing might help her to memorize specific facts, a voice-first interaction would make visualizing concepts and connecting them to other ideas somewhat trickier. Does that new word tecnología have an h in it? Which letter is the accent on? Did I name the variable in my code correctly? Which lines need debugging? What does the ring structure of methyl benzoate look like? While a voice assistant might be able to support the user adequately by providing verbal answers to some of these questions, the ability to see much of this information would be key to solidifying learning.
Research on Voice User Experiences
My recent UX research has focused on asking a young, affluent demographic about their experiences with multimodal and voice-first products. While most participants had heard of voice-first products by name, a whopping 20% reported never having used them at all. In this study, 134 participants tested a new multimodal product over the course of five weeks. Participants reported feeling 24% more comfortable using the multimodal product than using voice-first products. That huge difference in their comfort levels could have direct implications for their purchasing choices.
In interviews, participants reported that some voice-first products felt “creepy” and were frustrating because the products often had significant difficulty understanding their speech. While many users have touted recent advances in speech-recognition systems that better understand their speech, poor usability is still an often-cited user-experience issue, as 2018 reports from AnswerLab and the Nielsen Norman Group indicate.
While we must conduct more studies and more broadly disseminate what we’ve learned about users’ experiences with voice user interfaces, conclusions from my study point to concrete reasons why multimodal user interfaces might gain momentum and become the dominant choice. The most successful, market-leading products will be those that customers feel comfortable bringing into their homes and using regularly.
Conclusions and the Future
In this article, I’ve made a case for why you might want to consider keeping your product’s visual features and why voice-enabled multimodal user interfaces might become users’ preferred user interfaces in the coming years. Of course, predictions have ample room for error, and many might completely disagree with the examples and arguments in this article.
This debate will help shape the future of human-computer interactions and how they will change in the coming decade. Over time, we’ll learn what users will adapt to as the transition to a voice-enabled world continues. There may be a few product blunders and comical missteps along the way. Think back to the case of the wooden horse head that was designed to be affixed to the front of a car to help smooth the transition between buggies and cars.
Whenever there are big advances in technology, the key is to keep a finger on the pulse of trends in user experience and how they affect the purchase, adoption, and retention of new tools.
Speech Technologist and PhD Candidate at University of Arizona
Portland, Oregon, USA
As a speech technologist and PhD candidate, Joan’s research is at the intersection of speech recognition and virtual reality. She is the principal investigator on an international, educational-technology research team, working in collaboration with the New Zealand startup ImmerseMe. Her work has been published by Cambridge University Press, The Linguist List, Issues and Trends in Educational Technology, and The FLT Mag. She has spoken at VOICE Summit by Amazon Alexa, This Week in Voice Podcast, Voice and Beyond Podcast, Rosetta Stone, [email protected], and the University of Arizona iSpace Tech Talks. She is the 2018 winner of the Outstanding Graduate Student Award from the international Computer-Assisted Language Consortium. She holds an MA in Linguistics from the University of California, Davis, and undergraduate degrees from the University of Washington.