One constant in truly multimodal systems is change. When you could assume a customer would remain in the same context for the duration of an interaction, the only change that designers had to consider was the change in the customer state enabled by the user interface (UI). Don’t want a change? Then don’t build an affordance for that change. Simple.
Multimodal systems are defined by their flexibility. Without the ability to transition between inputs and outputs, all a multimodal system provides is a single choice up front about an engagement. It’s essentially several mutually exclusive single-mode interactions.
In reality, multimodal design is about so much more than the customer state. Within an individual state or moment, the interaction is constrained, and can be designed via fairly traditional means. The highest risk and greatest challenge in multimodal design lie in the transition states. Transition can take many forms in advanced experiences:
changing input or output modalities
switching devices during a single, end-to-end task
multiple customers sharing a single device
Design of a truly multimodal system will require you to look between moments. You’ll need to channel your inner child and look back to your classic connect-the-dots puzzles. While the dots are the foundation of those pictures, you can’t experience the image without exploring what’s in between.
Some designers refer to a transition between different modalities as an interaction cliff. When your customer transitions from one input or output method to another—whether intentionally, unintentionally, or due to necessity—they are at risk of losing context, time, and even safety to the transition. Table 9.1 details several cliff archetypes that can threaten the cohesion of your customer’s experience.
Table 9.1—Common Multimodal Cliffs
A customer transitions between two or more input modalities during a single activity.
The system changes the way it communicates with a customer in the middle of an activity.
A system responds to a customer request using an output modality that does not match the input modality the customer used to make the request.
Input Transitions and Fluid Multimodality
Increasingly, it’s possible to move fluidly between input modalities.
In some cases, this transition is voluntary: a customer is changing their preferred mode of interaction because they believe the new input modality will be easier or more appropriate for use.
In other cases, the transition is involuntary: the current input is deemed insufficient by the system, and the customer must switch inputs to continue their desired activity.
It’s important to look out for both voluntary and involuntary input transitions. Either way, these transitions are critical moments in the successful completion of an end-to-end scenario. Table 9.2 includes several example transitions from past and present consumer experiences.
Table 9.2—Example Input Transitions
A customer uses voice to set up an appointment, and the system recognizes an incorrect time. Rather than speaking the correction, the customer uses a mouse and keyboard to make the change.
Google Home Hub
A customer moves their hand in front of the camera to pause a video, and then decides to use touch to scrub backward for something they missed.
Amazon Fire TV
A customer uses a remote control to browse featured movies and doesn’t see anything good. Instead, the customer launches another streaming app using their voice.
A customer is interacting using touch, but decides abruptly to switch apps and does so by using a system gesture.
A customer is using their hands to interact with a game, but must say “Xbox, open that” to get the contents of a notification.
Observe your customers interacting with your system.
What input do they typically start with?
What input do they prefer, if shown all options?
Are they naturally driven to switch modalities partway through?
Are there common points in the process where the switch occurs?
Patterns in voluntary input transitions are both educational opportunities—attempting to make the ability to transition more discoverable—and design opportunities—ensuring that there is no unnecessary friction during or after the switch.
When your system forces customers to switch modalities, you are inherently disregarding the customer’s implied preference for input modality. Treat this moment with particular caution.
Is it clear to the customer what the next step is?
What happens if the customer can’t make the transition? For example, a TV system might force a transition from voice to remote control when the remote control is not within reach.
Who might be excluded by the forced transition? Forced transitions can be a particular challenge for inclusion and accessibility efforts.
In some cases, systems must change their own output mid-activity due to the environment or the nature of the request. While input transitions tend to occur during complex interactions, output cliffs are more likely to occur at the end of simple interactions. It is your responsibility as a designer to direct the customer’s attention to the output in its new form, as in most cases they are unlikely to expect the output transition. Table 9.3 describes a few example situations where an output transition occurs within the scope of a single interaction.
Table 9.3—Example Output Transitions
A customer initiates a hands-free request: Hey Siri, when is the next Star Wars movie coming out? Instead of a voice response, a list of search results is displayed.
A customer initiates a hands-free request: Take a photo. The system plays audio cues during the photo process, but the photo can only be displayed via a phone app.
A customer makes a change to their speaker groups while audio is playing. The change is manifested automatically via the output to the affected speakers.
When designing an output transition within the scope of a single task, be sure to bridge the cliff by providing some sort of indication in the moment that additional output is available elsewhere.
The most reliable technique for bridging the gap between outputs is to use both output modalities instead of a hard switch. Siri could reply, “I found lots of possible answers for your question. Check your phone for a list of the top results.”
You may also be able to leverage established patterns on your platform for directing attention. We did this when designing the photo experience for Amazon’s Echo Look, which doesn’t have a screen. A shutter sound indicates the photo is taken, and a notification directs the customer’s attention to the phone when appropriate.
Beyond the immediate challenges of directing customer attention within a single chunk of output, additional problems may be caused by a mismatch of customer input and system output modalities.
An important rule of thumb as a multimodal designer is to respond in kind. Just as it’s peculiar to respond to sign language with a fully spoken response, it’s a bit unexpected if your system suddenly responds to voice input with some kind of nonvoice output.
However, in certain circumstances responding in kind is either insufficient or impossible. When searching using voice, there are times when all of the potential options sound too similar to be effectively disambiguated by name alone:
Customer: Call Jane Smith.
System: Which Jane Smith? I found two matches.
NOTE: DEFINING DISAMBIGUATION
In the context of voice UI or any search-driven UI, disambiguation is essentially a form of filtering. Customers disambiguate between similar results using secondary properties or indicators.
How can the customer indicate their intended target when the results lack acoustic uniqueness? One option is to allow sophisticated filtering—for example, Jane Smith from Boston—but few systems operate at this deeper contextual level. At minimum, you can use visual output to show several pieces of information about Jane in hopes of helping customers disambiguate.
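To make the idea of disambiguation as filtering concrete, here is a minimal sketch in Python. The `Contact` type, its fields, and the `disambiguate` helper are hypothetical names chosen for illustration, not part of any real assistant platform. The sketch assumes the system has already parsed a secondary property such as a city out of the customer's follow-up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    # Hypothetical record; real address books carry many more fields.
    name: str
    city: str
    label: str  # for example "mobile" or "work"


def disambiguate(matches, **secondary):
    """Keep only the matches whose secondary properties equal the given values."""
    return [
        contact
        for contact in matches
        if all(getattr(contact, key, None) == value for key, value in secondary.items())
    ]


contacts = [
    Contact("Jane Smith", "Boston", "mobile"),
    Contact("Jane Smith", "Seattle", "work"),
]

# "Jane Smith from Boston" narrows two acoustically identical names to one.
remaining = disambiguate(contacts, city="Boston")
```

If the filtered list still holds more than one result, or comes back empty, that is a signal that voice alone may not be enough and a visual list could help.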
You might also run into a situation where the unique data about your results are too unwieldy to speak. For example, in Figure 9.1, when Siri (iOS 13.4.1) finds multiple phone numbers with the same label as results for a spoken query, she reads the phone numbers digit by digit and expects them to be spoken in kind. It’s far faster to tap the screen to make a selection. If it’s possible that such unwieldiness may be the case, allowing customers the choice of switching—or, if you know in advance there will be an issue, proactively transitioning customers to a new modality—is actually setting them up for success.
The cliff is the gap between speech input/output and visual output. When customers are speaking to a device—especially smart speakers—there is no guarantee they’re in visual range of a device, so how can they reply? And even if they’re in range, what are the odds they are looking at the screen, expecting a list?
As a designer, you’ll have to account for this cliff and find some way of directing human attention to the screen where disambiguation information is displayed.
In general, it’s preferable to avoid creating an input/output mismatch by responding in kind to customer requests. In some cases, it’s acceptable to reply to an input with an additional form of output, but a hard mismatch is likely to cause problems for discoverability and inclusion.
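One way to reason about responding in kind is to treat it as a small planning rule: answer in the modality that matches the input first, and add a second output only when the content needs it. The sketch below is illustrative only. The modality names, the `IN_KIND` mapping, and the `plan_outputs` function are all assumptions made for this example.

```python
# Hypothetical mapping from an input modality to the output that answers it in kind.
IN_KIND = {"voice": "speech", "touch": "screen", "remote": "screen"}


def plan_outputs(input_modality, available, needs_visual=False):
    """Return output modalities in order, starting with the in-kind response."""
    plan = []
    in_kind = IN_KIND.get(input_modality)
    if in_kind in available:
        plan.append(in_kind)  # respond in kind whenever the device can
    if needs_visual and "screen" in available and "screen" not in plan:
        plan.append("screen")  # additional output, not a hard switch
    if not plan:
        # Nothing matched: fall back to whatever output exists, in a stable order.
        plan.extend(sorted(available)[:1])
    return plan


# A spoken request whose results are too unwieldy to read aloud.
plan = plan_outputs("voice", {"speech", "screen"}, needs_visual=True)
```

When the plan contains both speech and screen, the spoken part is the natural place for the bridge that tells the customer where to look.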
Media Center Cliffs
Media centers such as Microsoft’s Xbox, Apple TV, or Amazon’s Fire TV demonstrate particularly noticeable input/output mismatch gaps. Even though it’s possible to open an app by name on many entertainment systems, few of them currently support fully voice-based journeys end to end.
Browsing isn’t well suited to spoken output, so most of these systems must find a way to direct attention from the verbal exchange to the visual results. Not all of them are particularly effective at directing attention with intention.
Furthermore, many media systems still shy away from allowing purchase or rental via voice due to concerns about high error rates. As a result, customers are forced to stop using their voice and pick up a remote to complete their task.
These cliffs seem innocuous until the customer’s environment is considered. Why did the customer choose voice to interact with the system? Odds are, the remote control isn’t within arm’s reach. Forcing a customer to begin interacting with a physical controller that may not be available is likely to cause frustration, task abandonment, and even exclusion.
Ask yourself what obstacles are preventing you from supporting the customer’s chosen input end-to-end. You might face one or more problems such as:
Is there technical debt on the platform side preventing this end-to-end interaction?
Is there an organizational concern that adding this functionality will impact sales negatively?
Are you facing a simple lack of resources?
Once you understand the drivers behind these obstacles, find ways to capture both the cliffs and the drivers behind them in your backlog for future commitments.
Between Network Connections
Another form of generally involuntary transition of experience is a loss of connectivity. In the early days of cloud service offerings, it was seen as reasonable to suspend interaction when connectivity was suspended. After all, the data’s in the cloud, and you need network access to connect to the cloud!
But customer tolerance for such inflexible engineering is waning. As digital systems go from optional to critical path, it’s no longer acceptable to spend any length of time without service, even if that service is degraded due to network connection transitions. Table 9.4 explores the five most common types of network problems that could impact your experience.
Table 9.4—Network Connection Issues
Intentional Connection Loss
Customer chooses to interrupt their connection.
Switching to airplane mode on a plane
Intermittent or Unstable Connection
Connection drops in and out frequently over time, or the strength of connection varies.
Connection at high speeds, in car or bus
Connecting in rural or developing areas
Connection is noticeably slower than expected or needed.
Concerts or disasters, where a high volume of people attempt to use limited network resources
Device moves too far from point of service.
Smart watches and fitness devices when a customer walks away from their phone
Network connection that should be available ceases to function for unknown reasons.
Cut cable or power outage—often beyond a customer’s control
NOTE: HANDHELD GAMING
My time working on portable gaming consoles forced me to see these connection problems as our problem. If there’s one thing you learn from developing games on Nintendo devices, it’s that you are not absolved of responsibility to provide a good experience for your customers when the connection fails. We even had to build makeshift Faraday cages to simulate some of these conditions prior to certification!
While it is sometimes reasonable to prevent customers from starting an interaction when the network conditions are poor, it’s much worse to deny customers continued access once they’ve begun an interaction due to a transition in their network conditions. So how do you cope?
If your system detects a network transition that is impacting the customer experience, find a way to let your customers know.
While it’s true that you’re rarely the cause of network issues, and while it’s even true that your customer’s operating system is probably alerting them at some level about connection problems, that doesn’t absolve your product from doing the same. Without carrying this transparency through your own product, you risk causing panic when a customer who’s not looking at the big picture believes your app has lost data.
Some common patterns for network connection transparency:
At a minimum, some apps display a “Not connected” or “No network connection” warning, as in the Mail app in iOS (13.4.1) that is shown in Figure 9.2, which includes a very small indicator that network connectivity is down. Did you spot it at first glance?
The Outlook mobile app goes a bit farther, communicating not just the connection status but the scope of missing data and the next steps upon connection. Without this in-app awareness, customers might misinterpret a lack of new data as a quiet inbox. Microsoft’s Outlook mobile app for iOS has evolved to include rich information even during periods of low connectivity, as shown in Figure 9.3, heading off the panic that an artificially empty inbox might cause.
In a hands-free world, not all customers will be in range of visual indicators when connectivity issues occur. Amazon Echo devices do display a red indicator when a connection isn’t present for a significant period of time. However, if customers attempt to interact during an outage, they will receive a spoken message along these lines: I’m having trouble connecting to the Internet right now. Please try again in a little while.
Ideally, you’ll communicate the fact that there is a network issue, and not just that there’s some generic problem. Without that specificity, how will your customers know to look into the problem and whether they can fix it? But while the figures contained here are visual, remember that your customer may not be in range of a visual indicator.
NOTE: CATCH-404: PLANNING AHEAD
When planning for network connection errors, remember that all assets and error messages must be stored locally. All of Alexa’s connection-error messages are stored locally as MP3s since under those circumstances the text-to-speech service would not be available. Use the same logic for any mission-critical icons, graphics, videos, or texts that your customer might need during an outage.
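The planning-ahead advice can be sketched as a simple fallback: try the cloud voice if it is reachable, and otherwise play an asset that shipped with the device. Everything named here is hypothetical, including the `LOCAL_PROMPTS` table, the file path, and the `synthesize` callable that stands in for a text-to-speech service. This is not how Alexa is implemented, only one way the idea might look.

```python
# Hypothetical table of error prompts bundled with the app (the path is illustrative).
LOCAL_PROMPTS = {"network_down": "prompts/network_down.mp3"}


def choose_prompt(error_code, tts_available, synthesize=None):
    """Return ("tts", audio) when the cloud voice works, else ("local", path)."""
    if tts_available and synthesize is not None:
        try:
            return ("tts", synthesize(error_code))
        except OSError:
            # The connection dropped mid-request; fall through to the bundled asset.
            pass
    return ("local", LOCAL_PROMPTS.get(error_code, LOCAL_PROMPTS["network_down"]))


# During an outage the text-to-speech service cannot be reached.
source, prompt = choose_prompt("network_down", tts_available=False)
```

The important design choice is that the final `return` never touches the network, so the customer always hears something.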
Modern devices are too sophisticated for brittle connection models. In today’s conditions, you must assume that your customers will encounter intermittent, unstable, and insufficient connections on a fairly regular basis—especially for any mobile or wearable devices.
In many cases, basic resiliency means some form of caching. Save a record of any changes made locally until you have received firm confirmation that those changes have been successfully posted to your cloud service.
Avoid depending on regular heartbeats or communications. Build systems that can skip a beat and still function within reason.
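Both resiliency ideas above can be sketched in a few lines. The `Outbox` class keeps each local change until a send is confirmed, and `still_connected` tolerates a couple of missed heartbeats before declaring the connection lost. The names, the `send` callable, and the default interval values are assumptions for illustration, so treat this as a starting point rather than a production design.

```python
class Outbox:
    """Holds local changes until the server confirms each one."""

    def __init__(self):
        self._pending = {}  # change_id -> payload, kept in insertion order
        self._next_id = 1

    def record(self, payload):
        """Save a change locally and return its id."""
        change_id = self._next_id
        self._next_id += 1
        self._pending[change_id] = payload
        return change_id

    def flush(self, send):
        """Try to send every pending change; return how many remain unconfirmed."""
        for change_id, payload in list(self._pending.items()):
            try:
                confirmed = send(change_id, payload)
            except OSError:
                break  # connection dropped; retry the rest on the next flush
            if confirmed:
                del self._pending[change_id]
        return len(self._pending)


def still_connected(seconds_since_last_beat, interval=30, allowed_misses=2):
    """Treat the link as alive until more than `allowed_misses` beats are skipped."""
    return seconds_since_last_beat <= interval * (allowed_misses + 1)
```

A caller might invoke `flush` whenever connectivity returns. Because a change is deleted only after confirmation, a dropped connection mid-flush loses nothing.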
NOTE: RESILIENCE AND NEW MARKETS
A product that can handle unstable connections may find itself relevant in entirely new markets. Many products designed for use in the United States are impractical in developing countries and rural areas. Ask yourself: What new customers and markets could you encounter if you built in tolerance for low-connection scenarios?
Plan for an Offline Mode
There are plenty of conditions where customers will choose to disable their Internet connections. Of course, the most common is the classic airplane-mode scenario on planes. But beyond that scenario, consider these other situations:
Your customer is abroad and can only access the Internet from sporadic Wi-Fi connections.
Your customer is affected by a widespread Internet outage.
Your customer can’t connect to your service due to a problem on your end.
Your customer is concerned about the safety or security of the connections available to them.
Your customer is on a metered connection and must limit their access.
Offline modes seem like a “nice to have” until you consider how little you and your customer control their connections. An offline mode is an excellent way to handle the need for resiliency in low-bandwidth situations while also supporting your customers during fully interrupted communications.
Think back to the R (Relationship) in CROW, from Chapter 2, “Capturing Customer Context.” Odds are that your customer has relationships with many devices in their life. For all of the rich possibilities that exist on a single multimodal device, there’s another continuum of experience beyond a 1:1 relationship between device and customer. How might your experience scale or stretch to accommodate the other device relationships that your customer deals with on a regular basis?
Multiple Devices, Single Environment
When the Amazon Echo was initially released, the beta nature of the release meant you could assume there was a single Echo device in each household. However, multidevice households began to emerge within the first year of the product’s release. The arrival of more affordable devices such as the Echo Dot compounded that trend.
But the combination of far-field microphones and multiple devices can be problematic without forethought. If a customer has multiple Alexa devices within earshot of each other, they will all respond unless the customer has set unique wake words for each device. Figure 9.4 depicts a floor map of a potential multi-Alexa household, where similarly capable devices share a small space and several of them may respond to a single request.
While the wake words are a valid strategy for now, this is not a graceful solution. It places all the burden on customers to identify the problem, learn about their options, and change the configuration.
A fairly unexplored solution for the multidevice, single-environment scenario is device arbitration. What if all of the devices that heard a customer could briefly confer and choose the best representative to lead the interaction? Metrics could include the following:
Recency—Which device received the last interaction?
Multimodality—Did a device recently receive a nonvoice input such as touch?
Proximity—Which device is closest to the customer?
Appropriateness—Which device makes the most sense for the request?
Device arbitration does require that all devices be aware of each other and able to communicate in real time. But many networks can handle this sort of interaction with some engineering work. And device arbitration will become even more critical as people’s devices become almost universally multimodal—and as cross-compatibility means more devices have the same feature set.
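As a thought experiment, the four metrics above could be combined into a simple score. The sketch below is one hypothetical way to do that. The `Candidate` fields, the weights, and the scoring formula are invented for illustration and do not describe any shipping arbitration system. It also assumes that the devices have already exchanged their observations.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    seconds_since_last_use: float  # recency
    recent_touch: bool             # multimodality
    distance_m: float              # proximity (an estimate)
    can_fulfill: bool              # appropriateness for this request


def score(candidate):
    """Higher is better; the weights are arbitrary placeholders."""
    total = 2.0 if candidate.recent_touch else 0.0
    total += 1.0 / (1.0 + candidate.seconds_since_last_use / 60.0)
    total += 1.0 / (1.0 + candidate.distance_m)
    return total


def arbitrate(candidates):
    """Pick one device to lead the interaction, or None if none is suitable."""
    eligible = [c for c in candidates if c.can_fulfill]
    return max(eligible, key=score, default=None)


heard = [
    Candidate("kitchen", 600, False, 1.0, True),
    Candidate("living room", 30, True, 4.0, True),
    Candidate("bedroom speaker", 5, False, 0.5, False),
]
leader = arbitrate(heard)
```

In this example the living room device wins because it was touched recently, even though the kitchen device is closer. Whether that is the right trade-off is exactly the kind of question customer research should answer.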
Multiple Devices, Single Scenario
An emerging best practice is to allow customers to suspend an interaction on one device and resume it on another device. When an experience moves beyond the boundaries of a single device, a great experience requires these devices to speak the same language. What information needs to be shared between devices? Where is it stored and for how long?
Microsoft’s Outlook products have long struggled with calendar notifications. Customers might leave their laptop at their desk to attend meetings with their phone and return to half a dozen stale meeting reminders piled up at their digital workstation.
While the intent behind the redundant cross-device notifications is good (a desire to ensure that customers don’t miss any information), it often has the opposite effect. A pile of stale meeting reminders drives the signal-to-noise ratio too low, and customers begin to dismiss all of them or ignore the noisy channels.
When pushing notifications or content to customers with whom you interact on multiple devices, consider what signals you might interpret to determine the most appropriate device for that information.
Do you know where the customer’s last interaction was?
Does the type of content limit the devices that may be relevant?
Can and should you remove a notification on all platforms when it is dismissed from one device?
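The three questions above map onto a small amount of shared logic. This sketch assumes a hypothetical device record with an `id`, a `last_active` timestamp, and a `supports` set of content types. It also assumes dismissals are stored in one shared place. None of these names come from a real notification service.

```python
def pick_device(devices, content_type):
    """Choose the most recently used device that can show this content type."""
    capable = [d for d in devices if content_type in d["supports"]]
    if not capable:
        return None
    return max(capable, key=lambda d: d["last_active"])["id"]


class NotificationState:
    """Shared record of dismissals so one dismissal clears every device."""

    def __init__(self):
        self.dismissed = set()

    def dismiss(self, notification_id):
        self.dismissed.add(notification_id)

    def visible(self, notification_ids):
        """Return the notifications a device should still show."""
        return [n for n in notification_ids if n not in self.dismissed]


devices = [
    {"id": "laptop", "last_active": 1000, "supports": {"reminder", "document"}},
    {"id": "phone", "last_active": 5000, "supports": {"reminder"}},
]
target = pick_device(devices, "reminder")
```

With a shared `NotificationState`, a reminder dismissed on the phone would no longer be waiting on the laptop when the customer returns to their desk.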
From time to time, your customer may want to intentionally transition from one device to another. Table 9.5 captures three of the most common intentional transitions.
Table 9.5—Common Directed Transition Archetypes
Your customer is moving between physical spaces or adapting to changing conditions.
At the end of a drive, you are directing a podcast from the car to continue in the home.
Google Assistant-enabled devices share a state so that a podcast can be resumed from anywhere.
Your customer has multiple ways to complete a task and prefers different devices for different tasks.
Instead of reaching for a remote control, a customer chooses to use their phone because it’s the closest device at hand.
The Denon HEOS AV receiver allows customers to adjust settings from a mobile app, in addition to the remote control and the receiver’s physical controls.
Your customer’s goals have changed, and they have hit the limits of capabilities on a particular device.
A customer is reviewing a spreadsheet on their phone, but switches to a PC to make changes to several formulas.
Microsoft Office products track the latest position in cloud documents. When picking back up on a new device, customers see Welcome back! Here’s where you left off.
From the Chromecast to the Nintendo Switch, these sorts of directed transitions are becoming more commonplace. Keep an eye out for potential situations where your customers may expect or need the ability to swap devices midscenario. In many cases, these scenarios can be supported with a bit of early planning about the key elements of your customer’s state that you’ll need to share across all devices.
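To illustrate what planning for shared state might involve, here is a minimal sketch of a resumable playback state serialized as JSON. The `PlaybackState` fields and the small rewind on resume are assumptions for this example. A real product would need to decide where the state is stored and how long it lives.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PlaybackState:
    content_id: str
    position_seconds: float
    updated_at: float  # epoch seconds when the state was captured


def save_state(state):
    """Serialize the state so any device can read it back."""
    return json.dumps(asdict(state))


def resume_from(serialized, rewind_seconds=5.0):
    """Return the position to resume at, backing up slightly to restore context."""
    state = PlaybackState(**json.loads(serialized))
    return max(0.0, state.position_seconds - rewind_seconds)


# Captured in the car, then read back by a device at home.
shared = save_state(PlaybackState("episode-42", 125.0, 0.0))
resume_at = resume_from(shared)
```

Rewinding a few seconds is a design choice rather than a requirement. It acknowledges that the customer probably lost some context during the transition.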
More Complex Than Wake Words
The wake word fix for multidevice Alexa households doesn’t scale. Of the limited wake word options, computer and Amazon are too common in ordinary speech to be truly useful. My household uses computer under duress to keep both downstairs devices—Alexa and computer—and an upstairs Echo device from responding at once, and it becomes a problem every time we watch an episode of Star Trek.
This is currently a good problem to have. Certainly, this many devices in a small home is pushing the platform beyond system specifications. But as more devices gain far-field voice recognition, many more mainstream consumers will run up against this kind of challenge unless the experience is designed thoughtfully. Wake words won’t always be enough.
The industry still tends to assume that devices are owned and operated by a single person. From tablets to mobile phones, this assumption is baked in at the most basic levels. But many of today’s devices exist in shared environments.
Household use of devices such as laptops, desktops, and smart speakers is fairly common. But when multiple people share a single device under a single account, state and context are often lost. To provide a more seamless personal experience in these shared scenarios, many apps and devices support the creation of multiple identities or profiles.
How will customers know which profile is active?
How will they switch profiles?
Are there any settings or states that should be shared across profiles?
Do you need different types of profiles? (A common split is adult versus child or minor profiles.)
While many devices now support multiple customer logins, there’s rarely support for multiple people using a device as equal peers. For example, the Xbox family of devices allows multiple profiles to be logged in at once—but only one of those profiles sees their favorites, and that lead profile controls all app logins. If Aya is subscribed to Hulu, but Jo is the primary person signed in, Jo won’t be able to open Hulu without swapping profiles first.
Voice-controlled devices also struggle with simultaneous usage by multiple people. Humans tend to talk over each other, and their voices can be hard to distinguish. Furthermore, it’s fairly easy for one customer to make a request from another customer’s profile.
How often are your customers in the room with other people when they interact with your experience?
Are those customers attempting to collaborate or simply to make requests and commands?
Do all customers share the same access levels? If not, how will your customers know whose access is being applied to a request?
Apply It Now
Unlike the desktop systems of yesterday, multimodal and cross-device experiences are wibbly-wobbly, timey-wimey adventures through space and time. Along those lines, the designs for these experiences turn out to be similar to the TARDIS from Doctor Who: bigger on the inside than they appear.
Consider your customer’s travel between:
Modalities—Transitions between input modalities have a disproportionate impact on overall user experience.
Make every effort to minimize involuntary transitions.
Support voluntary transitions whenever possible.
Ensure that you respond in kind to your customer’s requests.
Networks—Your customer’s connection to the Internet is not a fixed point in space or time. Proactively consider network instability, failures, and insufficiency to ensure that you’re not designing for a fictional customer context.
Devices—Your customers don’t exist in a vacuum, and they’re surrounded by devices.
Will your device share its environment with similar devices? If so, how can you imbue your device or experience with situational awareness to lead to better experiences?
Will your customer interact with you on multiple devices? If so, how might you avoid spamming your customers with content across all the devices they’re using?
When should you support a customer’s intentional transitions between devices?
People—Many of today’s devices are shared in some way. When multiple people share a single device, personal preferences and context are often lost along the way.
How might support for different customer profiles make your experience more compelling?
When, how, and why would your customers switch profiles?
Are there any situations where multiple customers are using the device at the same time? How might you help them pool their resources and context?
Principal Designer at the Bill & Melinda Gates Foundation
Principal Designer & Owner at Ideaplatz, LLC
Seattle, Washington, USA
Cheryl is an internationally renowned interaction designer who is best known for her work on a wide variety of emerging technologies and products, including Amazon’s Alexa voice platform and the Echo Look; Microsoft’s Cortana and the Azure platform; and groundbreaking early titles for the Nintendo DS. At the Bill & Melinda Gates Foundation, she is currently focusing on improving digital collaboration. Her design-education firm, Ideaplatz, offers design instruction that empowers today’s designers to build tomorrow’s experiences. Cheryl holds a degree in computer science and human-computer interaction from Carnegie Mellon University. She has been a professional improviser and performer for more than a decade.