Show and Tell: Imagining the User Experience Beyond Point, Click, and Type

By Jonathan Follett

Published: February 25, 2008

“The ability of software to recognize increasingly complex patterns, like the nuances of speech and visual representations of people, provides us with possibilities for human/computer interaction that could vastly reduce the need for textual communication.”

More reliable and permanent than human memory, the technology of written language dominates as the primary method human beings use for conveying abstractions of complex ideas across space and time. The evolution of written language has complemented that of new distribution technologies—from handwritten papyrus scrolls to books and other print publications produced on offset printing presses to the pixels on our computer screens.

However, we have now reached a point at which other technologies have begun to seriously compete with written language as viable methods for not only recording our ideas, but also interacting with the world around us. The nature of our communications is changing rapidly. Immersed in these changes as we are, it’s difficult to evaluate the rate of change, but audio and video are slowly superseding text. This is not to say that text is facing extinction—but its function as the primary means of conveying information is no longer certain. And while the rise of audio and video content preceded popular use of the personal computer, application software, and the Internet, the marriage of all these technologies is creating new forms of communication. One factor—the ability of software to recognize increasingly complex patterns like the nuances of speech and visual representations of people—provides us with possibilities for human/computer interaction that could vastly reduce the need for textual communication.

What does this mean for designers of user experience? It means our tools for interaction are changing. The revolution the graphical user interface brought us will pale in comparison to the transformative change in user experience that’s on its way.

Complex Pattern Recognition

“Computers are capable of scouring large data sets and quickly and efficiently recognizing complicated patterns in them—ones that humans would have difficulty discovering….”

It’s no secret that computers are capable of scouring large data sets and quickly and efficiently recognizing complicated patterns in them—ones that humans would have difficulty discovering, no matter how much time they had to study them.

Video and Picture Data

While we can easily identify something as complex as the face of a familiar person, we’d probably be stumped if asked to pick someone out in a large crowd at a sporting event. Security systems at casinos, border crossings, and even the Super Bowl have used facial recognition software for years, with varying degrees of success. In such systems, cameras capture images of human faces, then software compares the features of those faces to photos in a database and finds matches.

While the need for security was the driving force behind the development of this kind of software, we are beginning to apply such technology to other purposes, with intriguing results. For instance, Viewdle created its facial recognition software to index digital video, allowing owners of DV content to extract metadata from their libraries without manually reviewing and tagging the video recordings. So, if you have hours of uncatalogued digital footage of news and entertainment shows, Viewdle could review and tag your video with people’s names—perhaps enabling you to find and even sell those rare Jakob Nielsen sightings you have tucked away on your hard disk to the highest bidder.

Figure 1—Viewdle identifies people in digital videos

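Under the hood, the matching step in systems like these can be surprisingly simple once a face image has been reduced to a numeric feature vector. The Python sketch below illustrates only that final comparison step; the extraction of feature vectors (the genuinely hard part) is assumed to happen upstream, and the names and numbers are entirely hypothetical.

```python
# A minimal sketch of face matching, assuming faces have already been
# reduced to numeric feature vectors by some upstream recognition engine.
import numpy as np

def cosine_similarity(a, b):
    """Measure how closely two feature vectors point in the same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical database mapping names to precomputed face feature vectors.
known_faces = {
    "Jakob Nielsen": np.array([0.12, 0.85, 0.31, 0.56]),
    "Jane Doe":      np.array([0.90, 0.10, 0.44, 0.22]),
}

def identify(face_vector, threshold=0.95):
    """Return the best-matching name, or None if nothing is close enough."""
    best_name, best_score = None, threshold
    for name, known_vector in known_faces.items():
        score = cosine_similarity(face_vector, known_vector)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# A vector extracted from a new video frame, close to the first entry.
print(identify(np.array([0.11, 0.84, 0.33, 0.55])))  # -> Jakob Nielsen
```

Real systems add many refinements (pose normalization, lighting correction, more robust distance measures), but the core idea of comparing a captured face against a database of known faces is the same.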

There have been significant advances in the recognition of complex patterns in images from other types of software as well. Photosynth, a Microsoft Live Labs product, analyzes pictures and pieces together related photos in three-dimensional space. For example, the program can identify the relationships between various shots of a building and, no matter the data source—whether an amateur digital snapshot, mobile phone picture, professional photograph, or high-res film scan—can reconstruct a multidimensional structure by overlaying photos, one atop another. At the March 2007 Technology, Entertainment, Design (TED) conference, Blaise Aguera y Arcas, one of the creators of Photosynth, demonstrated its capabilities to the crowd’s delight, showing a complete rendering of Notre Dame cathedral composed of thousands of photos the Photosynth software compiled automatically from Flickr.

Figure 2—Demonstration of Photosynth technology at the TED conference

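Photosynth’s actual pipeline is far more sophisticated, but one of its building blocks, matching distinctive keypoints between overlapping photos, is easy to sketch. The Python snippet below assumes each photo has already been reduced to an array of local feature descriptors (in the spirit of the SIFT-style matching used in structure-from-motion research); the descriptor data is made up for illustration.

```python
# A minimal sketch of keypoint matching between two overlapping photos.
# Real systems like Photosynth add geometric verification and full 3D
# reconstruction on top of this step.
import numpy as np

def match_keypoints(descriptors_a, descriptors_b, ratio=0.8):
    """Ratio test: keep a match only when its nearest neighbor in the
    other photo is clearly better than the second-nearest neighbor."""
    matches = []
    for i, descriptor in enumerate(descriptors_a):
        distances = np.linalg.norm(descriptors_b - descriptor, axis=1)
        nearest, second = np.partition(distances, 1)[:2]
        if nearest < ratio * second:
            matches.append((i, int(np.argmin(distances))))
    return matches

# Hypothetical descriptors: photo B shares its first five keypoints
# with photo A, plus eight unrelated ones.
photo_a = np.random.rand(10, 128)
photo_b = np.vstack([photo_a[:5] + 0.01, np.random.rand(8, 128)])
print(match_keypoints(photo_a, photo_b))  # roughly [(0, 0), ..., (4, 4)]
```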

With both Viewdle and Photosynth, we are beginning to see patterns of photo and video data evolving into a language that computer software can understand. Visual information saturates our daily lives through body language, facial expressions, environmental indicators, and other signs and symbols we’ve come to recognize. We look at a gray sky and hypothesize that it will rain. We see cracks in a foundation and worry about a building’s structural integrity. We see a frown and wonder what we did wrong. For the present, all of these subtle visual cues, plain as day to us, remain hidden knowledge to computers. However, that is beginning to change, and as a result, input devices like the video camera on your notebook computer or mobile phone may become not merely tools for capturing and broadcasting pixels, but methods of interaction that enable much richer user experiences.

Audio Data

“Just as big strides in video- and photo-recognition software are changing the landscape of visual data, developments in audio recognition and analysis are similarly reshaping the way computers enter the world of sound.”

Just as big strides in video- and photo-recognition software are changing the landscape of visual data, developments in audio recognition and analysis are similarly reshaping the way computers enter the world of sound. In fact, in some ways, computers already know more about audio than we do.

In “The Formula,” an October 2006 New Yorker article, Malcolm Gladwell described the audio software of startup Platinum Blue:

In a small New York loft, just below Union Square...there is a tech startup called Platinum Blue that consults for companies in the music business. Record executives have tended to be Humean: though they can tell you how they feel when they listen to a song, they don’t believe anyone can know with confidence whether a song is going to be a hit, and, historically, fewer than twenty percent of the songs picked as hits by music executives have fulfilled those expectations. Platinum Blue thinks it can do better. It has a proprietary computer program that uses ‘spectral deconvolution software’ to measure the mathematical relationships among all of a song’s structural components: melody, harmony, beat, tempo, rhythm, octave, pitch, chord progression, cadence, sonic brilliance, frequency, and so on. On the basis of that analysis, the firm believes it can predict whether a song is likely to become a hit with eighty percent accuracy.

Platinum Blue’s software has successfully predicted such unexpected megahits as Norah Jones’s album “Come Away with Me” and Gnarls Barkley’s “Crazy.” When a song doesn’t meet the criteria for a hit, Platinum Blue’s consultants use that analysis to advise record-industry executives and producers on how they can improve their work. This kind of algorithmic understanding of human experience means that, while such software may not be capable of truly learning our preferences, it may be able to anticipate them.
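Platinum Blue’s program is proprietary, but the general shape of such a system is no secret: extract numeric features from each song, then fit a statistical model against past hits and misses. The Python sketch below uses scikit-learn’s logistic regression as a stand-in for whatever model the firm actually uses; the features and training data are entirely invented.

```python
# A minimal, hypothetical sketch of hit prediction from song features.
# Platinum Blue's actual features, model, and data are not public.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented feature vectors: [tempo, harmonic complexity, sonic brightness]
past_songs = np.array([
    [120.0, 0.62, 0.81],  # hit
    [ 98.0, 0.35, 0.40],  # miss
    [128.0, 0.70, 0.77],  # hit
    [ 75.0, 0.22, 0.30],  # miss
])
was_hit = np.array([1, 0, 1, 0])

model = LogisticRegression().fit(past_songs, was_hit)

# Score a new song against the patterns learned from past releases.
new_song = np.array([[118.0, 0.65, 0.75]])
print("Estimated hit probability:", model.predict_proba(new_song)[0][1])
```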

Voice recognition is another intriguing input technology now coming of age. Numerous software products are slowly, but surely, improving the computer’s ability to take human speech as input. One such product currently receiving plenty of attention—and advertising dollars—is Ford’s Sync system, developed with Microsoft, which recognizes the voice commands of drivers, allowing them to make mobile phone calls or play specific songs from their iPod music collections hands free.

Figure 3—Sync listens to a driver’s verbal commands

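Sync’s internals are not public, but the final step in any such system, mapping recognized speech onto device commands, is straightforward to sketch. In the Python snippet below, the transcripts are assumed to arrive from an upstream speech recognizer, and the two-command grammar is purely illustrative.

```python
# A minimal sketch of voice-command dispatch. The speech recognizer
# itself (audio in, text out) is assumed to exist upstream.
def handle_command(transcript):
    """Map a recognized utterance onto a device action."""
    words = transcript.lower().strip()
    if words.startswith("call "):
        return f"Dialing {words[5:].title()}..."
    if words.startswith("play "):
        return f"Playing '{words[5:].title()}'..."
    return "Sorry, I didn't catch that."

print(handle_command("Call Jonathan Follett"))  # Dialing Jonathan Follett...
print(handle_command("Play Crazy"))             # Playing 'Crazy'...
```

Notably, the hard problem, turning sound into reliable text, lives entirely in the recognizer; once speech becomes text, dispatching commands is ordinary software.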

Show and Tell Instead of Point and Click

“The rich information we convey through verbal and nonverbal modes of communication…will become more important as we integrate visual and voice-recognition technologies into our user experiences.”

The mainstream application of such visual and voice-recognition technologies in combination can support a much richer user experience by increasing the types of input we can use with our computer systems. Currently, the methods of interaction we encounter in many digital products, services, and devices are highly unnatural and artificial. We adapt our behavior to these methods of input, and we point, click, and type, because doing so is necessary if we want to avail ourselves of the capabilities these products offer.

But when we interact with other people, we rely upon methods of human communication that have evolved throughout our history—such as our natural storytelling abilities. We show and tell. The rich information we convey through verbal and nonverbal modes of communication—body language, facial expressions, tones of voice, gestures, and so on—will become more important as we integrate visual and voice-recognition technologies into our user experiences.

Imagining the Future of User Experience

As computers become capable of recognizing non-text data input, we can imagine the potential for improvements and enhancements in user experience across a wide range of digital products and services. For example, perhaps you’d like to purchase a replacement for a malfunctioning auto part from eBay, but know neither the manufacturer’s name nor the model number. You could show the computer what you mean—holding the part in front of your video camera to input that data.

“As computers become capable of recognizing non-text data input, we can imagine the potential for improvements and enhancements in user experience across a wide range of digital products and services.”

Or picture the glee of the Daily Show writers of tomorrow who need only ask their computers to find them relevant sound bites and video clips of the last dozen times a politician contradicted himself, without requiring any cataloguing or tagging by a human.

And, since social networks like Facebook, MySpace, and LinkedIn already connect our personal and professional information with our portrait photos, it’s easy to see the possibilities for facial-recognition software to enhance information exchange at networking events or other professional gatherings. Taking a snapshot of a speaker at a podium, a demonstrator in a company’s booth, or a convention attendee with your camera phone could immediately reveal to you the person’s name, résumé, portfolio, or other background information. A surreptitious snapshot would mean you’d never again embarrass yourself by failing to recall the name of someone you’d previously met. Of course, the potential for abuses and breaches of privacy with such facial-recognition technology is high. And it’s likely that, as we work through such issues, the powers that be might impose severe limitations on the uses of such software.

As we slowly progress toward freedom from the keyboard and technology enables us to convey information to computers without requiring us to type, designers of user experiences will not only need to develop tools and techniques for managing these new kinds of interactions, but also begin to consider the social concerns and consequences these user experiences will bring to the fore. And while these fantastical possibilities might seem far off, their real-world analogues are, in fact, much closer to becoming part of everyone’s reality than you might think.

2 Comments

A couple of points: One is that whilst the Platinum Blue thing is fascinating, it isn’t actually a new departure in technology. If anything, it’s using computer technology in the way it was originally intended, to replace manual data processing and computation. The basic algorithms that it uses need to be based on comparing the measurable parameters of large numbers of records, both hit and miss (Do we still say miss?) against their success rates, and producing statistical correlations of those features against success. It’s basically a human activity that involves a huge amount of computation, which is made feasible by handing the number crunching over to a computer. The actual input of data is basic telemetry technology, no different in principle to the kind of thing that’s been used in dull stuff like the petro-chemical industry or aerospace for 50 years now.

I’m sure that when they show their results to record company execs, they do it with the sort of animated graphics favored by Hollywood blockbusters, full of animated, rotating wireframe pictures, techno-bleeps and burps, and VERY LARGE TEXT. In my experience, record company people are pretty credulous and easily impressed, and a whizzy presentation will make ‘em feel they’ve got their money’s worth and that they’re at the cutting edge of technology.

The other point is speech recognition, where recent advances could lead the unwitting to think that machines can understand what we’re saying—if only they could! It’s very easy to confuse recognizing words with recognizing the meaning of speech. The first exists now, on PCs and a large number of mobile phones. The second remains one of the great unsolved problems of AI research. However, using speech recognition for controlling devices has a great deal of potential for improving the way we interact with them, and whilst that may not help media types generate script ideas at the expense of hapless politicians, it will make life with electronics a lot less painful for most of us. Automotive demand is presently driving developments in speech recognition, but that is likely to be matched by demands from accessibility lobbies in the near future. The era of machines understanding us is still very distant, but the era of machines doing what we say is here and now.

Frightening that you want to “escape the keyboard”—and that you suggest others might want to, as well. :)

What’s interesting—and a great sign for me, a writer—is that you had to type out this whole article in order to communicate it. Suggests there’s hope for the written word yet.
