Usability Testing Is Qualitative Only If You Can’t Count
Published: February 21, 2011
I’ve recently found myself in a lot of discussions over the value of traditional user research methods. In particular, these discussions have centered on the value of that staple of user research we know as the usability test and its relevance in today’s world of Google Analytics, A/B testing, and multivariate testing.
Business Leaders Don’t Understand the Value of Usability Testing
Having spent the past several years consulting on both UX management and user-centered design best practices and, for about eight years prior to that, working with senior executives as a UX leader on staff, I’ve come to realize that too many VP- and C-level folks still have no idea how to measure the value of usability or UX design initiatives. Keep in mind that the key to long-term success in any corporate setting is proving our impact through objective metrics. Successful businesses are managed using numbers. Anyone who says otherwise is naïve.
Much of what appears to be senior management’s irrational behavior in regard to user experience in general and usability testing in particular results from their inability to get their head around how to measure the value of user experience or usability testing. In many companies, there is now strong demand to improve product usability, but most executives lack sufficient understanding of how to measure the effectiveness of UX efforts. As Peter Drucker has said, “If you can’t measure something, you can’t manage it.” This tends to lead to inaction—or worse yet, micromanagement by people who think they are Steve Jobs, but lack his UX savvy.
Failing to Defend Small Sample Sizes
I believe one key problem is that, as UX professionals, we naturally strive to come up with simple descriptions of complex things, but often fail to do so. We need to keep in mind that it is important to avoid oversimplifying to the point where we confuse both ourselves and those we collaborate with. This is especially true now that the Internet lets us broadcast our thoughts to a broad audience, in a format where casual readers Googling for quick answers often consume them without much reflection.
Let me give you an example. Last year, a contact in India asked me to review a presentation that cited Steve Mulder’s book The User Is Always Right: A Practical Guide to Creating and Using Personas for the Web. Steve’s book contains a chart that categorizes usability testing as a qualitative research method. My contact was using that chart to explain user research to the next generation of UX professionals there. Here’s the problem. That chart is very misleading; in fact, I’d say it’s just wrong.
My contact in India sent an email message off to Steve, including my comments and CC’ing me. Steve admitted he’d oversimplified things because, in his experience, the term quantitative research confuses many teams when researchers apply it to small-sample studies, and because that distinction wasn’t really the topic of his book. He also noted that, on the teams with which he’s worked, most conduct usability testing “more as interviews than observational studies, unfortunately.”
My response to Steve was that he should send his confused coworkers over to Jeff Sauro’s site instead of glossing over the issue, because Jeff does an excellent job of explaining small-sample statistics for use in design on his blog Measuring Usability. I copied Jeff on my reply, as well as some other friends on the UPA Board of Directors, who have been discussing training and documenting best practices a fair amount recently. I applaud Jeff for writing several good blog posts on the topic shortly thereafter, including “Why You Only Need to Test with Five Users (Explained)” and “A Brief History of the Magic Number 5 in Usability Testing.”
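To make the small-sample argument concrete, here is a minimal sketch of the binomial problem-discovery model that usually underlies the five-user guideline: if a problem affects a proportion p of users, the chance of observing it at least once among n participants is 1 - (1 - p)^n. The function names are hypothetical, and the 0.31 figure is only an often-cited average problem frequency used here as an assumption; actual frequencies vary by product and task.

```python
def discovery_probability(p: float, n: int) -> float:
    """Chance that a problem affecting a fraction p of users
    is seen at least once among n independent participants."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must not be negative")
    # Probability that all n participants miss the problem is (1 - p) ** n.
    return 1.0 - (1.0 - p) ** n


def users_needed(p: float, target: float) -> int:
    """Smallest number of participants whose discovery probability
    reaches the target (both given as fractions)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be greater than 0 and at most 1")
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be at least 0 and below 1")
    n = 0
    while discovery_probability(p, n) < target:
        n += 1
    return n


if __name__ == "__main__":
    # Assumed average problem frequency of 0.31 (illustrative only).
    print(round(discovery_probability(0.31, 5), 2))  # about 0.84
    # A rarer problem, affecting 10% of users, needs far more participants.
    print(users_needed(0.10, 0.80))
```

Under these assumptions, five participants give roughly an 84 percent chance of seeing a problem that affects about a third of users, while rarer problems need noticeably larger samples. That is a quantitative statement about a small-sample study, which is the point of the discussion above.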
What’s the Impact of A/B Testing’s Popularity on User Research?
So, how does all of this relate to A/B testing? Here’s how. During the recent economic downturn, several of the folks I’ve worked with over the years who are excellent user researchers have found themselves out of work. Why? Well, I suspect the popularization of A/B and multivariate testing could explain some part of this. The perception exists in the minds of many executives that A/B and multivariate testing provide better data—or more specifically, data that is quantitative—and thus, eliminate the need to do any other type of user research. This perception concerns me, as it should any UX professional.
Interestingly, unlike traditional usability testing, A/B and multivariate testing are now familiar to many of the executives I talk with today. I believe that’s because the people who have gotten involved with Web analytics tend to work in marketing research and, not surprisingly, that means they’re pretty good at communicating the value of a service like A/B testing. Or, at least, better at marketing the value of A/B testing than the human-factors types who tend to be experts in small-sample usability testing are at communicating its value. The result? Many executives—and even some UX teams—have latched onto A/B testing as if it were some sort of silver bullet. It’s not. However, as all UX professionals know, perception is often more important than reality, especially when people in powerful positions—many of whom are statistically impaired—hold a particular perception.
Instead of reiterating the points Jakob Nielsen made back in 2005, in his article on the pros and cons of A/B testing, “Putting A/B Testing in Its Place,” let me add a few points I don’t think anyone has communicated well so far.
We Must Strive to Communicate Clearly
UX professionals should strive to eliminate the misinformation that is out there about user experience—and user research and usability testing in particular. Vendors who sell services based on Web-traffic analysis or automated testing have perpetuated much of it. The rest comes from novices moving into the field, without sufficient education in basic statistics. I’m shocked by the number of candidates I’ve interviewed over the years, when building UX teams, who couldn’t answer my standard interview questions about what makes data quantitative. Many think quantitative data relates solely to sample size and have no understanding of the concepts of categorical, ordinal, interval, and ratio data sets. Let me provide some examples of how to categorize user research data:
- categorical—Nominal categories are simply labels for different types of things. For example, when UX professionals create personas, we’re essentially categorizing types of users—or market segments, if you’re considering who might use a product. You can count categories.
- ordinal—When we rank things, we’re creating ordinal data. For example, if you ask users to list their top ten ideas for improving your product, you’ll get ordinal data. You can compare ordered lists.
- interval—When we collect satisfaction ratings from users on a standard Likert scale, with ratings from one to seven, we’re collecting interval data. On an interval scale, the differences between responses are meaningful, so you can treat those measured differences as ratio data. You can state that one observed difference is double another, but not that one rating represents twice as much of the underlying amount as another, because the scale has no true zero. [Author’s correction]
- ratio—When we count how many users have successfully completed a task using a product, we obtain ratio data. A good way of quickly identifying ratio data is to ask whether the scale has a true zero, that is, whether a value of zero means none of what you’re measuring. This type of data is helpful because you can safely perform mathematical analyses on it that would not be possible with other types of data. You can safely average ratio data using any of the variations of the generalized mean that depend on multiplication and division operations, and you can compute any other statistics that rely on those operations.
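The four data types above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed analysis: the `summarize` function and all of the sample data are hypothetical, and the choice of summary statistic for each scale follows the usual reading of Stevens’s scales (mode for categorical, median for ordinal, arithmetic mean for interval, geometric mean for ratio). The satisfaction ratings are treated as interval data, as in the list above.

```python
from collections import Counter
from statistics import geometric_mean, mean, median


def summarize(scale: str, values: list):
    """Return a central-tendency summary that is meaningful for the scale."""
    if scale == "categorical":
        # Labels can only be counted, so report the most frequent label.
        return Counter(values).most_common(1)[0][0]
    if scale == "ordinal":
        # Order matters but distances do not, so use the median.
        return median(values)
    if scale == "interval":
        # Differences are meaningful, so the arithmetic mean is allowed.
        return mean(values)
    if scale == "ratio":
        # A true zero makes multiplication and division meaningful,
        # so the geometric mean is allowed as well.
        return geometric_mean(values)
    raise ValueError(f"unknown scale: {scale}")


# Hypothetical study data, for illustration only.
personas = ["novice", "expert", "novice", "admin", "novice"]  # categorical
idea_ranks = [1, 2, 2, 4, 1]      # ordinal: rank each user gave one idea
satisfaction = [5, 6, 4, 7, 6]    # interval: ratings from one to seven
task_times = [20, 40, 80]         # ratio: seconds to complete a task

if __name__ == "__main__":
    print(summarize("categorical", personas))
    print(summarize("ordinal", idea_ranks))
    print(summarize("interval", satisfaction))
    print(summarize("ratio", task_times))
```

Passing the persona labels to the interval or ratio branch would raise an error or produce nonsense, which is the code-level version of asking for an average favorite color.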
If you want to learn more about these different types of data, see the classic paper “On the Theory of Scales of Measurement,” by S. S. Stevens, an early pioneer in engineering psychology, the science behind usability testing.
If these concepts are foreign to you, you shouldn’t call yourself a user researcher—at best, you’re a skilled moderator or research assistant. Yes, I realize few industry jobs have historically required knowledge of inferential statistics, but I’m not talking about knowing the difference between ANOVA (Analysis of Variance) and ANCOVA (Analysis of Covariance). I’m talking about understanding that you can’t derive a meaningful average color no matter how many people answer the question: What is your favorite color?
Blending Qualitative and Quantitative Approaches to Research
As interest in usability and user experience increases—and I believe this is a strong trend—there will be innovations that change how we do things. A/B testing is not the first of these. Jakob Nielsen’s promotion of the concept of discount usability testing sparked the change that created most of our jobs today, moving product design research out of corporate labs and into product-development organizations. Beyer and Holtzblatt’s contextual design helped get teams out of the labs and into the field. As a profession, we need to be able to communicate the pros and cons of these new methods and their associated tools, without glossing over important differences.
It’s important that we don’t discount so-called discount usability testing by calling it qualitative, even if doing so simplifies communications in the short term. This just perpetuates the misperceptions of people who don’t realize that the mainstream scientific community has long recognized the fact that the inclusion of qualitative factors is critical in both hypothesis formulation and the analysis of any data in science. Thomas Kuhn, the famous physicist and author, wrote in The Function of Measurement in Modern Physical Science:
“Large amounts of qualitative work have usually been a prerequisite to fruitful quantification in the physical sciences.”
In taking strictly qualitative or quantitative approaches to any problem, we lose the advantages of combining these approaches, making our data subject to questions of validity or interpretation. Such questions are the main reason many usability problems go unaddressed.
So What Am I Proposing?
Instead of positioning ourselves as quantitative or qualitative user research specialists, UX professionals should strive to become experts in the selection of appropriate research methods, including anything new that promises to make user research faster, cheaper, or better in general—like the new breed of remote user research tools that have hit the market in the past five years. My guess is that the recent discontinuation of TechSmith’s UserVue was a result of too many people jumping on the A/B testing bandwagon—without fully realizing what they were giving up.
As a profession, we have failed to explain clearly the value of small-sample studies to the people who award our budgets. We need to fix that. I just hope some promising new tools that allow us to collect rich data more efficiently can help us overcome the growing bias toward doing only A/B testing, which lacks explanatory value. Instead, we should be leveraging A/B test data along with data from other worthwhile methods—including early studies, using small samples to gather qualitative and quantitative insights into what users do, why they do it, and what they really want to do.
We should also keep our minds open to other new tools and methods, including new remote usability testing techniques that tools like Userlytics have enabled. Userlytics lets you gather rich qualitative data via video capture, in much the same way traditional usability testing does, but much more efficiently. Such tools can help us gather richer data—including behavioral data and verbal protocols—in about the same amount of time A/B testing requires, but much earlier in the development process, with the result that the costs of data-driven design iterations are lower.
Otherwise, we’ll continue to waste a lot of time and money building stuff just for the sake of A/B testing it. Ready, shoot, aim is not a recipe for success. Especially when you lack the qualitative insights to figure out why you hit what you hit.
From the Editor—I want to welcome our new sponsor, Userlytics, and thank them for asking Jon Innes, who works for them on a consulting basis, to write this article for UXmatters. Userlytics provides a comprehensive remote usability testing service that includes planning, participant recruitment, testing, and reporting. Their software captures a synchronized record of participants’ interactions with their computer, spoken remarks, and facial expressions.
Drucker, Peter. Management: Tasks, Responsibilities, Practices. New York: Harper Collins, 1973.
Kuhn, Thomas S. “The Function of Measurement in Modern Physical Science.” Isis, Vol. 52, 1961.
Mulder, Steve, and Ziv Yaar. The User Is Always Right: A Practical Guide to Creating and Using Personas for the Web. Upper Saddle River, NJ: New Riders Press, 2006.
Nielsen, Jakob. “Putting A/B Testing in Its Place.” Alertbox, August 15, 2005. Retrieved February 19, 2011.
Sauro, Jeff. “A Brief History of the Magic Number 5 in Usability Testing.” Measuring Usability, July 21, 2010. Retrieved February 19, 2011.
Sauro, Jeff. “Why You Only Need to Test with Five Users (Explained).” Measuring Usability, March 8, 2010. Retrieved February 19, 2011.
Stevens, Stanley Smith. “On the Theory of Scales of Measurement.” Science, Vol. 103, No. 2684, July 7, 1946.