Usability Testing Is Qualitative Only If You Can’t Count

By Jon Innes

Published: February 21, 2011

“Too many VP- and C-level folks still have no idea how to measure the value of usability or UX design initiatives.”

I’ve recently found myself in a lot of discussions over the value of traditional user research methods. In particular, the value of that staple of user research we know as the usability test and its relevance in today’s world of Google Analytics and A/B and multivariate testing.

Business Leaders Don’t Understand the Value of Usability Testing

Having spent the past several years consulting on both UX management and user-centered design best practices—and, for about eight years prior to that, working with senior executives as a UX leader on staff—I’ve come to realize that too many VP- and C-level folks still have no idea how to measure the value of usability or UX design initiatives. Keep in mind that the key to long-term success in any corporate setting is proving our impact through objective metrics. Successful businesses are managed using numbers. Anyone who says otherwise is naïve.

Much of what appears to be senior management’s irrational behavior in regard to user experience in general and usability testing in particular results from their inability to get their heads around how to measure the value of user experience or usability testing. In many companies, there is now strong demand to improve product usability, but most executives lack sufficient understanding of how to measure the effectiveness of UX efforts. As Peter Drucker said, “If you can’t measure something, you can’t manage it.” This tends to lead to inaction—or worse yet, micromanagement by people who think they are Steve Jobs, but lack his UX savvy.

Failing to Defend Small Sample Sizes

“The use of the term quantitative research confuses many teams when researchers apply it to small-sample studies.”

I believe one key problem is that, as UX professionals, we naturally strive to come up with simple descriptions of complex things, but often fail to do so. We need to keep in mind that it’s important not to oversimplify to the point where we confuse both ourselves and those we collaborate with. This is especially true now that the Internet lets us broadcast our thoughts to a broad audience, in a format where casual readers Googling for quick answers often consume them without much reflection.

Let me give you an example. Last year, a contact in India asked me to review a presentation that cited Steve Mulder’s book The User Is Always Right: A Practical Guide to Creating and Using Personas for the Web. Steve’s book contains a chart that categorizes usability testing as a qualitative research method. My contact was using that chart to explain user research to the next generation of UX professionals there. Here’s the problem. That chart is very misleading; in fact, I’d say it’s just wrong.

My contact in India sent an email message off to Steve, including my comments and CC’ing me. Steve admitted he’d oversimplified things because, in his experience, the use of the term quantitative research confuses many teams when researchers apply it to small-sample studies, and that wasn’t really the topic of his book. He also noted that, on the teams with which he’s worked, most conduct usability testing “more as interviews than observational studies, unfortunately.”

My response to Steve was that he should send his confused coworkers over to Jeff Sauro’s site instead of glossing over the issue, because Jeff does an excellent job of explaining small-sample statistics for use in design on his blog Measuring Usability. I copied Jeff on my reply, as well as some other friends on the UPA Board of Directors, who have been discussing training and documenting best practices a fair amount recently. I applaud Jeff for writing several good blog posts on the topic shortly thereafter, including “Why You Only Need to Test with Five Users (Explained)” and “A Brief History of the Magic Number 5 in Usability Testing.”

What’s the Impact of A/B Testing’s Popularity on User Research?

“The perception exists in the minds of many executives that A/B and multivariate testing provide better data—or more specifically, data that is quantitative—and thus, eliminate the need to do any other type of user research.”

So, how does all of this relate to A/B testing? Here’s how. During the recent economic downturn, several of the folks I’ve worked with over the years who are excellent user researchers have found themselves out of work. Why? Well, I suspect the popularization of A/B and multivariate testing could explain some part of this. The perception exists in the minds of many executives that A/B and multivariate testing provide better data—or more specifically, data that is quantitative—and thus, eliminate the need to do any other type of user research. This perception concerns me, as it should any UX professional.

Interestingly, unlike traditional usability testing, A/B and multivariate testing are now familiar to many of the executives I talk with today. I believe that’s because the people who have gotten involved with Web analytics tend to work in marketing research and, not surprisingly, that means they’re pretty good at communicating the value of a service like A/B testing. Or, at least, better at marketing the value of A/B testing than the human-factors types who tend to be experts in small-sample usability testing are at communicating its value. The result? Many executives—and even some UX teams—have latched onto A/B testing as if it were some sort of silver bullet. It’s not. However, as all UX professionals know, perception is often more important than reality, especially when people in powerful positions—many of whom are statistically impaired—hold a particular perception.

Instead of reiterating the points Jakob Nielsen made back in 2005, in his article on the pros and cons of A/B testing, “Putting A/B Testing in Its Place,” let me add a few points I don’t think anyone has communicated well so far.

We Must Strive to Communicate Clearly

“Many think quantitative data relates solely to sample size and have no understanding of the concepts of categorical, ordinal, interval, and ratio data sets.”

UX professionals should strive to eliminate the misinformation that is out there about user experience—and user research and usability testing in particular. Vendors who sell services based on Web-traffic analysis or automated testing have perpetuated much of it. The rest comes from novices moving into the field without sufficient education in basic statistics. I’m shocked by the number of candidates I’ve interviewed over the years, when building UX teams, who couldn’t answer my standard interview questions about what makes data quantitative. Many think quantitative data relates solely to sample size and have no understanding of the concepts of categorical, ordinal, interval, and ratio data sets. Let me provide some examples of how to categorize user research data; a short sketch after the list shows which summary statistics suit each type:

  • categorical—Nominal categories are simply labels for different types of things. For example, when UX professionals create personas, we’re essentially categorizing types of users—or market segments, if you’re considering who might use a product. You can count categories.
  • ordinal—When we rank things, we’re creating ordinal data. For example, if you ask users to list their top ten ideas for improving your product, you’ll get ordinal data. You can compare ordered lists.
  • interval—When we collect satisfaction ratings from users on a standard Likert scale—with ratings from one to seven—we’re collecting interval data. This lets you say that you’ve observed a difference of a certain size: the differences between responses on an interval scale can themselves be treated as ratio data. So you can state that one difference is double another, but not that one response reflects twice as much of the underlying quantity—as confusing as that sounds. [Author’s correction]
  • ratio—When we count how many users have successfully completed a task using a product, we obtain ratio data. A good way of quickly identifying ratio data is to ask whether the data has a true zero point. This type of data is helpful because you can safely perform mathematical analyses on it that would not be possible with other types of data. You can safely average ratio data, using any of the variations of the generalized mean, which depend on multiplication and division operations, or any other statistics that rely on those operations.
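To make these distinctions concrete, here’s a minimal sketch in Python, using entirely invented values, of which summary statistics are appropriate for each scale. The variable names and data are hypothetical; the same logic applies in a spreadsheet or any statistics package.

```python
# Hypothetical data, for illustration only -- not from any real study.
from statistics import mode, median, mean
from math import prod

# Nominal (categorical): persona or segment labels. Counting is fine; averaging is not.
segments = ["admin", "analyst", "admin", "guest", "analyst", "admin"]
print("Most common segment:", mode(segments))             # frequencies and modes only

# Ordinal: ranked improvement ideas. Order matters; the distances between ranks do not.
search_ranks = [1, 2, 3, 1, 2]                             # rank each user gave "better search"
print("Median rank of 'better search':", median(search_ranks))

# Interval: 7-point Likert satisfaction ratings. Differences are meaningful;
# ratios of the raw scores are not, because there is no true zero.
ratings = [5, 6, 4, 7, 5]
print("Mean satisfaction:", mean(ratings))                 # arithmetic mean is safe here

# Ratio: task times (or completion counts) have a true zero, so multiplication,
# division, and the geometric mean are all legitimate.
task_times = [32.0, 41.5, 28.0, 36.2]                      # seconds, invented
geometric_mean = prod(task_times) ** (1 / len(task_times))
print("Geometric mean task time:", round(geometric_mean, 1), "seconds")
```

The point is not the code itself, but that the set of legitimate operations shrinks as you move from ratio data toward nominal data.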

If you want to learn more about these different types of data, see the classic paper “On the Theory of Scales of Measurement,” by S. S. Stevens, an early pioneer in engineering psychology, the science behind usability testing.

If these concepts are foreign to you, you shouldn’t call yourself a user researcher—at best, you’re a skilled moderator or research assistant. Yes, I realize few industry jobs have historically required knowledge of inferential statistics, but I’m not talking about knowing the difference between ANOVA (Analysis of Variance) and ANCOVA (Analysis of Covariance). I’m talking about understanding that you can’t derive a meaningful average color no matter how many people answer the question: What is your favorite color?

Blending Qualitative and Quantitative Approaches to Research

“As interest in usability and user experience increases…, there will be innovations that change how we do things. … We need to be able to communicate the pros and cons of these new methods….”

As interest in usability and user experience increases—and I believe this is a strong trend—there will be innovations that change how we do things. A/B testing is not the first of these. Jakob Nielsen’s promotion of the concept of discount usability testing sparked the change that created most of our jobs today, moving product design research out of corporate labs and into product-development organizations. Beyer and Holtzblatt’s contextual design helped get teams out of the labs and into the field. As a profession, we need to be able to communicate the pros and cons of these new methods and their associated tools, without glossing over important differences.

It’s important that we don’t discount so-called discount usability testing by calling it qualitative, even if doing so simplifies communications in the short term. This just perpetuates the misperceptions of people who don’t realize that the mainstream scientific community has long recognized that qualitative factors are critical to both hypothesis formulation and the analysis of data in science. Thomas Kuhn, the physicist turned historian and philosopher of science, wrote in The Function of Measurement in Modern Physical Science:

“Large amounts of qualitative work have usually been a prerequisite to fruitful quantification in the physical sciences.”

In taking strictly qualitative or quantitative approaches to any problem, we lose the advantages of combining these approaches, making our data subject to questions of validity or interpretation. Such questions are the main reason many usability problems go unaddressed.

So What Am I Proposing?

“We should be leveraging A/B test data along with data from other worthwhile methods—including early studies, using small samples to gather qualitative and quantitative insights into what users do, why they do it, and what they really want to do.”

Instead of positioning ourselves as quantitative or qualitative user research specialists, UX professionals should strive to become experts in the selection of appropriate research methods, including anything new that promises to make user research faster, cheaper, or better in general—like the new breed of remote user research tools that have hit the market in the past five years. My guess is that the recent discontinuation of TechSmith’s UserVue was a result of too many people jumping on the A/B testing bandwagon—without fully realizing what they were giving up.

As a profession, we have failed to explain clearly the value of small-sample studies to the people who award our budgets. We need to fix that. I just hope some promising new tools that allow us to collect rich data more efficiently can help us overcome the growing bias toward doing only A/B testing, which lacks explanatory value. Instead, we should be leveraging A/B test data along with data from other worthwhile methods—including early studies, using small samples to gather qualitative and quantitative insights into what users do, why they do it, and what they really want to do.

We should also keep our minds open to other new tools and methods, including new remote usability testing techniques that tools like Userlytics have enabled. Userlytics lets you gather rich qualitative data via video capture, in much the same way traditional usability testing does, but much more efficiently. Such tools can help us gather richer data—including behavioral data and verbal protocols—in about the same amount of time A/B testing requires, but much earlier in the development process, with the result that the costs of data-driven design iterations are lower.

Otherwise, we’ll continue to waste a lot of time and money building stuff just for the sake of A/B testing it. Ready, shoot, aim is not a recipe for success. Especially when you lack the qualitative insights to figure out why you hit what you hit.

From the Editor—I want to welcome our new sponsor, Userlytics, and thank them for asking Jon Innes, who works for them on a consulting basis, to write this article for UXmatters. Userlytics provides a comprehensive remote usability testing service that includes planning, participant recruitment, testing, and reporting. Their software captures a synchronized record of participants’ interactions with their computer, spoken remarks, and facial expressions.

References

Drucker, Peter. Management: Tasks, Responsibilities, Practices. New York: Harper Collins, 1973.

Kuhn, Thomas S. “The Function of Measurement in Modern Physical Science.” Isis, Vol. 52, 1961.

Mulder, Steve, and Ziv Yaar. The User Is Always Right: A Practical Guide to Creating and Using Personas for the Web. Upper Saddle River, NJ: New Riders Press, 2006.

Nielsen, Jakob. “Putting A/B Testing in Its Place.” Alertbox, August 15, 2005. Retrieved February 19, 2011.

Sauro, Jeff. “A Brief History of the Magic Number 5 in Usability Testing.” Measuring Usability, July 21, 2010. Retrieved February 19, 2011.

Sauro, Jeff. “Why You Only Need to Test with Five Users (Explained).” Measuring Usability, March 8, 2010. Retrieved February 19, 2011.

Stevens, Stanley Smith. “On the Theory of Scales of Measurement.” Science, Vol. 103, No. 2684, June 7, 1946.

12 Comments

This is a great article. I’m surprised to see no comments yet. I would like to add a couple of thoughts for discussion:

Apart from metrics, I have seen value in showing before-and-after designs for small-sample studies. This has helped execs quickly understand how iterative research helped shape the final product. You can take it a step further and quantify the iterative process by showing the number of prototype iterations and design improvements.

Regarding adoption of new remote tools, it is the complexity of the tools that bothers me. We do a lot of remote research using GoToMeeting and a phone. We tried using advanced tools (UserVue, Morae), but the process is such a pain that we eventually fall back to simple technology. The irony is that most usability tools are not really usable. I hope Userlytics can change this and create value for researchers.

Jon, thank you for your great analysis of the situation in UX.

However, I should note that you’ve made an error in your description of the levels of measurement. Ratios between different values on interval scales are not meaningful. On the other hand, interval scales are sufficient for averaging values.

Here is what Wikipedia says about interval scales: “Ratios between numbers on the scale are not meaningful, so operations such as multiplication and division cannot be carried out directly. But ratios of differences can be expressed; for example, one difference can be twice another.”

Uggirala: I agree with you that showing before-and-after designs can be a powerful way of illustrating the value of iterative design. However, keep in mind that your audience may not always appreciate how much small changes can impact usability.

This is especially true because interaction-related elements of designs are hard to represent in the static mockups we typically use in presentations. Showing the designs along with some basic descriptive statistics about the associated task completion rates is much more compelling. It’s even better when you can show video clips of users actually using the different designs, which is what Userlytics allows.

Presentations showing some basic stats along with video clips of users interacting with the UI reinforce the concept of UCD. Such presentations also reduce the subjectivity of the argument that the changes you’re showing are improvements.

Regarding your comments on remote tools: I agree that, in the past, many of the tools in this space have been overly complex and hard to use. I’m encouraged by what I’ve seen so far from Userlytics. The fact that they are thinking about this from a holistic perspective, including planning, recruiting, and reporting, shows they are thinking about the UX of the tool itself.

Ivan,

Thanks for your comment. Just to clarify, I’m not saying you can treat interval data as ratio data. My example was probably poorly worded and unclear. I was trying to explain that you can interpret differences on interval data sets—such as what is commonly collected via Likert scales—as meaningful.

As you noted in your quotation from Wikipedia, you can treat one difference as twice that of another when interpreting these questions. You can also calculate certain special types of averages from interval data, but you have to be more careful.

Not all assumptions that generally apply to advanced statistics hold with interval data. As you point out, we can use certain types of means with interval data—such as the arithmetic mean—but not the generalized mean, which many readers might confuse with the commonly used term average.
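To illustrate why only some means are safe with interval data, here is a small Python sketch with made-up ratings. Because a Likert scale’s zero point is arbitrary, recoding a 1 to 7 scale as 0 to 6 describes exactly the same responses; arithmetic-mean differences survive that recoding, while ratios of geometric means, one member of the generalized-mean family, do not.

```python
# A minimal sketch, with invented ratings, of why the arbitrary zero point matters.
from statistics import mean
from math import prod

def geometric_mean(values):
    return prod(values) ** (1 / len(values))

design_a = [3, 4, 3, 4]                    # ratings on a 1 to 7 Likert scale
design_b = [6, 7, 6, 7]

recode = lambda xs: [x - 1 for x in xs]    # the same answers, recoded as 0 to 6

# Arithmetic-mean differences are unchanged by the recoding...
print(mean(design_b) - mean(design_a))                      # 3.0
print(mean(recode(design_b)) - mean(recode(design_a)))      # still 3.0

# ...but ratios of geometric means shift with the arbitrary zero point,
# which is why they are not meaningful for interval data.
print(geometric_mean(design_b) / geometric_mean(design_a))                  # ~1.87
print(geometric_mean(recode(design_b)) / geometric_mean(recode(design_a)))  # ~2.24
```

In other words, any statistic that relies on multiplying or dividing the raw scores quietly assumes a ratio scale.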

For anyone really interested in more coverage of this topic aimed at user research professionals, I highly recommend Tom Tullis and Bill Albert’s book Measuring the User Experience.

Very good article, indeed. I surely need to study more about this. Thanks a lot.

“This lets you say that users rating a feature a six are twice as satisfied as users rating the same feature a three.”

Well, you can say they “are twice as satisfied,” but it’s not really valid or meaningful. It’s nonsense to talk about ratios of satisfaction. But you can say that users rating a feature a six rated it twice as high as users rating the same feature a three, which should be just as useful.

Great analysis, Jon. Qualitative data is often undervalued; quantitative data, on the other hand, is treated like gold.

While it’s true that most usability tests and many types of user research don’t reach statistical significance, this does not mean those studies cannot generate valuable quantitative data.

But for some reason, people often mistake the word quantitative for meaning statistically significant. I presume this is why many UXers, among other professionals, unnecessarily shy away from using the term. In reality, it just means that the information is quantifiable.

As UX professionals, we’re only hurting ourselves if we dilute the business significance of the measurable data we can help to uncover. Before we can sell ourselves or our expertise, we first need to understand the vocabulary!

@DavidBardwell

It’s good to see this sort of depth in the field.

It’s important to understand what statistics really tell us. The basic assumption is that if a result is statistically significant, it is true. Frankly, most statistical analyses are performed in violation of statistical theory’s assumptions about sampling, so there is no mathematical basis for drawing conclusions from the data anyway. If you don’t have a random sample, you can’t extrapolate any findings to anyone other than the people who provided the data.

It all comes down to this: Whatever data we gather, whether quantitative or qualitative, is flawed. But it is the best information we have available.

Communicating why the information we provide is valuable is the key to our success in helping others understand that value.

Nick,

You make a great point. I added the original examples for categorizing the data during a last-minute editorial round. In the version we originally published, the definitions were a bit oversimplified. This reminds me of a quote often attributed to Einstein: “Make everything as simple as possible, but not simpler.”

After seeing the other comments above, I asked the UXmatters team to update the article to clarify the example for interval data.

You should avoid saying that anything is twice the size of something else when talking about the underlying items you’re measuring on interval scales, because you don’t know the absolute zero point of the underlying item. That means you can’t talk about the size or amount of those items. However, there are some things you can say.

When studying three designs (A, B, and C), you can say that one design (B) had double the impact on satisfaction scores when compared to another design (C) and measured against the prior design (A) as a reference. That’s because you’ve defined a non-arbitrary reference point (the score for A) for the comparison.

Since that’s pretty confusing, I normally recommend a different approach. Just reporting exactly how many users responded with a certain rating works well. For example, if you were using a 7-point Likert scale, I’d recommend reporting how many users rated things at each point—for example, 10 users rated B a 7. This makes it easier to interpret when relatively more users respond with a better rating in a later study or when comparing a different design.
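To make that concrete, here’s a brief sketch using invented 7-point scores for designs A, B, and C. It first tallies how many users gave each rating, then compares differences in mean satisfaction against design A as the reference; none of the numbers come from a real study, and Python is just one convenient way to do the tally.

```python
# Invented 7-point Likert scores, for illustration only.
from collections import Counter
from statistics import mean

ratings = {
    "A": [3, 4, 3, 4, 3, 4, 3, 4, 3, 4],   # prior design, used as the reference point
    "B": [6, 7, 6, 7, 5, 6, 7, 6, 5, 6],
    "C": [5, 5, 5, 5, 4, 5, 5, 5, 4, 5],
}

# Approach 1: report how many users gave each rating -- easy to interpret directly.
for design, scores in ratings.items():
    print(design, dict(sorted(Counter(scores).items())))

# Approach 2: compare differences in mean satisfaction against the reference design A.
gain_b = mean(ratings["B"]) - mean(ratings["A"])            # 6.1 - 3.5 = 2.6
gain_c = mean(ratings["C"]) - mean(ratings["A"])            # 4.8 - 3.5 = 1.3
print(f"B's gain over A is {gain_b:.1f} points, C's is {gain_c:.1f}; "
      f"B's improvement is {gain_b / gain_c:.1f}x C's.")
```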

I’ll leave significance testing, sampling bias, and statistical power out of this discussion for now, but they need to be considered, as I’m sure others will point out.

David,

I agree. It’s the combination of qualitative and quantitative data that’s really valuable. They say you can’t know something unless you can quantify it. However, I’d argue you can’t quantify something unless you understand the underlying qualities of whatever you are trying to measure.

Christian,

Thanks for your compliment. I respectfully have to disagree with your statement: “Whatever data we gather, whether quantitative or qualitative, is flawed.”

I’d say the problem is flawed analysis and research methods, not the data itself. The analysis of the data can be improved. It’s not the best it could be. Only once we, as a profession, recognize this is a problem can we fix it. And I’d hope we can reduce the amount of flawed analysis work in UX before our field loses credibility.

All science depends on our best understanding. However, UX research is not often performed with the same level of rigor as other types of research.

That’s largely why it’s often not seen as valuable. Nobody values an analysis that lacks credibility.
