User Research Doesn’t Prove Anything

By Steve Baty

Published: March 20, 2007

“In user research, as with all avenues of statistical inquiry, we’re able to demonstrate only that a hypothesis is probably true—or untrue—with some specific degree of certainty.”

Recently, I was reading through a sample chapter of a soon-to-be-published book. The book and author shall remain nameless, as shall the book’s topic. However, I was disappointed to read, in what otherwise appeared at first glance to be an interesting publication, a very general, sweeping statement to the effect that qualitative research doesn’t prove anything and, if you want proof, you should perform quantitative research. The author’s basic assumption was that qualitative research can’t prove anything, as it is based on small sample sizes, but quantitative research, using large sample sizes, does provide proof.

This may come as a shock to everyone, but quantitative research does not provide proof of anything either.

Here, I’m using the word proof in the mathematical sense, because that is the context within which the author made those statements. In mathematics, a proof is a demonstration that, given certain axioms, some statement of interest is necessarily true. The important distinction here is the use of the word necessarily. In user research, as with all avenues of statistical inquiry, we’re able to demonstrate only that a hypothesis is probably true—or untrue—with some specific degree of certainty.

Granted, I’m being pedantic; and you might think this just an interesting exercise in semantics. But let me take you through a brief survey of this topic, then perhaps you’ll appreciate the importance of this distinction.

Statistical Sampling

In general, our user research activities involve working with a small subset of our overall audience of users, to

  • gather information about a particular topic
  • test users’ response to some feature of our design solution
  • measure an increase or decrease in the efficiency of performing a certain task
  • or achieve some other, similar goal

“Our sample should reflect, as closely as possible, the composition of the full user population in the user characteristics that matter for our research.”

The size of the entire audience prohibits us from involving all of our users in our research activities.

Our first step is to select our sample from the total population of users. If we’ve done that successfully, our sample should reflect, as closely as possible, the composition of the full user population in the user characteristics that matter for our research.

For example, let’s say that we’re measuring the completion times for a set of tasks in a Web application. We should first think about the user characteristics that might make someone more or less able to complete the tasks. These characteristics might include manual dexterity—for mouse control—visual acuity or impairment, language comprehension skills, etcetera. We’re less interested in whether users are left-handed or right-handed, male or female. So when selecting our sample, we need to ensure that it represents the proportion of users with vision impairments, for example, rather than left-handedness. We refer to this attribute of the sample as its representation of the user population. In other words, our sample should be representative of the entire population.

There are a few other things we need to consider. All members of our overall user population should have an equally likely chance of our selecting them for our sample. This factor of sampling is known as randomness. So-called convenience samples—where we choose participants based on the fact that they’re close to us—obviously limit the likelihood of non-proximal users participating in our study, so they don’t satisfy the requirement for randomness.
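As a rough illustration (assuming Python and a purely hypothetical list of user IDs), drawing a simple random sample might look something like this:

  import random

  # Hypothetical pool of IDs for the full user population.
  population = ["user-%d" % i for i in range(10000)]

  # random.sample() gives every member an equal chance of selection,
  # which satisfies the randomness requirement.
  participants = random.sample(population, k=20)
  print(participants)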

Lastly, our sampling technique should ensure that selecting one person has no effect on the chance that we’ll select another person. This factor is known as independence and is the same principle that describes the probability that a coin toss will result in a head or tail showing. The chance of getting a head on a single coin toss is half, or 50%. The chance that two heads will appear in a row is ½ x ½ = ¼. However, if our first coin toss shows a head, the chance that the second toss will show a head is back to being half, because the two events—our two coin tosses—are independent of one another. (Bear this in mind the next time you see a run of five black on a roulette table. You might hear someone say, “The next one must be red; the chances of having six black in a row are really low.” But really, the odds that the next spin will come up black are exactly the same as they were for every previous spin.)
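To make the independence point concrete, here is a small, hypothetical Python simulation: whatever the first toss shows, the proportion of heads on the second toss stays near one half.

  import random

  trials = 100000
  first_was_head = 0
  head_after_head = 0
  for _ in range(trials):
      first = random.random() < 0.5
      second = random.random() < 0.5
      if first:
          first_was_head += 1
          if second:
              head_after_head += 1
  # The conditional proportion is still roughly 0.5, because the
  # two tosses are independent events.
  print(head_after_head / first_was_head)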

So, once we have selected a random, independent, representative sample, we carefully conduct our user research—survey, usability testing, etcetera—then measure our test results.

Inferential Statistics

There are two things we can do with our data:

  • Calculate summary or descriptive measures.
  • Use our sample statistics to estimate the values of those measures for the user population as a whole.

First, we usually calculate summary or descriptive measures such as

  • the arithmetic mean—what we commonly call the average
  • the mode—the most commonly measured value
  • the median—the middle value when we rank all measures

We also might measure the variance in a summary or descriptive measure—and a host of other values we can calculate from our sample. We collectively refer to these as sample statistics.
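For example, using Python’s standard statistics module and some made-up task-completion times, these sample statistics could be calculated like this:

  import statistics

  # Hypothetical task-completion times, in seconds, for ten users.
  times = [34, 41, 38, 52, 41, 47, 33, 41, 60, 39]

  print(statistics.mean(times))      # arithmetic mean (the average)
  print(statistics.mode(times))      # most commonly measured value
  print(statistics.median(times))    # middle value when ranked
  print(statistics.variance(times))  # sample variance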

The mistake that researchers often make is to stop at this point and start talking as if they have now learned something definite about the user population as a whole. Statements such as the following are all complete nonsense:

  • “78% of users think…”
  • “85% of teenagers on MySpace believe…”
  • “The majority of users will…”

“It is a remarkably simple process to make this jump from a sample to an entire user population.”

The simple words “Of the users who completed the task/survey question/etcetera…” should preface all such statements.

Of course, I said there were two things we can do with our data. The second is to use our sample statistics to estimate the values of those same measures—mean, variance, etcetera—for the user population as a whole. We do this, because, in most cases, our intent is to learn something about our entire user population.

It is a remarkably simple process to make this jump from a sample to an entire user population. The key thing to understand, in taking this step, is that, if we took a second sample and measured the same values—mean, variance, etcetera—we’d expect to get numbers that were close to, but not exactly the same as, our first sample. In fact, if we were to repeat this process over and over, the values we measured for the mean would form a standard bell-shaped curve like that shown in Figure 1—called variously the normal distribution or the Gaussian distribution, after Carl Friedrich Gauss, the 19th-century German mathematician who used it so extensively in his work on astronomy.

Figure 1—A bell-shaped curve


This characteristic of sample statistics provides us with the means by which we can estimate statistics for an entire user population from a single sample. We can use the sample mean directly as an estimate for the population mean. Because sample means form a normal distribution, we know that the actual population mean will fall, with a known probability, within a plus or minus range around our sample mean. That plus or minus range is based on the variance in the sample. To construct this range, we first transform the sample variance into a standard error, using the formula:

se = s / √n

where s is the standard deviation for the sample—that is, the square root of the variance—and n is the number of users in our sample.

Based on our sample mean, we can be 95% certain that our actual population mean will sit within the range x ± 2se, where x is the sample mean—to be precise, the actual range is x ± 1.96se for a 95% range. We can be 99.7% certain that the actual population mean will sit within the range x ± 3se.
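As a sketch of the calculation, using the same hypothetical completion times as above and the 1.96 multiplier the article describes (for a sample this small, a t value would strictly be more appropriate):

  import statistics
  from math import sqrt

  times = [34, 41, 38, 52, 41, 47, 33, 41, 60, 39]

  n = len(times)
  mean = statistics.mean(times)
  s = statistics.stdev(times)  # sample standard deviation
  se = s / sqrt(n)             # standard error of the mean

  # 95% range for the population mean: sample mean plus or minus 1.96 se.
  low, high = mean - 1.96 * se, mean + 1.96 * se
  print(round(mean, 1), round(low, 1), round(high, 1))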

Note that, in each case, there exists a small chance that the actual population mean will fall outside the range that we’ve defined. This chance is why I referred earlier to the distinction between something probably being true and something necessarily being true.

“In the absence of such techniques, the conclusions you draw from your research lack adequate foundation, and those reviewing your reports can easily reject them.”

Also, our standard error (se) is a function of the number of users in our sample. The more people we sample, the smaller the value of se and the narrower the range we define for our population mean. However, because the sample size appears under a square root, halving our estimated range requires sampling four times as many users. This is why people tend to say that quantitative studies require more users than qualitative studies. Because we’re trying to minimize the range of our estimate, we want to minimize the value of the standard error.
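A quick way to see this square-root effect, using a made-up standard deviation of 9 seconds: each fourfold increase in sample size only halves the standard error.

  from math import sqrt

  s = 9.0  # hypothetical sample standard deviation, in seconds
  for n in (25, 100, 400):
      print(n, s / sqrt(n))  # prints 1.8, then 0.9, then 0.45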

The techniques I’ve described work for all measured values—for example, task completion times—and allow you to start making more meaningful statements about your user research data. In the absence of such techniques, the conclusions you draw from your research lack adequate foundation, and those reviewing your reports can easily reject them. More importantly, the efforts you expend in conducting your research will be largely wasted for want of some simple analytical rigor.

The estimated range we can provide for the population mean gives us a reasonable likelihood—95%, say—that the real population mean will actually fall within the bounds of that range. However, not only have we been unable to pinpoint the population mean exactly, but there also still exists a slight chance—5% in this example—that the actual value will fall outside this range. More importantly, that uncertainty exists regardless of the number of users we include in our test. All we can do is narrow the range and, in so doing, get closer to the real value for the entire user population. So quantitative studies, while providing us with a method for estimating user population statistics, cannot provide us with proof. Used carefully, however, they can tell us a great deal—and if not with certainty, at least with a known amount of uncertainty.

15 Comments

I think you’re misunderstanding the distinction between qualitative and quantitative measurement and evaluation. In the social sciences at least, this distinction is about how something is measured or evaluated. It’s not about sample size. It’s about methods.

Or perhaps you’re just repeating the original author’s misunderstanding of the distinction?

“Here, I’m using the word proof in the mathematical sense…” Technically, you’re using it in the statistical sense. Probability theory is only one of many forms of mathematics.

Beyond that point, I think your overall point would have been stronger if you’d used Popper, who argued that science can never prove, only disprove. There is always an alternative explanation, however implausible. The point is to reduce the number of plausible explanations by asking falsifiable questions.

Qualitative techniques can be used to falsify hypotheses, just as quantitative approaches can. The key difference relates to generalizability. Given the small sample sizes common to qualitative research, statistical inference is often problematic, but not impossible. And generalizing from larger samples is often not so straightforward either.

Normal (Gaussian) distributions are the foundation of the most commonly used statistical tests—for example, t-tests. But non-parametric data is not distributed so evenly. For instance, wealth—depending on the society—tends toward Zipf (Power Law) distributions. Another example would be humans’ sensitivity to sound, which is generally logarithmic—the 11 on my amplifier is ten times louder than the 10, and so forth.

Non-parametric data can often be made to work more or less accurately with more common statistical approaches by converting the data to linear forms. However, this requires knowledge of the real underlying distribution, and because data are so often assumed to be normally distributed, this step is often left out. This often results in under- or over-estimations of statistical impact.

For instance, we might believe that a bi-modally distributed (u-shaped) variable was non-responsive, because there is no linear relationship observed. Yet if we were to transform that variable into linear form—leaving aside the question of missing variables—we may well discover it was statistically significant, using parametric tests. But again, unless we look carefully at its real-world distribution pattern, we would have very little reason to suspect this to be the case.
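To illustrate the commenter’s point about transforming data before applying a parametric test, here is a small, hypothetical sketch (assuming NumPy and SciPy are available): skewed, log-normal measurements are log-transformed so that an ordinary t-test becomes a reasonable choice.

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(0)

  # Hypothetical, strongly skewed measurements for two groups.
  a = rng.lognormal(mean=3.0, sigma=0.8, size=50)
  b = rng.lognormal(mean=3.4, sigma=0.8, size=50)

  # After a log transform the data are roughly normal, so a
  # parametric t-test is a defensible choice.
  t, p = stats.ttest_ind(np.log(a), np.log(b))
  print(t, p)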

It was always my understanding that there are inferential techniques that are appropriate for normal distributions, but since we often deal with distributions that are not normal, like Zipf/Pareto-distributed curves—that is, the long tail—are there techniques that deal with this type of distribution?

While I am entirely for qualitative research over quantitative research—I love how our industry is full of just slightly different words with an enormous difference in meaning, InteracTION, InteracTIVE, QUALitative, QUANTitative—I think we also need to be aware that on the other side of the spectrum are the “designers who just know,” and they don’t like any kind of research, but instead claim they innately know everything qualitative research can provide.

So here is qualitative research, stuck between a rock and a ruler.

So for those of us who do very simple UI reviews, because of budget, time, and resources, are we getting value?

I’ve found that a very small sample—3-5 people—many times pops out major issues that we had been missing. When those are corrected, another small group seems to clean up most of the others.

Then our beta release to a limited audience seems to identify most of the rest.

Any suggestions on how to manage this type of UI review for best results?

Mr. Carlson, it seems entirely appropriate to me to run beta—and pre-beta—tests with a limited pool of users, for exactly the reasons you mention. Few developers will be able to anticipate user reaction to every feature, especially when these are innovative or unusual. I would want to choose testers who are “normal” in terms of the expected user base. I’d focus on whether they are expected to be within a certain age range, disabled, or non-expert or expert users. This should give you a foundation for logical inference. Whether or not you choose to expand the research to support statistical inference would likely depend on the size of the expected user base and the expected risks associated with a sub-optimal interface. I suspect that, most of the time, logical inference is enough, but I might want more tests if the application were going to be used in a critical setting—a control system for a nuclear power plant, perhaps.

Firstly, thank you all for your comments. To respond to the specific issues you’ve raised…

Benjamin: I was referring to the distinction as being methodological, but the author’s original statement carried with it an assumption based on the practicalities of user research: that qualitative studies tend toward smaller sample sizes. I see where that might not have been clear to readers and thank you for the clarification.

Ken: I was referring to the definition of the term proof in the context of a mathematical proof—one that follows necessarily from a set of initial conditions.

I do like your reference to Popper. His proposition relates to the general arena of scientific inquiry and was a reaction against the rise of empiricism and the scientific method as the means of learning truth; whereas, I was somewhat more humbly addressing the narrower issue of inferring knowledge of a user population based on the observed or measured behavior of a sub-set of that population.

Your examples of the issues surrounding non-normal distributions are well made. I chose to concentrate on normal distributions for several reasons, but mostly because they are so familiar to most people and would, hopefully, make the main theme of the article more approachable to readers.

Gino: Yes, there are a variety of techniques available—so many, in fact, that no answer I gave here would be sufficient. Myles Hollander and Douglas A. Wolfe’s Nonparametric Statistical Methods, 2nd Edition, remains a good reference for such techniques, although it may be heavy going for the layperson.

Riaz & Dick: I am by no means arguing against the use of qualitative research techniques. They can be invaluable, particularly in exactly the scenario Dick describes. From your description of your process, your approach is an excellent application of qualitative usability testing, and one that no doubt provides you with good results.

A good introduction to statistical methods, but what are the quantitative methods? And what are the distinctions between them?

I thought your article was to correct the misunderstanding of the original author and to point out what are the pros and cons of quantitative and qualitative methods in usability studies in your own view. It seems your article has lost its focus.

You have identified one of the two most common misconceptions regarding design research. Yours, of course, is that research proves something, as opposed to indicating something.

The second is that research will make decisions for you. Regardless of the type of research, it has to be interpreted and the application thought through. Research should never make design decisions for you.

I find this article very intriguing, because it touches on the subjects of logic, argumentation, and rhetoric. I think the user experience field could benefit from better understanding those disciplines. Mathematical proofs are essentially deductive logic, while scientific methods and reliance on empirical evidence are a part of inductive logic / argumentation. From an argumentation perspective, quantitative and qualitative evidence are, in the end, both evidence that can have varying degrees of rigor. I think it’s important to heed Baty’s reminder not to overstate what we can conclude from quantitative evidence. I’d add it’s important to be aware of the rhetorical elements involved in any argument, no matter how objective it seems. (To that end, some colleagues and I find the works of David Zarefsky very useful.)

From that perspective, I find both quantitative and qualitative evidence highly valuable in developing user experience strategy and solutions. The value of quantitative evidence is already well articulated here, and some benefits of qualitative evidence have been noted. I’ll simply add that qualitative evidence helps illuminate the why behind a user experience phenomenon, which can be a rich source of ideas for new strategies and solutions.

Stef: The article was intended to address a lack of understanding of the nature of inferential statistics, in particular the process of estimating population parameters based on test samples, and to highlight the gap we must bridge when drawing our conclusions. I was, by no means, attempting to discuss or debate the relative merits of quantitative and qualitative research methods.

Colleen: Zarefsky is an interesting reference in this context. Do you have a specific article or publication in mind you’d like to share?

I’ve found Zarefsky’s introductory works, especially his lecture series for The Teaching Company, extremely valuable. See Argumentation: The Study of Effective Reasoning. He summarizes and explains the basics of argumentation and rhetoric very effectively. (I believe his specialty is argumentation and public policy/government, so those works aren’t as applicable.) Sorry for the delayed reply!

“… science can never prove, only disprove.”

Agreed. You can’t prove an alternative hypothesis. However, you can show that the null is less likely than one or more of the alternatives.

Really nice article. Keep it up! Very helpful for my work. Greetings.

There is one problem when someone makes the statement that you can’t prove anything. If this is so, then you can’t prove that you can’t prove anything, which means that you can prove something.
