Sample Size Oddities
Published: November 17, 2008
It might seem counterintuitive, but the larger the proportion of a population that holds a given opinion, the fewer people you need to interview when doing user research. Conversely, the smaller the minority of people who share an opinion, the more people you need to interview.
Mariana Da Silva has written an article about sample sizes in market research—or user research—titled “The More the Merrier.” In the article, Mariana made a comment that has caused some consternation—and for good reason.
“It all comes down to the size of the effect you intend to detect. Imagine you wanted to know whether people in London are taller than people in New York. If people in London and people in New York are actually pretty much the same height, you will need to measure a high number of citizens of both cities. If, on the other hand, people in London were particularly tall and people in New York were shorter than average, this will be obvious after measuring just a handful of people.”—Mariana Da Silva
Surely, popular thinking went, the larger the difference, the more people you’d need to ask to make sure it was real? It makes intuitive sense, but ignores the underlying principles of probability theory that govern such situations.
Now, before there’s a stampede for the exit, this article is not going to be heavy on mathematics, probability, statistics, or any other related esoterica. What we’re going to do is take a look at the underlying principles of probability theory—in general terms—and see how we can make use of them to understand issues such as the following:
- how many people to include in a usability test
- how to efficiently identify population norms and popular beliefs
- how to do quick-and-easy A/B test analysis
Then we’ll move on to take a look at a case study that shows why a large sample size doesn’t always guarantee accuracy in user research, when such situations can arise, and what we can do about it.
Understanding Optimal Usability Test Size
Across the usability landscape, conventional wisdom holds that you can do usability testing with just a handful of users, a view captured by the title of Jakob Nielsen’s Alertbox article from 2000, “Why You Only Need to Test with 5 Users.” With more users, you’ll see diminishing returns on each successive test session, because earlier participants are likely to have already found the bulk of the issues a new participant encounters.
Nielsen provides the reasoning that each user, on average and in isolation, identifies approximately 31% of all usability issues in the product under evaluation. So, the first test session uncovers 31% of all issues; the second also uncovers 31%, but some of those overlap with the issues from session 1; and so on. After five test sessions, you’ve already recorded approximately 85% of all issues, so the value of the sixth, seventh, and subsequent test sessions gets lower and lower.
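The diminishing-returns argument can be sketched numerically. This is a minimal illustration, assuming every user independently finds the same 31% share of issues, which is a simplification; the function name `proportion_found` is hypothetical.

```python
def proportion_found(n_users: int, p_single: float = 0.31) -> float:
    """Expected share of all issues seen by at least one of n_users testers.

    Assumes each tester independently finds a fraction p_single of the issues.
    """
    # An issue is missed only if every single tester misses it.
    return 1.0 - (1.0 - p_single) ** n_users


previous = 0.0
for n in range(1, 9):
    found = proportion_found(n)
    # Print the cumulative share and how much this session added.
    print(f"{n} users: {found:.1%} found (+{found - previous:.1%})")
    previous = found
```

Under these assumptions, each extra session adds less than the one before it, so the gain from the sixth user is much smaller than the gain from the second.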
Identifying Norms and Minority Views
There’s another way to explain this observation: Some problems are more widespread, or are experienced by more people, than others. Because we choose users at random for our usability tests, the more prominent problems are the ones that are likely to show up early and repeatedly.
In other words, as we do tests with more users, we not only learn about what issues people experience, but if we look at the overlap between the issues users find—even with only a handful of users—we gain an understanding of which problems are likely to be the most widespread among the target audience. So, even with a very small test base, we can be reasonably sure we’re identifying the problems that will affect the biggest proportion of the user base.
We can use the same principle to identify new features or changes to a product that would appeal to the most people—a principle ethnographers use to identify population norms among cultural groups. If we ask a small group of people—selected at random—what product changes they’d like to see, the most popular suggestions from the entire user population are the ones with the biggest probability of appearing in any small group of users you’ve chosen at random.
But this also highlights a danger of small sample-size tests and surveys: Minority voices don’t get heard. The issues that affect small segments of the target population are less likely to show up in a small random sample of users—and so, you’re more likely to miss them.
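To make the contrast between widespread and minority issues concrete, here is a rough sketch using the binomial distribution. It assumes participants are drawn independently at random and that a fixed share of the population is affected by each issue; the function name `prob_at_least` and the example shares of 60% and 5% are illustrative assumptions, not figures from any study.

```python
from math import comb


def prob_at_least(k: int, n: int, prevalence: float) -> float:
    """Probability that at least k of n random participants hit an issue.

    prevalence is the assumed share of the population affected by the issue.
    """
    return sum(
        comb(n, i) * prevalence**i * (1 - prevalence) ** (n - i)
        for i in range(k, n + 1)
    )


# Chance that an issue is reported by two or more of five participants.
for share in (0.60, 0.05):
    print(f"prevalence {share:.0%}: {prob_at_least(2, 5, share):.1%}")
```

With five participants, an issue affecting most users is very likely to be reported more than once, while an issue affecting one user in twenty will rarely be repeated and will often not appear at all.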
If your user research needs to include the voices of minority segments within your overall audience, it is important to plan for this ahead of time. There are a number of different options at your disposal:
- When selecting your test or survey participants, ensure that you include at least some participants who represent each minority segment. We sometimes refer to this as a stratified sample.
- Run tests or surveys using a lot more participants. This also has the advantage of reducing the overall level of error in your test data.
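If you rely on the second option, a quick calculation shows how large "a lot more" might need to be. The sketch below assumes simple random sampling and asks for the smallest sample that has a given chance of including at least one member of a segment; `participants_needed` is a hypothetical helper, and the 95% target is an arbitrary choice.

```python
import math


def participants_needed(segment_share: float, confidence: float = 0.95) -> int:
    """Smallest random sample likely to contain at least one segment member.

    segment_share: assumed fraction of the population in the segment.
    confidence: desired probability of seeing at least one such person.
    """
    if not 0 < segment_share < 1 or not 0 < confidence < 1:
        raise ValueError("segment_share and confidence must be between 0 and 1")
    # Solve 1 - (1 - share)^n >= confidence for n.
    return math.ceil(math.log(1 - confidence) / math.log(1 - segment_share))


for share in (0.50, 0.10, 0.05):
    print(f"segment of {share:.0%}: {participants_needed(share)} participants")
```

Hearing from a segment once is a low bar, so treat these figures as a floor. Recruiting specifically for each segment, as in the first option, usually reaches the same goal with far fewer sessions.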
Let us now return to the subject of the Mariana Da Silva quotation. Why is it that we don’t have to measure as many people if heights differ greatly between the two populations? Don’t we still need to measure a decent-sized sample, calculate averages and confidence intervals, then carry out some sort of significance tests?
The short answer is: No.
If the two populations are very different—in terms of their distributions of heights—it’s likely we’ll very quickly see that difference reflected in the mean and standard deviation of our test data. For example, let’s assume we’ve measured the heights of ten men from each city and found that the average is different by 10 centimeters, or 4 inches. That’s a large observed difference. But what can we conclude from that? Our initial response might be that it’s likely just an anomaly in the sampling—we just picked taller Londoners.
However, as we measure more people from each city, and the height differential continues to appear, the likelihood that the difference is random chance becomes smaller and smaller very quickly. It just isn’t very likely that we’re randomly but consistently choosing to measure abnormally tall people in London, or abnormally short people in New York.
Now compare this to what happens when an observed difference is very small. With a small difference, it remains plausible longer—with a much, much larger number of people—that it’s due to random chance. Therefore, we need much larger sample sizes before our statistical analysis can conclude that the difference is real.
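The relationship between effect size and sample size can be illustrated with a standard power calculation for comparing two means. This is a sketch under stated assumptions: a normal approximation, equal group sizes, a two-sided 5% significance level, 80% power, and a hypothetical standard deviation of 7 cm for height. None of these values come from the article.

```python
import math
from statistics import NormalDist


def sample_size_per_group(
    difference: float, sd: float, alpha: float = 0.05, power: float = 0.80
) -> int:
    """Approximate people needed per city to detect a given mean difference.

    Uses the normal-approximation formula n = 2 * ((z_a + z_b) * sd / d)^2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    n = 2 * ((z_alpha + z_beta) * sd / difference) ** 2
    return math.ceil(n)


ASSUMED_SD_CM = 7.0  # hypothetical spread of heights within each city
for gap_cm in (10.0, 5.0, 1.0):
    n = sample_size_per_group(gap_cm, ASSUMED_SD_CM)
    print(f"true gap {gap_cm:>4} cm: about {n} people per city")
```

Because the required sample grows with the inverse square of the difference, a gap one-tenth the size needs roughly a hundred times as many people. For very small groups, a t-based calculation would give slightly larger numbers than this approximation.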
Analyzing A/B Tests
Testing or surveying with large numbers of people can be difficult, costly, and, as I mentioned previously, not necessarily valuable. However, some forms of testing, such as A/B testing or online, self-administered surveys, overcome many of these issues. In A/B testing particularly, we can also apply some of the principles I discussed previously to reduce the length and size of a test.
In the early stages of an A/B test, a large difference in performance could be an indicator of a substantial difference between two designs. For example, we might run an A/B test on a Web site and record the results of a small number of users, say 100 to 200. If we observe a large difference, it’s time to shut down the test. If we observe a small difference, we should continue running the test until we’ve observed the behavior of 2,000 to 5,000 users and have brought formal analysis techniques to bear.
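One common formal technique for this kind of comparison is a two-proportion z-test. The following is a minimal sketch of it, not necessarily the analysis the author has in mind; the function name and the conversion counts are made up for illustration.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both designs perform equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("cannot test when no one or everyone converted")
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical early results with 150 users per design.
z_big, p_big = two_proportion_z_test(15, 150, 45, 150)      # 10% vs 30%
z_small, p_small = two_proportion_z_test(15, 150, 18, 150)  # 10% vs 12%
print(f"large gap: z = {z_big:.2f}, p = {p_big:.5f}")
print(f"small gap: z = {z_small:.2f}, p = {p_small:.2f}")
```

In this made-up example, the large gap is already hard to attribute to chance at 150 users per design, while the small gap is not. One caution: checking results repeatedly and stopping at the first significant reading raises the false-positive rate, so an early look is best treated as a rough screen for very large effects.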
Large sample sizes are no guarantee of accuracy, however. A recent California election offers a case in point that is worth reviewing.
Proposition 8: A Case Study in Surveys
In an election on November 4, 2008, the people of California voted on Proposition 8, a ban on gay marriage. As people cast their votes and left the polling centers, exit polls—which are a type of field survey—recorded how 2,300 people had voted just moments before.
Exit polls for Proposition 8 showed a majority (52%) voted against the proposition. Younger people were more likely to be against the proposition than the elderly; college-educated people were also more likely to be against it; and people without a college education were more likely to be in favor.
More importantly, a sample size of 2,300 for the exit polls brought the sampling error to roughly +/- 2%, suggesting a clear likelihood of defeat for Proposition 8.
As the polls closed and the count started to come in, the actual data was completely different. The exact opposite, in fact. When the votes had all been counted, about 52% of voters had endorsed Proposition 8, well outside the sampling error of the exit polls.
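The sampling error for a poll of this size can be approximated with the usual margin-of-error formula for a proportion. The sketch below assumes simple random sampling and a 95% confidence level; real exit polls sample by precinct, which typically widens the margin, so treat the result as a lower bound. The function name `margin_of_error` is hypothetical.

```python
import math
from statistics import NormalDist


def margin_of_error(share: float, n: int, confidence: float = 0.95) -> float:
    """Half-width of the confidence interval for a polled proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * math.sqrt(share * (1 - share) / n)


poll_yes = 0.48  # the exit polls put 52% against, so 48% in favor
moe = margin_of_error(poll_yes, 2300)
upper_bound = poll_yes + moe  # highest Yes share consistent with the poll

print(f"margin of error: +/- {moe:.1%}")
print(f"highest plausible Yes share: {upper_bound:.1%}")
```

Under these assumptions, the poll is consistent with a Yes share of at most about 50%, so a clear Yes majority in the official count cannot comfortably be explained by sampling error alone.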
Clearly something was going on with the survey.
There are a number of different phenomena that may have contributed to this strange result:
- The exit polls may not have been representative of the overall population of California.
- Some people may have reported a vote that was different from the one they actually placed.
While the first case is possible, and we can’t exclude it from our consideration, the second case is also possible and is much more interesting. When confronted by a survey question that touches on topics of some sensitivity or areas of social taboo, people are more likely to choose the response that represents, in their own minds, the answer the interviewer wants to hear. Examples of such topics might include sexual practices, drug taking, needle sharing, and criminal behavior. In surveys, particularly face-to-face surveys, the prevalence of an undesirable activity or behavior tends to be under-reported, unless survey designers and interviewers take great care to ensure that doesn’t happen.
One option is to make it very clear to each respondent that accuracy is important, and there is no right answer. Another option is to let people answer a question in a manner that is more likely to remove the desire to please the interviewer—that is, either through a self-administered survey or—as in this case—using the confidential mechanism of an election.
And this, I believe, is why we witnessed such a discrepancy between the exit poll and the actual voting. When a real person asked how they had voted, people responded by giving what was ostensibly the right answer: that they had voted No on the proposition. However, inside the polling booth, where they had privacy, they voted the way they really felt: Yes.
Usability testing doesn’t always have to be a compromise between certainty and sample size. By understanding the underlying principles at work, we can design our user research to make the most efficient use of our available time and energy and still achieve meaningful results.