Reliability and Dependability in Usability Testing
Published: June 24, 2011
Usability testing is a form of user research insofar as it allows you to draw conclusions about a large population based on observations of a small sample of that population. Essentially, we try to assess a product’s suitability for our marketplace, as well as its usability for the population of interest, by testing it with a group of typical users. Usability testing often involves both quantitative and qualitative data, either of which can be subject to misunderstandings. This column discusses principles of rigorous research as they apply to usability testing, with an emphasis on reliability and dependability.
Attributes of Rigor
The classic attributes of rigor in research are
- internal validity—Are you measuring or observing the right things based on what you have set out to study?
- external validity—Do the conditions of your study represent what would happen in the real world?
- reliability—Can you apply what you observed in the small sample to the larger population?
Anselm Strauss and Juliet Corbin, two social-science researchers, have noted that these attributes of rigor are grounded in quantitative methods, often causing consumers of qualitative research to deem it less reliable than quantitative research. Therefore, they developed three corresponding attributes to distinguish qualitative rigor, calling them credibility, transferability, and dependability, respectively. This column focuses on the areas where I see the most confusion in usability testing: namely, the reliability of quantitative data and the dependability of qualitative data. For a fuller discussion of all of these attributes, please see my book A Research Primer for Technical Communication.
An attribute of quantitative research, reliability assesses how confidently you can assume that the results you saw in your sample accurately represent the results you would see in the larger population of interest. You assess reliability through methods of inferential statistics, which look at the statistics measured, the sample size, and the variability of the data and calculate a probability that the same result would occur consistently across the larger population.
For example, you might test an installation procedure and calculate the average time research participants took to install a product. You might then make some design changes, test them with a different group of participants, and find that the average installation time is lower. Inferential statistics look at the difference in the averages, the sample sizes, and the variability of the data to calculate the odds of getting a similar difference when evaluating the same version of the design twice, simply because you happened to get some slower users in your first sample and some faster users in your second. You might be really excited about the possibility that your changes have reduced the average installation time, until you find out that there is a 35% chance of getting the same result just by testing the original version with two different sample groups.
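As a rough illustration of this reasoning, the sketch below runs an exact permutation test. This is only one of several inferential methods that could apply here; the column does not name a specific test. The function name permutation_p_value and the installation times are hypothetical, invented for this example. The function computes how often a difference at least as large as the observed one would arise if both groups had really used the same design and the split into two groups were pure chance.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(slower_group, faster_group):
    """One-sided exact permutation test on the difference in means.

    Returns the fraction of all possible regroupings of the pooled data
    whose difference in means (first group minus second group) is at
    least as large as the difference actually observed.
    """
    observed = mean(slower_group) - mean(faster_group)
    pooled = list(slower_group) + list(faster_group)
    n_first = len(slower_group)
    n_second = len(pooled) - n_first
    total = sum(pooled)

    at_least_as_large = 0
    regroupings = 0
    # Enumerate every way of choosing which participants land in group one.
    for indexes in combinations(range(len(pooled)), n_first):
        sum_first = sum(pooled[i] for i in indexes)
        difference = sum_first / n_first - (total - sum_first) / n_second
        # Small tolerance guards against floating-point rounding.
        if difference >= observed - 1e-12:
            at_least_as_large += 1
        regroupings += 1
    return at_least_as_large / regroupings


# Hypothetical installation times in minutes (made up for illustration).
original_design = [12.0, 15.5, 9.8, 14.2, 11.1]
revised_design = [10.4, 13.9, 9.1, 12.7, 10.8]

p = permutation_p_value(original_design, revised_design)
print(f"Chance of a difference this large from sampling alone: {p:.0%}")
```

A large value, such as the 35% in the example above, suggests that the apparent improvement could easily be an accident of who happened to be in each sample. Exact enumeration is practical only for the small groups typical of usability tests; with larger groups, a random sample of regroupings is the usual substitute.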
The mistake many usability professionals—and almost all clients—make is that they confuse descriptive statistics with inferential statistics. Descriptive statistics are calculations like averages or ranges for the sample group that apply only to the sample group. For example, you could calculate the average age of your sample group. But that is not necessarily a reliable indication of the average age of your users—at least you wouldn’t know whether it is until you applied the appropriate inferential method. Because many usability tests use small sample sizes, we cannot use their descriptive statistics as reliable indications of what is going on in the population at large.
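The gap between the two kinds of statistics can be sketched in a few lines. The sample average below is descriptive; the interval around it is one possible inferential step, here a percentile bootstrap, which is an assumption of this example rather than a method the column prescribes. The function name bootstrap_mean_interval and the ages are hypothetical. Bootstrap intervals built from very small samples tend to be somewhat too narrow, so treat the result as illustrative only.

```python
import random


def bootstrap_mean_interval(sample, confidence=0.95, n_resamples=10_000, seed=0):
    """Percentile bootstrap interval for the population mean.

    Resamples the data with replacement many times, records the mean of
    each resample, and returns the central `confidence` share of those means.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    resampled_means = sorted(
        sum(rng.choices(sample, k=n)) / n for _ in range(n_resamples)
    )
    tail = (1 - confidence) / 2
    lower = resampled_means[int(tail * n_resamples)]
    upper = resampled_means[int((1 - tail) * n_resamples) - 1]
    return lower, upper


# Hypothetical ages of six test participants (made up for illustration).
ages = [24, 31, 29, 45, 38, 52]

sample_average = sum(ages) / len(ages)        # descriptive: true of this sample only
low, high = bootstrap_mean_interval(ages)     # inferential: plausible range for all users
print(f"Sample average age: {sample_average:.1f}")
print(f"Plausible range for the average age of all users: {low:.1f} to {high:.1f}")
```

With only a handful of participants, the range is typically wide, which is the practical reason a small sample’s average says so little about the population.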
Do not use usability data to make inferences about the larger population unless you have employed an appropriate inferential statistical method to determine the reliability, or statistical significance, of the findings. Avoid discussing the descriptive statistics with a client who might jump to unreliable inferences. For example, if I do a comparative study and the revised version seems to be better or faster, but the results are not statistically significant, I do not communicate that result to my client. If the client asks, I typically answer that I can’t tell from such a small sample.
Often, the most informative data from a usability test are qualitative. But as Corbin and Strauss have found, there is an inherent distrust of conclusions from qualitative data because of their subjectivity. You may get the objection, “How can you make a decision based on such a small sample size?” For one, sample size does not affect the dependability of qualitative data. Often, a sample of one can give dependable results—as I’ll show in my discussion of a phenomenon called the click of recognition.
What usability professionals and clients need to understand is that qualitative studies are typically illuminative—that is, they provide insight and understanding about the problem. We carry our own biases into a study based on our in-depth knowledge of a product or on our personal world view. Users act as lenses, helping us filter out our preconceptions. One of the strengths of usability testing is that it lets us see our product through fresh eyes.
But qualitative studies are interpretive—that is, we still must take in our observations of users through our own filters and biases. One of the best safeguards against this limitation of qualitative studies is relying on multiple observers or multiple methods of research to get at the same information. Using multiple research perspectives to look at the same thing is called triangulation of data. For example, if three different observers watch the same user and conclude that the user is frustrated because the Submit button is outside their view, that is a more dependable conclusion than if only one observer arrived at that conclusion.
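If observers record their findings separately, a simple tally can show which conclusions are corroborated. The sketch below is one hypothetical way to do that bookkeeping; the function name corroborated_findings, the observer labels, and the notes are all invented for illustration, and the sketch assumes observers have already agreed on common wording for each finding.

```python
from collections import defaultdict


def corroborated_findings(notes_by_observer, min_observers=2):
    """Return findings reported independently by at least `min_observers` observers."""
    observers_per_finding = defaultdict(set)
    for observer, findings in notes_by_observer.items():
        for finding in findings:
            observers_per_finding[finding].add(observer)
    return {
        finding: sorted(observers)
        for finding, observers in observers_per_finding.items()
        if len(observers) >= min_observers
    }


# Hypothetical notes from three observers watching the same session.
notes = {
    "observer_a": ["submit button outside view", "field label unclear"],
    "observer_b": ["submit button outside view"],
    "observer_c": ["submit button outside view", "page loads slowly"],
}

for finding, observers in corroborated_findings(notes).items():
    print(f"{finding}: seen by {len(observers)} observers")
```

Findings that only one observer noted are not discarded by this tally; they are simply candidates for a second look or a second method.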
These concepts of user-as-lens and of research data being illuminative rather than probative (that is, proving or disproving a hypothesis) discount the need for large numbers of participants to make a finding dependable, as the following discussion illustrates.
The Click of Recognition
Louise Kidder, an authority on qualitative research, describes a phenomenon qualitative researchers sometimes experience that she calls the click of recognition. In the context of usability testing, we can recognize this as hearing a user say something or seeing a user do something and, figuratively, having a light go on in our heads. We have a clarifying moment or epiphany when we slap ourselves on the forehead and say, “Of course!” These Aha! moments occur because, in usability testing, participants let us see an application through fresh eyes. A widget or message that seemed crystal clear to us suddenly becomes vague or ambiguous when we see it from someone else’s frame of reference.
Let me give some examples from a writer’s perspective, then I’ll describe a hierarchy of clicks for dependability.
Let’s say I’ve written something and give it to my wife to look at. She sees a misspelled word and points it out. Do I say, “Thanks, but let me have twelve other editors look at it, too”? No. It’s clearly wrong, and I can see it’s wrong. I was just too close to it when I wrote it and didn’t catch it. Her fresh eyes did, and I make the change based on an n of 1. In usability, the equivalent is a user interface bug where I failed to apply a known and widely accepted best practice. It takes just one user’s stumbling on it to trigger a click of recognition.
Now, my wife keeps reading and comes across this sentence: “Tom told Dick to fire Harry, and it made him mad.” She says, “I’m a bit confused about which of these characters you mean by ‘him.’ Was Tom mad because he had to tell Dick how to do his supervisor’s job, was Dick mad because Tom was making him do the dirty work, or was Harry mad because he was getting fired?” What I was referring to was obvious to me when I wrote it, because I knew the details. But now that I have my wife as a lens, I can see how ambiguous the referent is. Do I need to get another opinion? No, now that I can feel someone else’s reasonable confusion, I have a click of recognition.
But then she goes on to say, “By the way, Times New Roman is so yesterday, you should use a different font.” I thank her and make a mental note to get some other opinions about that before changing it. No click.
The following is a hierarchy of clicks, starting at the most concrete and dependable level and moving to the more abstract:
- You knew better. It was just a mistake you didn’t catch. Some examples include links that don’t go where they are supposed to go and misspelled words. There’s no need to contemplate whether to make the change—you just do it.
- You were seeing the application through your deeper understanding of its structure. Common instances at this level include elements of a user interface that initially seemed obvious to you, but which suddenly look vague or nonintuitive once you see a user stumble. For example, you might show someone’s name on a Web page as a hyperlink, so a user can send an email message to that person by just clicking the name. But during a usability test, a participant clicks the link, voicing her expectation that she’ll navigate to a bio about the person. When her email client opens, it surprises her. Based on that one observation, you decide to add an email icon to clarify the link.
- You were seeing the application through your own world view. Seeing a context from the different perspective of even one user can lead to a click of recognition. For example, for many years, I quoted a statistic about abandoned shopping carts in ecommerce. I cited it as an example of how poor usability was resulting in lost sales. During a usability test, however, I saw that, for a configurable product such as a notebook computer, the only way a user could price and compare options was to go through the purchase workflow. My world view of shopping cart and cash register had made me believe a design flaw was blocking a committed decision to buy. In my real-world experience, someone with a product in a shopping cart, standing in line at a register, has made a decision to buy that thing. People don’t put items in a cart, then go to a cash register for a price check. But that is exactly what was happening in this ecommerce scenario. So, for one thing, the problem was not as serious as I had thought, because the person had not committed to buy. And the solution was very different: making it easier to price and compare.
I saw a phenomenon similar to the click of recognition when I was doing my doctoral research on cross-functional teams conducting usability tests. In that case, the team would watch a user struggle with a feature, then someone would say something like “I had that same problem.” Others would readily admit to having their own struggles with the same feature, but no one had ever brought it up before. The problem was that the members of the team had individually discounted themselves as dumb users when they had made that mistake. Seeing a user make the same mistake made it okay to admit to making it and talk about it.
There is an important difference between this phenomenon—which I call empathetic validation—and the click of recognition. With a click of recognition, your world view suddenly gets shifted, as though the believability of the data is so compelling it’s like having a bucket of water thrown in your face. On the other hand, an empathetic validation reinforces a belief you already held. This doesn’t mean it lacks dependability, but it is your filter picking up on something the user says or does that aligns with your current world view.
In the case of the subjects in my doctoral study, one of the things that added to the dependability of the data was the concurrence of multiple members of the team who had encountered the same issue. I would advise a bit of caution, however, if you are observing someone alone and experience an empathetic validation. Challenge it a bit, and look for other corroborating evidence before you depend too much on it.
Even experienced usability professionals can get ahead of their data—or at least let their clients overextend its reliability. Do not discuss descriptive statistics with a client until you have applied the appropriate inferential tests to see whether they allow you to make reliable predictions about the larger population of users. And when working with qualitative data, it will sometimes be hard to explain to someone accustomed to the classical experimental model of research why you are willing to change a design based on a small number of observations. The easiest and best solution to that problem is to invite critical stakeholders to be first-hand observers during a study. People can understand a click of recognition more easily if they experience it themselves rather than through your explaining it to them.
References
Kidder, L. H. “Face Validity from Multiple Perspectives.” In D. Brinberg and L. Kidder, eds., New Directions for Methodology of Social and Behavioral Sciences: Forms of Validity in Research. San Francisco: Jossey-Bass, 1982.
Hughes, M., and G. Hayhoe. A Research Primer for Technical Communication: Methods, Exemplars, and Analyses. New York: Lawrence Erlbaum Associates, 2008.