## The Starting Point: The Need for a Measurement

Are you more comfortable working with qualitative data than quantitative data? If so, you’re like most UX people—including me. Once we’ve seen three or four test participants in a row fail for the same reason, we just want to get on with fixing the problem.

But sooner or later, we’ll have to tangle with some quantitative data. Let’s say, for example, that we have this goal for a new product: *On average, we want users to be able to do a key task within 60 seconds.* We’ve fixed all the show-stoppers and tested with eight participants—all of whom can do the task. Yay! But have we met the goal? Assuming we remembered to record the time it took each participant to complete the task, we might have data that looks like this:

| Participant | Time to Complete Task (in seconds) |
|---|---|
| A | 40 |
| B | 75 |
| C | 98 |
| D | 40 |
| E | 84 |
| F | 10 |
| G | 33 |
| H | 52 |

To get the arithmetic average—which statisticians call the *mean*—you add up all the times and divide by the number of participants. Or use the AVERAGE formula in Excel. Either way, the average time for these participants was 54.0 seconds. Figure 1 shows the same data with the average as a straight line in red.
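The same calculation takes only a couple of lines of code. Here's a minimal Python sketch, using the times from the table above:

```python
# Time on task (in seconds) for participants A through H.
times = [40, 75, 98, 40, 84, 10, 33, 52]

# The arithmetic mean: the sum of the values divided by the count.
mean = sum(times) / len(times)
print(mean)  # 54.0
```

This is exactly what Excel's AVERAGE formula does behind the scenes.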

So, can we relax and plan the launch party?

Well, maybe. If our product has only eight users, then we’ve tested with all of them, and yes, we’re done. But what if we’re aiming at everyone? Or, let’s say we’re being more precise, and we’ve defined our target market as follows: *English-speaking Internet users in the US, Canada, and UK.* Would the data from eight test participants be enough to represent the experience of all users?

## True Population Value Compared to Our Sample

Our challenge, therefore, is to work out whether we can consider the average we’ve calculated from our sample as representative of our target audience.

Or to put that into Tullis and Albert’s terms: in this case, our average is the statistic, and we want to use that data to estimate the *true population value*—that is, the average we would get if we got everyone in our target audience to try the task for us.

One way to improve our estimate would be to run more usability tests. So let’s test with eight more participants, giving us the following data:

| Participant | Time to Complete Task (in seconds) |
|---|---|
| I | 130 |
| J | 61 |
| K | 5 |
| L | 53 |
| M | 126 |
| N | 58 |
| O | 117 |
| P | 15 |

Then, we can calculate a new mean.

Oh, dear… For this sample, the arithmetic average comes out to 70.6 seconds, so we’ve blown our target. Perhaps we need to run more tests or do more work on the product design. Or is there a quicker way?
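To see the two sample means side by side, here's a minimal Python sketch using the values from both tables:

```python
sample_1 = [40, 75, 98, 40, 84, 10, 33, 52]    # participants A through H
sample_2 = [130, 61, 5, 53, 126, 58, 117, 15]  # participants I through P

# Compute each sample's arithmetic mean.
mean_1 = sum(sample_1) / len(sample_1)
mean_2 = sum(sample_2) / len(sample_2)
print(mean_1, mean_2)  # 54.0 70.625
```

Two samples from the same target audience, two noticeably different means. This sampling variability is exactly what the next section addresses.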

## Arithmetic Averages Have a Bit of Magic: The Central Limit Theorem

Luckily for us, means have a bit of magic: a special mathematical property that may get us out of taking the obvious but expensive course—running a lot more usability tests.

That bit of magic is the *Central Limit Theorem*, which says: If you take a bunch of samples, then calculate the mean of each sample, most of the sample means cluster close to the true population mean.

Let’s see how this might work for our time-on-task problem. Figure 2 shows data from ten samples: the two we’ve just been discussing, plus eight more. Nine of these samples met the 60-second target; one did not. The individual times range from about 10 to 130 seconds, but the means fall within a much narrower range.

The chance that any individual mean is way off from the true population mean is quite small. In fact, the Central Limit Theorem also says that means are normally distributed, as in the bell-curve normal distribution shown in Figure 3.
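We can watch the Central Limit Theorem in action with a quick simulation—a sketch assuming a hypothetical, heavily skewed population of task times (the distribution and numbers here are illustrative, not from the samples above):

```python
import random
import statistics

random.seed(1)

# A hypothetical, deliberately non-normal population of task times,
# skewed toward short times with a long tail (mean around 60 seconds).
population = [random.expovariate(1 / 60) for _ in range(100_000)]
population_mean = statistics.mean(population)

# Draw many samples of 8 participants and record each sample's mean.
sample_means = [statistics.mean(random.sample(population, 8))
                for _ in range(1_000)]

# Most sample means cluster close to the true population mean,
# even though the population itself is heavily skewed.
print(round(population_mean, 1))
print(round(statistics.mean(sample_means), 1))
```

Plot a histogram of `sample_means` and you'll see the bell curve emerge, just as Figure 3 shows.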

Normal distributions also have very convenient mathematical properties:

- Two things define them:
  - where the peak is—that is, the *mean*, which is also the most likely value
  - how spread out the values are—which the *standard deviation*, also known as *sigma*, defines
- The probability of getting any particular value depends on only these two parameters—the mean and the standard deviation.

Figure 4 shows two normal distributions. The one on the left has a smaller mean and standard deviation than the one on the right.
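Because the mean and standard deviation fully define a normal distribution, any probability can be computed from just those two numbers. Here's a minimal sketch using Python's standard-library `statistics.NormalDist` (available in Python 3.8+); the two distributions are hypothetical stand-ins for those in Figure 4:

```python
from statistics import NormalDist

# Two normal distributions, as in Figure 4: the mean positions the peak,
# and the standard deviation (sigma) controls the spread.
narrow = NormalDist(mu=50, sigma=10)
wide = NormalDist(mu=80, sigma=25)

# The probability of a value landing in any range depends only on these
# two parameters—for example, the classic "within one sigma" rule:
p = narrow.cdf(60) - narrow.cdf(40)
print(round(p, 3))  # 0.683
```

About 68% of values fall within one standard deviation of the mean, for *any* normal distribution—a property we'll lean on when we build confidence intervals.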