Get hands-on practice in all the key areas of UX and prepare for the BCS Foundation Certificate.
I was at a one-day conference about usability testing last week, and over coffee I eavesdropped on a heated discussion about the number of participants needed for a usability test. One person insisted that, to find usability problems, 5 was the magic number. The other insisted that, to have any statistical validity, you needed 20 or so.
The argument wasn’t resolved and the session restarted. This was a shame as I wanted to tell them that they were both right.
One problem with the ‘magic number’ discussion is that people don’t often realise they are making a category error: they are discussing different kinds of thing. One person construes a usability test as a stage in an iterative design process where the aim of the test is to find usability problems that can then be fixed. The other person construes a usability test as a measure of the efficacy of the design solution: is the system easy to use?
There is a place for both kinds of test, but there are several differences between them, and the number of participants is just one. I’m going to call the first kind of test ‘formative’ testing, since its purpose is to shape or mould the user interface, and the second kind ‘summative’ testing, since its aim is to summarise, as in a court of law. Now let’s look at the differences.
Let’s deal with the number of participants first. The ‘magic number 5’ question has been discussed to death for formative tests. My own view is that you’ll learn something even if you test one user, so I’m happy to go on record and say that just one participant is better than none. But most people in the field would agree that Jakob Nielsen’s oft-quoted number of 5 is as good as any other. In fact, if your aim is to find the maximum number of usability problems, research shows that you’re better off increasing the number of tasks that participants carry out rather than focusing solely on the number of participants in the study. That makes me think the whole number 5 issue has long outstayed its welcome.
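To see where a number like 5 comes from, here is a minimal sketch of the problem-discovery model usually attributed to Nielsen and Landauer: the share of problems seen at least once is 1 - (1 - p)^n for n participants. The function name is mine, and the default detection rate of 0.31 is an assumption taken from their commonly cited average; the real rate varies from product to product and from task to task.

```python
def proportion_of_problems_found(participants: int, detection_rate: float = 0.31) -> float:
    """Expected share of usability problems seen at least once.

    detection_rate is the assumed chance that a single participant
    reveals any given problem (0.31 is an assumption, not a constant).
    """
    if participants < 0:
        raise ValueError("participants must be non-negative")
    if not 0.0 <= detection_rate <= 1.0:
        raise ValueError("detection_rate must be between 0 and 1")
    return 1.0 - (1.0 - detection_rate) ** participants


# Print the expected coverage for a few study sizes.
for n in (1, 3, 5, 10, 15):
    print(f"{n:>2} participants: {proportion_of_problems_found(n):.0%}")
```

Under that assumption, 5 participants uncover roughly 84% of the problems, and each extra participant adds less than the one before. The model says nothing about how many tasks each participant attempts, which is one reason not to treat its output as a target.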
But 5 is clearly insufficient for a summative test. Here you’ll want to calculate statistics like average task time, and although you can do stats on small participant samples, your analysis will be restricted to detecting very large differences. So to stand more chance of detecting small differences in summative tests, you should aim for a much larger participant sample: Jakob Nielsen recommends 20 for these types of test; when testing everyday products for usability, ISO 20282 recommends a sample size of 50. If you're used to 5-participant tests, these sample sizes might sound unachievable but in fact it's simple to achieve these numbers with remote, unmoderated usability testing.
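A rough way to see why small samples only detect large differences is to look at the margin of error around a mean task time, which shrinks with the square root of the sample size. The sketch below assumes task times are roughly normally distributed and uses a made-up standard deviation of 30 seconds; the critical values are the usual two-sided 95% values from a t table.

```python
import math

# Two-sided 95% critical values of Student's t for n - 1 degrees of freedom,
# copied from standard tables for the three sample sizes discussed above.
T_CRITICAL_95 = {5: 2.776, 20: 2.093, 50: 2.010}


def margin_of_error(sample_sd: float, n: int) -> float:
    """Half-width of a 95% confidence interval around a mean task time."""
    return T_CRITICAL_95[n] * sample_sd / math.sqrt(n)


ASSUMED_SD_SECONDS = 30.0  # hypothetical spread in task times

for n in sorted(T_CRITICAL_95):
    moe = margin_of_error(ASSUMED_SD_SECONDS, n)
    print(f"n={n:>2}: mean task time known to within about {moe:.1f} s either way")
```

With these assumed figures the margin is about 37 seconds for 5 participants, 14 seconds for 20 and 9 seconds for 50, so a 15-second improvement between two designs would be lost in the noise of a 5-participant test.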
But an important second difference between formative and summative testing is the methodology itself. With formative tests, you want participants to think out loud, describe what it is they are trying to do and let you know when they’re confused. That’s the moderator’s key role in a usability test: to listen and to remind the user to keep thinking aloud.
With summative tests, your main interest is in the statistics of participants’ behaviour: how long do they take on a task? Are they successful? How many errors do they make? So for summative tests, a moderator’s presence is a distraction. In fact, there’s an old joke that says you can run a summative usability test with one man and a dog. The dog’s role is to make sure the man stays quiet, and the man’s role is to feed the dog.
So, in contrast to formative tests, you’re better off running summative tests with participants working alone, either in a lab or remotely, over the Internet.
Both formative and summative tests can be run in a lab or they can be run remotely over the Internet. But in our experience, remote, unmoderated tests (think Loop11 or our own managed benchmarking service) are useful only for summative testing. This is because formative tests really need a moderator alongside the participant (either in a lab or virtually) to ensure the participant stays on task and keeps thinking aloud.
Because of the different objectives of the two kinds of test, they also have different requirements for data collection. With formative tests, your aim is to find usability problems: do participants struggle with the system’s terminology? Can they navigate the system to achieve their goals? Do they understand the search results? One of my favourite ways of logging these issues is to delegate it: assuming I have a room full of observers, I’ll ask each one to note down each problem they observe on a sticky note. Completed sticky notes get placed on the wall and I’ve been known to offer a free prize to the person who spots the most.
But this approach won’t work with summative tests. Here you need an experienced, dedicated data logger. This person will know exactly when the task officially starts and stops and will use these events to trigger a timer (to get accurate measures of time on task). He or she will also have a list of criteria that define task success, and the participant will need to tick every box to be judged successful on the task. By having a dedicated data logger, you can be sure these measures will be collected consistently for every participant in the study, making the usability metrics you calculate robust and reliable.
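If your data logger works on a laptop rather than with a stopwatch and clipboard, the record for each task might look something like the sketch below. The class and field names are hypothetical; the point is simply that the timer is driven by explicit start and stop events and that success means every criterion has been ticked.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TaskObservation:
    """One participant's attempt at one task, as recorded by the data logger."""

    participant: str
    task: str
    success_criteria: Dict[str, bool] = field(default_factory=dict)
    started_at: Optional[float] = None
    stopped_at: Optional[float] = None

    def start(self, timestamp: float) -> None:
        self.started_at = timestamp

    def stop(self, timestamp: float) -> None:
        if self.started_at is None:
            raise RuntimeError("task was never started")
        if timestamp < self.started_at:
            raise ValueError("stop time is before start time")
        self.stopped_at = timestamp

    def tick(self, criterion: str) -> None:
        if criterion not in self.success_criteria:
            raise KeyError(f"unknown criterion: {criterion}")
        self.success_criteria[criterion] = True

    @property
    def time_on_task(self) -> float:
        if self.started_at is None or self.stopped_at is None:
            raise RuntimeError("task has not been started and stopped")
        return self.stopped_at - self.started_at

    @property
    def successful(self) -> bool:
        # Success requires at least one criterion, and every one ticked.
        return bool(self.success_criteria) and all(self.success_criteria.values())


# Hypothetical session; in a live test you would pass time.monotonic() instead.
observation = TaskObservation(
    participant="P01",
    task="Book a flight",
    success_criteria={"chose the correct date": False, "reached confirmation page": False},
)
observation.start(0.0)
observation.tick("chose the correct date")
observation.stop(94.5)
print(observation.time_on_task, observation.successful)
```

Because the same criteria and the same start and stop rules apply to every participant, the resulting times and success flags can be compared across the whole study.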
A further difference between the two types of test is how often you run them and when in the development process you run them. On most projects, you should be running formative tests monthly (or during each sprint if you’re using an agile-based approach). You can use formative tests to assess what Jeremy Clark calls pretotypes, where you simulate the core experience of using a product with the smallest possible investment of time and money. Examples include paper prototypes, electronic prototypes and other minimum viable products. Formative tests with pretotypes will help you evaluate the assumptions you’re making about the system and help you fix usability issues while they are still cheap to fix, rather than waiting until release.
In contrast, summative tests are difficult to carry out unless you have a working system: measuring time on task with a paper prototype is a little meaningless. This means you’ll run fewer summative tests and you’ll probably run them towards the end of development. Having said that, you might still be running some summative tests at the very beginning of the project: for example, you’ll want to test earlier releases of the system or competitor products to provide you with benchmark values.
For formative tests, data analysis tends to be qualitative in nature. At the end of a day’s testing, I ask the observers who’ve been busily creating sticky notes to organise them into groups. This helps us identify the underlying usability themes (such as confusing search results, poor content quality or lack of task focus). At a later meeting, we’ll assign each usability issue a severity rating and propose a fix before entering each one into a bug list so that the development team gets it fixed. I might also do a back-of-the-envelope calculation of the success rate: even with a participant pool of 5, a task that none of the participants manages to complete probably needs more work than a task where everyone is successful.
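If you type the sticky notes up afterwards, the tallying really is back-of-the-envelope work. Here is a minimal sketch, with invented notes and task results, that counts notes per theme and lists tasks from worst success rate to best.

```python
from collections import Counter

# Hypothetical sticky notes after grouping: (theme, what the observer saw).
sticky_notes = [
    ("confusing search results", "P1 could not tell adverts from results"),
    ("confusing search results", "P3 ignored the first page of results"),
    ("poor content quality", "P2 said the help text was out of date"),
    ("confusing search results", "P4 searched again with the same words"),
    ("lack of task focus", "P5 got lost in promotional pages"),
]

# Hypothetical task outcomes: one True/False per participant.
task_results = {
    "find a product": [True, True, True, True, True],
    "compare two products": [True, False, True, False, False],
    "change delivery address": [False, False, False, False, False],
}


def count_by_theme(notes):
    """Number of sticky notes in each group, largest group first."""
    return Counter(theme for theme, _ in notes).most_common()


def success_rates(results):
    """Share of participants who completed each task, worst task first."""
    rates = {task: sum(outcomes) / len(outcomes) for task, outcomes in results.items()}
    return sorted(rates.items(), key=lambda item: item[1])


for theme, count in count_by_theme(sticky_notes):
    print(f"{count} notes: {theme}")
for task, rate in success_rates(task_results):
    print(f"{rate:.0%} completed: {task}")
```

With only 5 participants these percentages are a prompt for discussion rather than a measurement, which is all a formative test needs.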
For summative tests, data analysis requires some statistical work: you’ll need to calculate time on task, measures of success rate (and their associated variances) and perhaps an analysis of questionnaire data to get a measure of satisfaction. You might even combine these individual measures to create an overall 'UX score' for the system.
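As a sketch of that statistical work, the code below summarises one task using Python's standard library. The combined score at the end is purely illustrative: averaging success rate, a time-against-target ratio and a 0 to 100 satisfaction score is my own assumption about how such a score might be built, not an established formula, and the target time is a hypothetical parameter.

```python
import statistics


def summarise_task(times_seconds, completed):
    """Descriptive statistics for one task across all participants."""
    n = len(completed)
    success_rate = sum(completed) / n
    return {
        "mean_time": statistics.mean(times_seconds),
        "time_variance": statistics.variance(times_seconds),  # sample variance (n - 1)
        "success_rate": success_rate,
        # Variance of the estimated proportion, p(1 - p) / n.
        "success_variance": success_rate * (1 - success_rate) / n,
    }


def ux_score(success_rate, mean_time, target_time, satisfaction):
    """Hypothetical 0-100 score: the plain average of three components."""
    effectiveness = 100 * success_rate
    # Full marks if the mean time meets the target, scaled down if slower.
    efficiency = 100 * min(1.0, target_time / mean_time)
    return (effectiveness + efficiency + satisfaction) / 3


# Invented data for four participants on one task.
summary = summarise_task([60, 80, 100, 120], [True, True, True, False])
score = ux_score(summary["success_rate"], summary["mean_time"],
                 target_time=90, satisfaction=70)
print(summary)
print(round(score, 1))
```

Whatever weighting you choose, keep it fixed between test rounds, otherwise changes in the score will reflect the formula rather than the design.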
The final step is to present the results. For formative tests, this could take the form of a highlights video, a workshop, a list of issues in Excel or simply a meeting with observers after the last participant has left. We often create reports for clients but in my experience most of the value comes from the face-to-face meetings, so don’t feel bad about getting out of the deliverables business.
For summative tests, you’ll want to create a user experience dashboard so that you can monitor progress over time or compare the results with the competition. This is important for benchmarking user experience and using your results to influence business decisions.
Having this two-test framework to refer to will help you make decisions about many aspects of planning a usability test, from how to present your data through to that old chestnut, the number of participants you should test. At the very least, it should act as a reminder that each test type is unique and specialised: this should help you avoid a ‘catch all’ approach where you try to get formative, summative and even marketing data from the same test ("We’ve got the customer in the lab, so let’s get everything we can out of them!").
Dr. David Travis (@userfocus) has been carrying out ethnographic field research and running product usability tests since 1989. He has published three books on user experience including Think Like a UX Researcher. If you like his articles, you might enjoy his free online user experience course.
copyright © Userfocus 2018.