Measuring user satisfaction
A common mistake made by novice usability test moderators is to think that the aim of a usability test is to elicit a participant's reactions to a user interface. Experienced test moderators realise that a participant's reaction is just one measure of usability. To get the complete usability picture, we also need to consider effectiveness (can people complete their tasks?) and efficiency (how long do people take?).
These dimensions of usability come from the International Standard, ISO 9241-11, which defines usability as:
"Extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use."
The ISO definition of usability makes it clear that user satisfaction is just one important dimension of usability. People may be well disposed to a system but fail to complete business-critical tasks with it, or do so in a roundabout way. The three measures of usability (effectiveness, efficiency and satisfaction) are independent, and you need to measure all three to get a rounded measure of usability.
Importance of collecting satisfaction measures
A second mistake made by people new to the field of usability is to measure satisfaction by using a questionnaire only (either at the end of the session or on completion of each task). There are many issues to consider when designing a good questionnaire, and few usability questionnaires are up to scratch.
For example, we've known for over 60 years that you need to avoid the "acquiescence bias": the fact that people are more likely to agree with a statement than disagree with it (Cronbach, 1946). This means that you need to balance positively-phrased statements (such as "I found this interface easy to use") with negative ones (such as "I found this interface difficult to navigate"). So it's surprising that two commonly used questionnaires in the field of usability suffer from just this problem. Every question in both the Usefulness, Satisfaction, and Ease of use (USE) questionnaire and the Computer System Usability Questionnaire (CSUQ) is positively phrased, which means the results from them are biased towards positive responding.
Questionnaires that avoid this source of bias often suffer from other weaknesses. For example, few undergo tests of reliability. This means that the same questionnaire may yield different results at different times (this can be checked by measuring the questionnaire's test-retest reliability). Even fewer usability questionnaires are assessed for validity. This means that there is no guarantee that the questionnaire actually measures user satisfaction.
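One common way to check test-retest reliability is to give the same questionnaire to the same people on two occasions and correlate the two sets of scores. The Python below is a minimal sketch of that idea using a Pearson correlation. The function name and the scores are hypothetical and are not taken from any real questionnaire.

```python
from math import sqrt

def pearson_r(first, second):
    """Pearson correlation between two equal-length lists of scores."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_a = sum(first) / len(first)
    mean_b = sum(second) / len(second)
    # Sum of cross-products and sums of squared deviations from the means
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(first, second))
    var_a = sum((a - mean_a) ** 2 for a in first)
    var_b = sum((b - mean_b) ** 2 for b in second)
    if var_a == 0 or var_b == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / sqrt(var_a * var_b)

# Hypothetical questionnaire scores from the same five people,
# collected on two separate occasions
time_1 = [72, 85, 60, 90, 78]
time_2 = [70, 88, 65, 87, 80]
print(round(pearson_r(time_1, time_2), 2))
```

A value close to 1 suggests the questionnaire gives similar results each time. A low value suggests the scores depend heavily on when you happen to ask.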
Problems with measuring satisfaction
In our studies, we notice that participants tend to rate an interface highly on a post-test questionnaire even when they fail to complete many of the tasks. I've spoken to enough of my colleagues at conferences and meetings to know that this problem is commonplace. Is this because we are about to give the participant £75 for taking part in a test session or is there something else at work? For example, one group of researchers makes this point:
"In studies such as this one, we have found subjects reluctant to be critical of designs when they are asked to assign a rating to the design. In our usability tests, we see the same phenomenon even when we encourage subjects to be critical. We speculate that the test subjects feel that giving a low rating to a product gives the impression that they are "negative" people, that the ratings reflect negatively on their ability to use computer-based technology, that some of the blame for a product's poor performance falls on them, or that they don't want to hurt the feelings of the person conducting the test." - Wiklund et al (1992).
Once you ask participants to assign a number to their experience, their experience suddenly becomes better than it actually was. We need some way of controlling this tendency.
The Microsoft Desirability Toolkit
There are alternatives to measuring satisfaction with a questionnaire. A few years back, researchers at Microsoft developed the "Desirability Toolkit" (Benedek and Miner, 2002). This comprised a series of 118 "product reaction cards", containing words like "Consistent", "Sophisticated" and "Useful". On completion of a usability test, participants were asked to sort through the cards and select the five cards that most closely matched their personal reactions to the system they had just used.
The five selected cards then became the basis of a post-test guided interview. For example, the interviewer would pick one of the cards chosen by the participant and say, "I see that one of the cards you selected was 'Consistent'. Tell me what was behind your choice of that word".
I've used this approach in several usability studies and what has struck me is the fact that it helps elicit negative comments from participants. This methodology seems to give participants "permission" to be critical of the system. Not only do participants choose negative as well as positive adjectives, they may also place a "negative" spin on an otherwise "positive" adjective. For example, "Sophisticated" at first sounds positive but I have had participants choose this item to mean, "It's a bit too sophisticated for my tastes".
An alternative implementation
Asking people to sort through a set of product reaction cards adds a level of complexity to the implementation that's not really necessary. In our studies, we now use a simple paper checklist of adjectives. We first ask people to read through the words and select as many as they like that they think apply to the interface. We then ask the participant to circle just five adjectives from those chosen, and these adjectives become the basis of the post-test guided interview.
Customising the word list
The precise adjectives are not set in stone. Remember, this is a technique to help participants categorise their reactions to an interface that you then explore in more depth in the post-test guided interview. This means that, for a particular study, you should replace some of the words with others that may be more relevant. For example, if we were usability testing a web site for a client whose brand values are "Fun, Value for Money, Quality and Innovation", we would replace four of the existing adjectives with those. (This makes for an interesting discussion with the client when participants don't select those terms. It gets even more interesting if participants choose antonyms to the brand values, such as "Boring", "Expensive", "Inferior" and "Traditional"). This is similar to Brand Tags: whatever people say a brand is, is what it is.
How to analyse the data
The real benefit of this approach is in the way it uncovers participant reactions and attitudes. You get a depth of understanding and an authenticity in participants' reactions that just can't be achieved with traditional questionnaires and surveys. So this approach is ideal as a qualitative approach to guide an interview.
But you can also derive metrics from these data. Here's how.
The simplest measure is to count up the number of times a word was chosen by participants. In our studies, we find that we get a fair amount of consistency in the words chosen. For example, Figure 1 shows a word cloud from the results we obtained from a recent 12-participant usability test.
Figure 1: Example word cloud. The larger the font size and the greater the contrast, the more frequently participants selected the adjective.
Participants could choose from a corpus of 103 words but some words were selected more often (such as "Easy to use", which was selected by half the participants). The font size of each item in the word cloud is directly proportional to the number of times the adjective was selected (the Figure also shows less frequently selected adjectives in lower contrast text). If you don't feel comfortable hacking Word to create a word cloud, use the excellent Wordle, a web site that will make these word clouds for you and provides lots of control over the font used and the placement of text.
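As a rough sketch of the tally behind a word cloud like this, the Python below counts how many participants chose each adjective and scales the font size in direct proportion to that count. The function names, the 48-point maximum and the sample selections are illustrative assumptions, not values from our study.

```python
from collections import Counter

def tally_adjectives(selections):
    """Count how many participants chose each adjective.

    `selections` holds one list of adjectives per participant.
    """
    counts = Counter()
    for chosen in selections:
        # set() stops a word being counted twice for the same participant
        counts.update(set(chosen))
    return counts

def font_sizes(counts, max_pt=48):
    """Scale font size in direct proportion to the selection count.

    The most frequently chosen adjective gets `max_pt` (an assumed value).
    """
    top = max(counts.values())
    return {word: max_pt * n / top for word, n in counts.items()}

# Hypothetical selections from three participants
selections = [
    ["Easy to use", "Consistent", "Useful"],
    ["Easy to use", "Sophisticated"],
    ["Easy to use", "Useful"],
]
counts = tally_adjectives(selections)
sizes = font_sizes(counts)
for word, n in counts.most_common():
    print(f"{word}: chosen by {n}, {sizes[word]:.0f}pt")
```

The same counts could drive the contrast of each word as well as its size, so that rarely chosen adjectives fade into the background.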
Verbal protocol analysis
A more robust statistic can be derived from carrying out a verbal protocol analysis of the guided interview where the participant discusses the reasons for his or her choice of words. This simply means listening to the post-test interview and coding each participant's comments. The simplest way to do this is to divide a piece of paper into two columns and write "Positive" at the top of one column and "Negative" at the top of the other column. Listen to the interview (either live or recorded) and every time you hear the participant make a positive comment about the interface, place a mark in the "Positive" column. Every time you hear the participant make a negative comment about the interface, place a mark in the "Negative" column. At the end of the interview, you add up the positive and negative totals and compute the percentage of positive comments.
So for example, if there are 5 positive comments and 5 negative comments the percentage of positive comments is 50% (5 divided by 10). Similarly, if there are 9 positive comments and 3 negative comments the percentage of positive comments is 75% (9 divided by 12). This could be used as a satisfaction metric to compare interfaces.
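The calculation is simple enough to do by hand, but if you are coding many interviews a small helper may save time. This is a minimal sketch; the function name is our own invention for illustration.

```python
def percent_positive(positive, negative):
    """Return the percentage of coded comments that were positive."""
    total = positive + negative
    if total == 0:
        raise ValueError("no comments were coded")
    return 100 * positive / total

# The two worked examples from the text
print(percent_positive(5, 5))  # 50.0
print(percent_positive(9, 3))  # 75.0
```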
Now you try
If you would like to try out this method in one of your own studies, we've developed an Excel spreadsheet that you can use to generate and randomise the word list. (Randomisation of the list prevents order effects). The spreadsheet also contains a worksheet that lets you analyse the data and generate a word cloud. We do this by using an advanced feature in Wordle. (It bothers us that Wordle applies colours randomly. We want the colour to convey information like the text size does, as in Figure 1 above. So we used some Excel tomfoolery to generate colour information for Wordle. This way, the most popular adjectives are also the darkest and the less popular comments fade into the distance). The Excel file contains macros; you can disable the macros if you want and still print the word list, but you'll lose the randomisation and analysis functionality. I hope you find it useful to start collecting more in-depth measures of user satisfaction.
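If you would rather not enable macros, the randomisation step can be approximated outside Excel. The Python below is a sketch under our own assumptions and is not how the spreadsheet works internally: it shuffles a copy of the word list, and it uses the participant number as a seed (a hypothetical convention) so that each participant's ordering can be reproduced later. The short word list is a sample only.

```python
import random

def randomised_word_list(words, seed=None):
    """Return a shuffled copy of the word list, leaving the original untouched."""
    rng = random.Random(seed)
    shuffled = list(words)
    rng.shuffle(shuffled)
    return shuffled

# A short sample list; a real study would use the full set of adjectives
words = ["Consistent", "Sophisticated", "Useful", "Easy to use", "Boring"]

# One ordering per participant, seeded by participant number
for participant in range(1, 4):
    print(participant, randomised_word_list(words, seed=participant))
```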
By the way, if you find this spreadsheet useful then you'll love our Usability Test Plan Toolkit. The word list is just one of 6 appendices in the Test Plan Toolkit, which has everything you need to conduct your next usability test.
References
Benedek, J. and Miner, T. (2002). "Measuring Desirability: New Methods for Evaluating Desirability in a Usability Lab Setting." Redmond, WA: Microsoft Corporation.
Cronbach, L.J. (1946). "Response sets and test validity." Educational and Psychological Measurement, 6, 475-494.
Wiklund, M., Thurrott, C. and Dumas, J. (1992). "Does the Fidelity of Software Prototypes Affect the Perception of Usability?" Proc. Human Factors Society 36th Annual Meeting, 399-403.
About the author
Dr. David Travis (@userfocus on Twitter) holds a BSc and a PhD in Psychology and he is a Chartered Psychologist. He has worked in the fields of human factors, usability and user experience since 1989 and has published two books on usability. David helps both large firms and start ups connect with their customers and bring business ideas to market. If you like his articles, you'll love his online user experience training course.