Get hands-on practice in all the key areas of UX and prepare for the BCS Foundation Certificate.
Although the performance of today's speech recognition systems is impressive, the experience for many is still one of errors, corrections, frustration and abandoning speech in favour of alternative interaction methods. We take a closer look at speech and find out why speech recognition is so difficult.
What do smashed eggs have to do with speech recognition? Read on to find out.
This is a two-part article. In this first part, I take a closer look at speech itself and at why speech recognition is difficult. In the next part, I discuss some VUI design principles for managing recognition errors.
Sci-fi TV shows and movies have always got speech wrong. Characters talk naturally and effortlessly to the computers and the computers understand everything perfectly. Then the computer replies in a voice that sounds like someone just swallowed a cheese grater. In reality it's the other way round. Making a computer talk has always been easier than making it listen. Speech synthesis has always outpaced speech recognition. Speech recognition is hard.
But there's no denying the progress that's been made. We've come a long way since 1952 when Bell Labs' Audrey was able to recognize ten digits spoken by a single talker. Deep neural networks, improved learning algorithms, and brute force computing power have now made recognition of continuous speech possible with minimal system training. So now we expect Siri and Alexa to easily recognize speech rather than wreck a nice beach, and to show us a new display rather than a nudist play.
And for the most part things work well. Only yesterday Siri told me the time. "It is 10:35 a.m." she said. And it was. A few years ago we'd have been blown away by that. Alas, I'd not actually asked her what the time was. I'd asked her, "What is the Burnley versus Manchester City score?" Still, she got it on the second attempt.
Our expectation, set by Star Trek 50 years ago, and elevated by the sheer ubiquity of today's applications, that we can talk naturally to a computer and it will understand us, is getting closer and closer to realisation, so much so that we quite overlook the fact that there's nothing natural about talking to a computer and that the computer doesn't really understand us at all.
OK, that's enough with being impressed. The reality is that, for many users, the experience is still one of recognition errors, repeating ourselves, getting frustrated, and giving up and using a keypad. But why?
Speech is a three-dimensional physical event: spectral energy varying in amplitude and unfolding over time. Why can't we just extract the acoustic cues and map them, one-to-one, on to phonemes (the smallest units of speech that distinguish one word from another)? It sounds like a plan, but it's not that simple. Not even close. It turns out to be less to do with simple pattern matching and much more like trying to break the Enigma code. To get to the root of the problem we must forget about the applications and the technology and the recognition algorithms for a moment and focus on the uniquely human phenomenon of speech itself.
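To make those three dimensions concrete, here is a minimal sketch of a spectrogram: the signal is chopped into short frames (time), and each frame is broken down into frequency bins (spectrum) with a magnitude in each bin (amplitude). Everything in it is an assumption made for illustration: the 8 kHz sample rate, the 10 ms frame size, and the two pure tones standing in for speech. Real speech is far messier, and real recognizers use much more refined front ends than this plain discrete Fourier transform.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this toy example)
FRAME_SIZE = 80     # 80 samples = 10 ms per frame (assumed)


def tone(freq_hz, amplitude, n_samples):
    """Generate a pure sine tone: a stand-in for a real speech signal."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
            for n in range(n_samples)]


def magnitude_spectrum(frame):
    """Plain discrete Fourier transform: magnitude per frequency bin."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame)))
            for k in range(n // 2)]


def spectrogram(signal):
    """Split the signal into non-overlapping frames (time) and take the
    spectrum of each one (frequency and amplitude)."""
    frames = [signal[i:i + FRAME_SIZE]
              for i in range(0, len(signal) - FRAME_SIZE + 1, FRAME_SIZE)]
    return [magnitude_spectrum(frame) for frame in frames]


def peak_hz(spectrum):
    """Frequency (in Hz) of the strongest bin in one frame."""
    k = max(range(len(spectrum)), key=spectrum.__getitem__)
    return k * SAMPLE_RATE / FRAME_SIZE


# A loud 500 Hz tone followed by a quieter 1500 Hz tone, 20 ms each.
signal = tone(500, 1.0, 160) + tone(1500, 0.5, 160)
spec = spectrogram(signal)

for t, frame_spectrum in enumerate(spec):
    print(f"frame {t}: peak at {peak_hz(frame_spectrum):.0f} Hz, "
          f"magnitude {max(frame_spectrum):.1f}")
```

With these clean tones the peak simply moves from 500 Hz to 1500 Hz and halves in magnitude as time unfolds. If speech were this tidy, reading phonemes off the spectrogram would be easy. It isn't.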
The real problem begins way before the speech signal even reaches the speech recognizer. It begins with the way we speak and with a phenomenon called coarticulation.
Most speech sounds (phonetic segments) are not articulated, they are coarticulated. Coarticulation is the simultaneous production of more than one speech sound. As an example, let's take a sentence that most two-year-olds can understand without breaking sweat: 'The cat sat on the mat.' By the way, I just ran this by Siri and she heard it as, "I can't set an alert". She then helpfully replied, "I suppose you can't."
Focus on the word 'sat' and try this for yourself: say 'sat' aloud. As you say it, pay attention to the shape of your mouth, lips and tongue. Do it slowly if it helps, or stand in front of a mirror. Now change just the vowel sound and say the word 'suit'. See how your mouth, lips and tongue change shape to utter the /s/ sound in each case. It's the same phoneme /s/ but produced with different articulatory gestures, giving rise to different acoustic properties. This is because as we produce the consonant /s/ we are simultaneously coarticulating the following vowel. You can try the same thing with the phoneme /k/ in 'key' and 'cool'. In controlled experiments in which the consonants /s/ or /k/ are excised and presented without the following vowel, listeners are able to accurately identify the missing vowel just from the cues in the consonant.
But what's actually happening to the acoustic-phonetic cues that the brain (and the speech recognizer) needs to extract and identify? To use the metaphor made famous by linguist Charles Hockett, don't think of phonetic segments as if they were beads on a string, coming at you one after the other in an orderly sequence. Think of them as eggs smashed through a wringer, with acoustic information from different segments smeared and mixed together. Beads on a string is what we perceive because our brains instantly and effortlessly extract language and meaning. But eggs through a wringer is what our brains (and speech recognizers) are presented with, and what they have to untangle.
An easy way to actually experience what 'eggs through a wringer' sounds like is to simply listen to the sounds of a language that you don't understand. What you hear is not a sequence of discrete words but a continuous stream of acoustic energy with no obvious beginnings and endings of words or phonemes. This is the problem of lexical segmentation. This smearing of acoustic cues across phonetic and lexical boundaries is one problematic consequence of coarticulation.
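The segmentation problem can be sketched in a few lines. This is a toy, not how any real recognizer works: letters stand in for the continuous sound stream, and the tiny lexicon and the `segmentations` function are hypothetical names invented for this illustration. Even so, it shows how a stream with no marked word boundaries can be carved up in more than one perfectly legal way.

```python
def segmentations(stream, lexicon):
    """Return every way to split `stream` into words found in `lexicon`."""
    if not stream:
        return [[]]  # one way to segment nothing: no words at all
    results = []
    for end in range(1, len(stream) + 1):
        word = stream[:end]
        if word in lexicon:
            # Keep this word and segment whatever is left of the stream.
            for rest in segmentations(stream[end:], lexicon):
                results.append([word] + rest)
    return results


# Hypothetical mini-lexicon; a real system holds many thousands of words.
LEXICON = {"no", "now", "here", "where", "nowhere"}

for words in segmentations("nowhere", LEXICON):
    print(" ".join(words))
```

The stream 'nowhere' comes back as 'no where', 'now here' and 'nowhere'. Nothing in the stream itself says which one the talker meant, so the listener has to bring other knowledge to bear.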
But hold on to your hat, it gets worse…
Because of coarticulation, distinctive acoustic properties do not correspond in a simple and predictable way to phonetic segments. There is no single invariant acoustic property in the speech signal that corresponds uniquely to a given phoneme. This means that speech recognition (by brain or by technology) is not simply a matter of identifying acoustic property X and interpreting it as phoneme X'. Sometimes acoustic property X will give rise to phoneme Y', and sometimes phoneme X' will be cued by acoustic property Z. Almost every phonetic distinction has multiple correlates in the acoustic speech signal. In normal speech, the articulatory gymnastics required to utter one phoneme after another cause changes that have a ripple effect on the acoustic spectrum, its amplitude and the temporal properties of the speech signal. To complicate matters still further, these changing acoustic-phonetic properties (including periods of silence, the total absence of any acoustic energy at all) can trade off against each other and yet still give the same percept. What's more, different talkers making the same utterance produce different acoustic properties, and even the same talker successively uttering the same word does not produce exactly the same acoustic cues. This thorny problem, known as the lack of acoustic invariance, is another consequence of coarticulation.
The speech recognizer is not just trying to unscramble eggs, it's trying to do it while the mess of eggs is constantly being stirred with a spoon.
Given the complexities of the speech signal, it should not be a surprise that speech recognition systems make errors. It's impressive that they don't make more. Yet most companies quote recognition accuracy rates in the order of 95%. Nuance claims "up to 99% recognition accuracy" for its latest version of Dragon (Nuance also provides the recognition engine for Siri).
Although 95% accuracy seems (and is) impressive, three things are important to note.
Of course, in chasing higher recognition rates, we should not forget that human-to-human word recognition accuracy is not 100%. Even professional transcribers have about a 5-6% word error rate. In everyday conversation we constantly mis-hear each other and have to repeat what we say, sometimes louder, sometimes more slowly or more deliberately, sometimes using different words. A bit like talking to a computer, in fact.
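For anyone who wants to put a number on all this, here is a minimal sketch of how word error rate is commonly calculated: count the fewest word substitutions, deletions and insertions needed to turn what was recognized into what was actually said, then divide by the number of words actually said. The function name and the lower-casing and whitespace splitting are my own simplifications, and I am assuming that quoted 'accuracy' figures are roughly 1 minus this rate; vendors may well calculate theirs differently.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# The mis-hearings quoted earlier in this article.
print(word_error_rate("recognize speech", "wreck a nice beach"))
print(word_error_rate("the cat sat on the mat", "I can't set an alert"))
```

Note that the first example scores 2.0, not 1.0: two wrong words plus two extra ones against a two-word reference. Word error rate can exceed 100%, which is one reason a headline accuracy figure needs careful reading.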
But here's the difference… humans are orders of magnitude better at recovering from errors than are computers. And it is errors and how to manage them that we need to turn to next.
Recognition failure, and the consequent failure to complete a task, knocks the user off course and can quickly trigger spiralling degradation as one error triggers another. The VUI designer's job is to stop this from happening and to get the user back on track as quickly and as seamlessly as possible. This is why VUI designers are typically advised to "design for scenarios where speech recognition fails." It's good advice. In fact, I'd go a step further: Think of your VUI as an error recovery system. This means assuming that the recognizer will fail (because it will), and it means understanding, anticipating, and designing for worst case listening conditions.
Coming up in Part 2: 'VUI as an error recovery system': We'll go beyond the speech signal and look at the kinds of natural human communication behaviours that can trip up a speech recognizer, and we'll look at the VUI design principles that can help ensure your users don't end up like these two Scotsmen.
copyright © Userfocus 2020.