Talking to computers (part 1): Why is speech recognition so difficult?

Although the performance of today's speech recognition systems is impressive, the experience for many is still one of errors, corrections, frustration and abandoning speech in favour of alternative interaction methods. We take a closer look at speech and find out why speech recognition is so difficult. — Philip Hodgson, Jun 3, 2019

By Philip Hodgson, Jun 3, 2019


What do smashed eggs have to do with speech recognition? Read on to find out.

This is a two-part article. In this first part, I take a closer look at speech itself and at why speech recognition is difficult. In the next part, I discuss some VUI design principles for managing recognition errors.

The speech in speech recognition

Sci-fi TV shows and movies have always got speech wrong. Characters talk naturally and effortlessly to the computers and the computers understand everything perfectly. Then the computer replies in a voice that sounds like someone just swallowed a cheese grater. In reality it's the other way round. Making a computer talk has always been easier than making it listen. Speech synthesis has always outpaced speech recognition. Speech recognition is hard.

But there's no denying the progress that's been made. We've come a long way since 1952 when Bell Labs' Audrey was able to recognize ten digits spoken by a single talker. Deep neural networks, improved learning algorithms, and brute force computing power have now made recognition of continuous speech possible with minimal system training. So now we expect Siri and Alexa to easily recognize speech rather than wreck a nice beach, and to show us a new display rather than a nudist play.

And for the most part things work well. Only yesterday Siri told me the time. "It is 10:35 a.m." she said. And it was. A few years ago we'd have been blown away by that. Alas, I'd not actually asked her what the time was. I'd asked her, "What is the Burnley versus Manchester City score?" Still, she got it on the second attempt.

Our expectation, set by Star Trek 50 years ago, and elevated by the sheer ubiquity of today's applications, that we can talk naturally to a computer and it will understand us, is getting closer and closer to realisation, so much so that we quite overlook the fact that there's nothing natural about talking to a computer and that the computer doesn't really understand us at all.

OK, that's enough with being impressed. The reality is that, for many users, the experience is still one of recognition errors, repeating ourselves, getting frustrated, and giving up and using a keypad. But why?

Why is speech recognition difficult?

Speech is a three-dimensional physical event: spectral energy varying in amplitude and unfolding over time. Why can't we just extract the acoustic cues and map them, one-to-one, on to phonemes (the smallest units of speech that distinguish one word from another)? It sounds like a plan, but it's not that simple. Not even close. It turns out to be less to do with simple pattern matching and much more like trying to break the Enigma code. To get to the root of the problem we must forget about the applications and the technology and the recognition algorithms for a moment and focus on the uniquely human phenomenon of speech itself.

The problem of coarticulation

The real problem begins way before the speech signal even reaches the speech recognizer. It begins with the way we speak and with a phenomenon called coarticulation.

Most speech sounds (phonetic segments) are not articulated, they are coarticulated. Coarticulation is the simultaneous production of more than one speech sound. As an example, let's take a sentence that most two-year-olds can understand without breaking sweat: 'The cat sat on the mat.' By the way, I just ran this by Siri and she heard it as, "I can't set an alert". She then helpfully replied, "I suppose you can't."

Focus on the word 'sat' and try this for yourself: say 'sat' aloud. As you say it pay attention to the shape of your mouth, lips and tongue. Do it slowly if it helps or stand in front of a mirror. Now change just the vowel sound and say the word 'suit'. See how your mouth and lips and tongue change shape to utter the /s/ sound in each case. It's the same phoneme /s/ but produced with different articulatory gestures, giving rise to different acoustic properties. This is because as we produce the consonant /s/ we are simultaneously coarticulating the following vowel. You can try the same thing with the phoneme /k/ in 'key' and 'cool'. In controlled experiments in which the consonants /s/ or /k/ are excised and presented without the following vowel, listeners are able to accurately identify the missing vowel just from the cues in the consonant.

You can't make a speech recognizer without breaking eggs

But what's actually happening to the acoustic-phonetic cues that the brain (and the speech recognizer) need to extract and identify? To use the metaphor made famous by linguist Charles Hockett, don't think of phonetic segments as if they were beads on a string, coming at you one after the other in an orderly sequence. Think of them as eggs smashed through a wringer, with acoustic information from different segments smeared and mixed together. Beads on a string is what we perceive because our brains instantly and effortlessly extract language and meaning. But eggs through a wringer is what our brains (and speech recognizers) are presented with, and what they have to untangle.

An easy way to actually experience what 'eggs through a wringer' sounds like is to simply listen to the sounds of a language that you don't understand. What you hear is not a sequence of discrete words but a continuous stream of acoustic energy with no obvious beginnings and endings of words or phonemes. This is the problem of lexical segmentation. This smearing of acoustic cues across phonetic and lexical boundaries is one problematic consequence of coarticulation.

But hold on to your hat, it gets worse…

Lack of acoustic invariance

Because of coarticulation, distinctive acoustic properties do not correspond in a simple and predictable way to phonetic segments. There is no single invariant acoustic property in the speech signal that corresponds uniquely to a given phoneme. This means that speech recognition (by brain or by technology) is not simply a matter of identifying acoustic property X and interpreting it as phoneme X'. Sometimes acoustic property X will give rise to phoneme Y', and sometimes phoneme X' will be cued by acoustic property Z. Almost every phonetic distinction has multiple correlates in the acoustic speech signal. In normal speech, the articulatory gymnastics required to utter one phoneme after another cause changes that have a ripple effect on the acoustic spectrum, its amplitude and the temporal properties of the speech signal. To complicate matters still further, these changing acoustic-phonetic properties (including periods of silence, the total absence of any acoustic energy at all) can trade off against each other and yet still give the same percept. What's more, different talkers making the same utterance produce different acoustic properties, and even the same talker successively uttering the same word does not produce exactly the same acoustic cues. This thorny problem, known as the lack of acoustic invariance, is another consequence of coarticulation.

The speech recognizer is not just trying to unscramble eggs, it's trying to do it while the mess of eggs is constantly being stirred with a spoon.

Some thoughts on recognition performance

Given the complexities of the speech signal, it should not be a surprise that speech recognition systems make errors. It's impressive that they don't make more. Yet most companies quote recognition accuracy rates in the order of 95%. Nuance claims "up to 99% recognition accuracy" for its latest version of Dragon (Nuance also provides the recognition engine for Siri).

Although 95% accuracy seems (and is) impressive, three things are important to note.

1. The performance rates that companies report are based on their own test methods and are the best possible rates obtainable in optimal conditions. You may or may not get those rates in everyday use.
2. The accuracy rates refer to the 'per word' recognition rate, not the 'per utterance' or 'per sentence' rate and not the 'per completed task' rate. This matters because a task depending on the correct recognition of every word in a sequence of, say, 10 or 20 words, even with the help of top-down knowledge, is going to have a completion rate some way below 95%. And bear in mind that users don't care about high recognition rates; they only care about high task completion rates.
3. 20 years ago recognition rates were reported as 95%. Go back another 10 years and, lo and behold, they were reported as 95% back in the 1980s too. In fact 70 years ago Audrey managed 90-97% correct recognition. In spite of advancements in technology and computing power it seems as though we may be hitting a ceiling for recognition accuracy. This could be as good as it gets.
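
To put rough numbers on the per-word versus per-task point, here is a minimal sketch. It assumes every word is recognised independently with the same probability, which real recognizers do not do (errors tend to cluster, and top-down knowledge helps), so treat the figures as illustrative rather than as a prediction. The function name `utterance_success_rate` is made up for this example.

```python
def utterance_success_rate(per_word_accuracy: float, n_words: int) -> float:
    """Chance that every word in an n-word utterance is recognised correctly.

    Simplifying assumption (not taken from any vendor's test method):
    word errors are independent of one another.
    """
    return per_word_accuracy ** n_words


for n_words in (1, 10, 20):
    rate = utterance_success_rate(0.95, n_words)
    print(f"{n_words:>2} words at 95% per word: {rate:.0%} of utterances fully correct")
```

Under this assumption, a 10-word utterance comes through intact roughly 60% of the time and a 20-word utterance roughly 36% of the time, which is the sense in which a 95% per-word rate can sit alongside a much lower task completion rate.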

Of course, in chasing higher recognition rates, we should not forget that human-to-human word recognition accuracy is not 100%. Even professional transcribers have about a 5-6% word error rate. In everyday conversation we constantly mis-hear each other and have to repeat what we say, sometimes louder, sometimes more slowly or more deliberately, sometimes using different words. A bit like talking to a computer, in fact.
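
For readers who have not met the term, word error rate is conventionally computed as the number of word substitutions, deletions and insertions needed to turn the recognised text into the reference text, divided by the number of words in the reference. The sketch below is one simple way to compute it with a word-level edit distance. The function name `word_error_rate` is hypothetical, and the lower-casing and whitespace splitting are assumptions of this example, not a description of how any vendor scores its system.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edits needed to turn the reference words seen so far
    # into the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Two of the five reference words are misrecognised.
print(word_error_rate("show us a new display", "show us a nudist play"))  # 0.4
```

Because insertions count as errors, the rate can exceed 100%: scoring "wreck a nice beach" against the reference "recognize speech" gives 2.0 with this function.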

But here's the difference… humans are orders of magnitude better at recovering from errors than are computers. And it is errors and how to manage them that we need to turn to next.

Recovering from errors

Recognition failure, and the consequent failure to complete a task, knocks the user off course and can quickly trigger spiralling degradation as one error triggers another. The VUI designer's job is to stop this from happening and to get the user back on track as quickly and as seamlessly as possible. This is why VUI designers are typically advised to "design for scenarios where speech recognition fails." It's good advice. In fact, I'd go a step further: Think of your VUI as an error recovery system. This means assuming that the recognizer will fail (because it will), and it means understanding, anticipating, and designing for worst-case listening conditions.

Coming up in Part 2: 'VUI as an error recovery system': We'll go beyond the speech signal and look at the kinds of natural human communication behaviours that can trip up a speech recognizer, and we'll look at the VUI design principles that can help ensure your users don't end up like these two Scotsmen.

-oOo-


Dr. Philip Hodgson (@bpusability on Twitter) has been a UX researcher for over 25 years. His work has influenced design for the US, European and Asian markets, for everything from banking software and medical devices to store displays, packaging and even baby care products. His book, Think Like a UX Researcher, was published in January 2019.
