A Brief History of ASR: Automatic Speech Recognition Part 1
Here at SnatchBot we believe in a future where intelligent, conversational AI provides an essential tool for all human activity. And that future, where, for example, you chat to your car to adjust the route, the temperature, the radio… isn’t very far away.
Last week we reached an important milestone in our own pipeline, when we added voices to our chatbots. Once again, users can have this wonderful new feature for free, one that enormously enhances user experience.
Further along that road lies speech recognition. To understand how that has become possible, and to appreciate just how exciting a time we live in, it’s worth looking back over the history of this technology to see how far it has come and why we are now at a tipping point.
Early Days
Human interest in recognizing and synthesizing speech dates back hundreds of years, but it wasn’t until the mid-twentieth century that our forebears built something recognizable as ASR.
(Image: IBM Shoebox, 1961)
Among the earliest projects was a ‘digit recognizer’ called Audrey, created by researchers at Bell Laboratories in 1952. Audrey could recognize spoken numerical digits by looking for audio fingerprints called formants: the distilled essences of sounds.
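To make the idea of formant matching a little more concrete, here is a minimal, hypothetical sketch in Python (NumPy only): it estimates the first two formants of a short, vowel-like audio frame with linear predictive coding and picks the nearest word template. This illustrates the general principle rather than Audrey’s actual design; the sample rate, LPC order and template values are all assumptions invented for the example.

```python
# A toy illustration (not Audrey's real design): match a short vowel-like
# frame to a word by comparing its first two formants against templates.
# Sample rate, LPC order and template values are assumptions for the demo.
import numpy as np

FS = 8000  # assumed sample rate in Hz


def lpc(frame, order=10):
    """LPC coefficients via the autocorrelation method (Levinson-Durbin)."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a[1:i] += k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a


def formants(frame, fs=FS, order=10):
    """Estimate resonance frequencies (Hz) from the angles of the LPC poles."""
    a = lpc(frame * np.hamming(len(frame)), order)
    roots = np.roots(a)
    # Keep one of each conjugate pair, and only sharp (narrow-bandwidth) poles.
    roots = roots[(np.imag(roots) > 0.0) & (np.abs(roots) > 0.9)]
    freqs = np.sort(np.angle(roots) * fs / (2.0 * np.pi))
    return freqs[freqs > 90.0]  # discard near-DC estimates


# Hypothetical (F1, F2) templates in Hz; a real system would measure them
# from recordings of each word, for each speaker.
TEMPLATES = {
    "one": (440.0, 1020.0),
    "two": (300.0, 870.0),
    "three": (270.0, 2290.0),
    "four": (640.0, 1190.0),
    "five": (730.0, 1090.0),
}


def recognize(frame, fs=FS):
    """Return the template word whose (F1, F2) pair lies closest to the frame's."""
    f = formants(frame, fs)
    if len(f) < 2:
        return None
    return min(TEMPLATES, key=lambda w: (TEMPLATES[w][0] - f[0]) ** 2
                                        + (TEMPLATES[w][1] - f[1]) ** 2)


if __name__ == "__main__":
    # A synthetic 'vowel': two damped resonances near the "one" template.
    rng = np.random.default_rng(0)
    t = np.arange(0, 0.03, 1.0 / FS)
    frame = (np.exp(-40 * t) * np.sin(2 * np.pi * 440 * t)
             + 0.5 * np.exp(-40 * t) * np.sin(2 * np.pi * 1020 * t)
             + 1e-3 * rng.standard_normal(t.size))
    print(recognize(frame))  # expected to land nearest the "one" template
```

Audrey, of course, did all of this with purpose-built analogue circuitry rather than software; the template lookup above simply stands in for that hardware.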
In the 1960s, IBM developed Shoebox: a system that could recognize digits and arithmetic commands like ‘plus’ and ‘total’. Better yet, Shoebox could pass the math problem to an adding machine, which would calculate and print the answer.
At the same time, researchers in Japan built hardware that could recognize the constituent parts of speech, like vowels; other systems could evaluate the structure of speech to figure out where a word might end. And a team at University College London could recognize 4 vowels and 9 consonants by analyzing phonemes, the discrete sounds of a language.
But while the field was taking incremental steps forward, it wasn’t necessarily clear where the path was heading. And then a setback.
(Image: The Journal of the Acoustical Society of America, October 1969)
A Piercing Freeze
A hiatus in the development of ASR was signaled in a letter written by John R. Pierce in 1969. Pierce had long since established himself as an engineer of international renown: among other achievements, he coined the word transistor (now ubiquitous in engineering) and helped launch Echo I, the first passive communications satellite. By 1969, he was an executive at Bell Labs, which had been a major driving force in the development of speech recognition.
In an open letter published in The Journal of the Acoustical Society of America, Pierce laid out his concerns. Citing the ‘lush’ funding environment in the aftermath of World War II and Sputnik, and the lack of accountability that came with it, Pierce admonished the field for its lack of scientific rigor, asserting that there was too much wild experimentation going on:
We all believe that a science of speech is possible, despite the scarcity in the field of people who behave like scientists and of results that look like science.
Pierce put his employer’s money where his mouth was: he defunded Bell’s ASR programs, which wouldn’t be reinstated until after he resigned in 1971.
Progress Continues
Elsewhere, however, there was more optimism. In the early 1970s, the U.S. Department of Defense’s ARPA (the agency now known as DARPA) funded a five-year program called Speech Understanding Research. This led to the creation of several new ASR systems, the most successful of which was Carnegie Mellon University’s Harpy, which could recognize just over 1,000 words by 1976.
Meanwhile, efforts from IBM and AT&T’s Bell Laboratories pushed the technology toward possible commercial applications. IBM prioritized speech transcription in the context of office correspondence, while Bell concentrated on ‘command and control’ scenarios: the precursors to the voice dialing and automated phone trees we know today.
Despite this progress, by the end of the 1970s ASR was still a long way from being viable for anything but highly specific use cases.
In part two, we’ll see how these hesitant early steps became the confident race to ASR we are experiencing today.