Carnegie Mellon University


March 31, 2020

Recognizing Speech Recognition

By Madison Brewer

Krista Burns

Speech-to-text transcription has come a long way, but anyone with a smartphone capable of the feature knows it still has some way to go. Fortunately, Richard Stern, a professor in the Department of Electrical and Computer Engineering, and his group research new ways to improve speech recognition software and have created an algorithm that introduces a new feature set: power-normalized cepstral coefficients (PNCC). This work was part of the doctoral thesis of Stern's graduate student Chanwoo Kim, who now serves as a vice president at Samsung Corporation in South Korea.

The IEEE Signal Processing Society, a worldwide organization committed to advancing this area of science, honored Stern and Kim with a 2019 Best Paper Award for their 2016 paper on PNCC. The award will be presented at the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) to be held in May 2020 in Barcelona.

Stern said he was honored to be given this award.

“There still is this real satisfaction from having an idea and having the idea be right, having useful results come from that,” Stern said. “It's very nice to have that recognition, even for a moment.”

Stern uses knowledge of human auditory processing to guide his algorithm design, so PNCC is specifically structured to mimic some of the brain's methods of interpreting words from sound waves.

Stern’s research, funded by the National Science Foundation and Samsung, provides a new algorithm for computing the features of sound that speech recognition software uses to determine which words have been spoken. These features are designed to better recognize words spoken in difficult acoustic environments, such as those degraded by additive noise from traffic or background music, interference from other speakers, and reverberation. Stern said he focused especially on reverberation and background speech because these are especially difficult conditions for speech recognition.

“This is what robustness is intended to refer to,” Stern said. “Speech recognition systems should just work under any circumstance, regardless of what the acoustical environment might be.” 

Stern and Kim’s algorithm contains a unique feature called “medium-time processing.” One of the fundamental tradeoffs in signal analysis is choosing between a precise time measurement and a precise frequency measurement. While most speech recognition software relies on short-time processing, sacrificing frequency precision for time precision, Stern and Kim supplement the short-time processing with a longer-time analysis they call medium-time processing. Using medium-time processing, they can estimate the statistical characteristics of the background noise. Because these characteristics change slowly, a precise time measurement isn’t needed, which enables a more precise frequency measurement. The PNCC features are computed at both time scales: short-time analysis characterizes the speech, while longer-time analysis characterizes the environment.
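The two time scales can be illustrated with a small sketch. This is not the actual PNCC implementation, only a toy example: the frame sizes (25 ms windows with a 10 ms hop at 16 kHz) and the smoothing span (averaging over 2M+1 frames with M=2) are typical choices assumed for illustration. The point is that a power estimate averaged over several neighboring frames varies slowly, making it better suited to tracking background-noise statistics than the raw short-time power.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Slice a waveform into overlapping short-time frames
    (25 ms windows, 10 ms hop at a 16 kHz sampling rate)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def short_time_power(frames):
    """Per-frame power: the conventional short-time analysis."""
    return np.mean(frames ** 2, axis=1)

def medium_time_power(p, M=2):
    """Average short-time power over 2M+1 neighboring frames.
    Because background-noise statistics change slowly, this
    longer-time estimate tracks the noise floor more steadily."""
    kernel = np.ones(2 * M + 1) / (2 * M + 1)
    return np.convolve(p, kernel, mode="same")

# Toy signal: one second of steady background noise with a brief
# speech-like tone burst in the middle.
rng = np.random.default_rng(0)
fs = 16000
x = 0.05 * rng.standard_normal(fs)
x[6000:8000] += np.sin(2 * np.pi * 300 * np.arange(2000) / fs)

frames = frame_signal(x)
p_short = short_time_power(frames)
p_medium = medium_time_power(p_short)

# The medium-time estimate fluctuates less from frame to frame,
# so it is the better handle on the slowly varying environment.
print(np.std(np.diff(p_medium)) < np.std(np.diff(p_short)))
```

The smoothed estimate deliberately gives up time resolution; that loss is acceptable here because the quantity being estimated, the background-noise level, changes slowly by assumption.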

Word recognition accuracy is the most important measure of speech recognition software, and in most cases PNCC provides better recognition accuracy than competing algorithms. This comes at a slightly higher computational cost. Computational cost reflects how long an algorithm takes to run; an algorithm that is too costly is impractical for real-world use. According to Stern, however, PNCC processing is not substantially more costly than other algorithms, and he says the increased cost is well worth the improved results.

“If you do everything with all the bells and whistles, it might be one and a half to two times the original cost,” Stern said. “But the overall cost of signal processing nowadays is just a tiny fraction of the overall computational load.”

Stern said the most important feature of his research is the interplay between engineered speech recognition technology and the science of human auditory processing.

“The performance that we're getting is a consequence of applying principles based on what we know about auditory perception to this very engineering problem of speech recognition,” Stern said. “I think that this kind of success speaks to the benefit of...bringing together information, knowledge, and wisdom from multiple disciplines.”