This post was originally published on this site

ltechnologygroup.com is a participant in the Amazon Services LLC Associates Program, an affiliate advertising program designed to provide a means for sites to earn advertising fees by advertising and linking to amazon.com.

http://tctechcrunch2011.files.wordpress.com/2017/08/speech-recognition.png

Creating convincing artificial speech is a hot pursuit right now, with Google arguably in the lead. The company may have leapt ahead again with the announcement today of Tacotron 2, a new method for training a neural network to produce realistic speech from text that requires almost no grammatical expertise.

The new technique takes the best pieces of two of Google’s previous speech generation projects: WaveNet and the original Tacotron.

WaveNet produced what I called “eerily convincing” speech one audio sample at a time, which probably sounds like overkill to anyone who knows anything about sound design. But while it is effective, WaveNet requires a great deal of metadata about language to start out: pronunciation, known linguistic features, etc. Tacotron synthesized more high-level features, such as intonation and prosody, but wasn’t really suited for producing a final speech product.

Tacotron 2 uses pieces of both, though I will frankly admit that at this point that we have reached the limits of my technical expertise, such as it is. But from what I can tell, it uses text and narration of that text to calculate all the linguistic rules that systems normally have to be explicitly told. The text itself is converted into a Tacotron-style “mel-scale spectrogram” for purposes of rhythm and emphasis, while the words themselves are generated using a WaveNet-style system.

That should make everything clear!

The resulting audio, several examples of which you can listen to here, is pretty much as good as or better than anything out there. The rhythm of speech is convincing, though perhaps a bit too chipper. It mainly stumbles on words with pronunciations that aren’t particularly intuitive, perhaps due to their origin outside American English, such as “decorum,” where it emphasizes the first syllable, and “merlot,” which it hilariously pronounces just as it looks. “And in extreme cases it can even randomly generate strange noises,” the researchers write.

Lastly, there is no way to control the tone of speech — for example upbeat or concerned — although accents and other subtleties can be baked in as they could be with WaveNet.

Lowering the barrier for training a system means more and better ones can be trained, and new approaches integrated without having to re-evaluate a complex manually defined ruleset or source new such rulesets for new languages or speech styles.

The researchers have submitted it for consideration at the IEEE International Conference on Acoustics, Speech and Signal Processing; you can read the paper itself at Arxiv.

Featured Image: Bryce Durbin/TechCrunch

At L Technology Group, we know technology alone will not protect us from the risks associated with in cyberspace. Hackers, Nation States like Russia and China along with “Bob” in HR opening that email, are all real threats to your organization. Defending against these threats requires a new strategy that incorporates not only technology, but also intelligent personnel who, eats and breaths cybersecurity. Together with proven processes and techniques combines for an advanced next-generation security solution. Since 2008 L Technology Group has develop people, processes and technology to combat the ever changing threat landscape that businesses face day to day.

Call Toll Free (855) 999-6425 for a FREE Consultation from L Technology Group, https://www.ltechnologygroup.com.