Amazon trains 980M-parameter LLM with “emergent abilities”

Amazon researchers have trained a new large language model (LLM) for text-to-speech, and they assert that it demonstrates “emergent” capabilities.

BASE TTS, a 980-million-parameter model, is the largest text-to-speech model to date. The goal of the study was to determine whether training models of different sizes on up to 100,000 hours of public-domain voice data would produce the same kind of performance leaps seen in natural language processing models as they scale up.

They discovered that their medium-sized model, with 400 million parameters and trained on 10,000 hours of audio, showed significantly improved adaptability and resilience on challenging test sentences.

Text-to-speech systems are typically tripped up by complex lexical, syntactic, and paralinguistic features in test sentences, such as compound nouns, emotional cues, foreign words, and punctuation. Though not flawless, BASE TTS produced far fewer pronunciation, intonation, and stress mistakes than previous models.

The researchers noted that “these sentences are designed to contain challenging tasks—none of which BASE TTS is explicitly trained to perform.”

The largest 980-million-parameter version of the model, trained on 100,000 hours of audio, showed no further gains beyond the 400-million-parameter version.

The development of BASE TTS shows that, even as an experimental process, these models can scale to new levels of versatility, which is encouraging for conversational AI. Future research aims to determine the ideal model size for emergent abilities.

Additionally, the model is designed to be lightweight and streamable, with prosodic and emotional data packaged separately. This would allow natural-sounding spoken audio to be transmitted over low-bandwidth connections.
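To illustrate why separately packaged discrete streams suit low-bandwidth links, the sketch below estimates the bitrate of a hypothetical two-stream setup; the token rates and codebook sizes here are assumed for illustration only and are not figures from the paper:

```python
import math

def stream_bitrate(tokens_per_second: int, codebook_size: int) -> float:
    """Bits per second needed to transmit one stream of discrete codes.

    Each token is an index into a codebook, so it costs
    log2(codebook_size) bits to send.
    """
    bits_per_token = math.log2(codebook_size)
    return tokens_per_second * bits_per_token

# Hypothetical figures: a main speech-content stream plus a smaller,
# separately packaged prosody/emotion stream.
content_bps = stream_bitrate(tokens_per_second=50, codebook_size=1024)  # 500 b/s
prosody_bps = stream_bitrate(tokens_per_second=10, codebook_size=256)   # 80 b/s
total_bps = content_bps + prosody_bps

print(f"content: {content_bps:.0f} b/s, prosody: {prosody_bps:.0f} b/s, "
      f"total: {total_bps:.0f} b/s")
```

Even with generous assumptions, the total stays in the hundreds of bits per second, orders of magnitude below raw audio, which is what makes this style of codec attractive for streaming over constrained connections.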
