2.1 Technologies that use voice data
Text-to-Speech (TTS)
TTS systems convert written text into synthesized speech. These systems analyze text and generate audio that mimics human speech patterns. Modern TTS systems can:
generate speech with a style of speaking and rhythm that sounds natural
adapt to different speaking styles, accents, and moods
read out text in real time to help people with accessibility needs
allow people to use apps and services that are controlled by voice
To develop high-quality TTS systems, you generally need clean, studio-quality voice recordings. These systems work best when trained on data recorded in controlled settings: very little background noise, a consistent audio level, and professional speakers. To produce natural-sounding synthesized speech, the speaker should have good pronunciation, speak clearly, and use expression that suits the content.
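The "consistent audio level" requirement can be checked automatically before recordings enter a training set. The sketch below is illustrative only: it assumes each clip is already decoded into a list of float samples in the range -1.0 to 1.0, and the function names and the 3 dB tolerance are hypothetical choices, not values taken from this guide.

```python
import math


def rms_dbfs(samples):
    """Return the RMS level of float samples in [-1.0, 1.0], in dB relative to full scale."""
    if not samples:
        raise ValueError("clip contains no samples")
    mean_square = sum(s * s for s in samples) / len(samples)
    # Floor the value so that a fully silent clip yields -120 dB instead of -infinity.
    return 10 * math.log10(max(mean_square, 1e-12))


def level_spread_db(clips):
    """Return the difference between the loudest and quietest clip, in dB."""
    levels = [rms_dbfs(clip) for clip in clips]
    return max(levels) - min(levels)


def is_level_consistent(clips, max_spread_db=3.0):
    """Check whether all clips sit within max_spread_db of each other (assumed tolerance)."""
    return level_spread_db(clips) <= max_spread_db
```

A check like this only covers loudness. Background noise and pronunciation quality need separate measurements or human review.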
Automatic Speech Recognition (ASR)
ASR systems, also called speech-to-text systems, convert spoken language into written text. These systems:
process audio recordings and identify elements of speech
take spoken content and convert it into written text
allow people to use voice assistants and interfaces that are controlled by voice
allow automated transcription of meetings, interviews, and other spoken content
ASR systems work better when trained on diverse voice data that covers a wide range of speakers, accents, and settings. To be effective, ASR systems need to be trained on data that reflects the full range of real-world speech, including different ages, genders, accents, speaking styles, and acoustic settings (for example, noisy traffic).
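One way to see whether a dataset reflects that range is to count recordings per metadata value and flag values that make up only a small share. The sketch below assumes each recording has a metadata dictionary. The field names (`accent`, `setting`), the helper names, and the 10% threshold are hypothetical and would need to match your own dataset.

```python
from collections import Counter


def coverage_report(records, fields):
    """Count how many recordings fall under each value of each metadata field."""
    return {
        field: Counter(record.get(field, "unknown") for record in records)
        for field in fields
    }


def underrepresented(report, min_share=0.1):
    """List (field, value) pairs whose share of recordings is below min_share."""
    gaps = []
    for field, counts in report.items():
        total = sum(counts.values())
        for value, count in counts.items():
            if count / total < min_share:
                gaps.append((field, value))
    return gaps


# Hypothetical metadata for ten recordings.
records = (
    [{"accent": "accent_a", "setting": "quiet room"}] * 9
    + [{"accent": "accent_b", "setting": "traffic"}]
)
report = coverage_report(records, ["accent", "setting"])
gaps = underrepresented(report, min_share=0.2)
```

This report only flags values that appear in the data at least once. Groups that are missing entirely have to be compared against a target list defined separately.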