2.1 Technologies that use voice data

Text-to-Speech (TTS)

TTS systems convert written text into synthesized speech. They analyze the text and generate audio that imitates human speech patterns. Modern TTS systems can:

  • generate speech with natural-sounding delivery and rhythm

  • adapt to different speaking styles, accents, and moods

  • read out text in real time to help people with accessibility needs

  • allow people to use apps and services that are controlled by voice

To develop high-quality TTS systems, you generally need clean, studio-quality voice recordings. These systems work best when trained on data recorded in controlled settings: very little background noise, a consistent audio level, and professional speakers. To produce natural-sounding synthesized speech, the speaker needs good pronunciation, clear articulation, and expression that fits the content.
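As an illustration of what "consistent audio level" can mean in practice, the sketch below compares the RMS level of several recordings. It is only a minimal example under stated assumptions: the function names, the representation of audio as lists of floats between -1.0 and 1.0, and the 3 dB tolerance are all hypothetical choices, not part of any standard or specific toolkit.

```python
import math


def rms_dbfs(samples):
    """Return the RMS level of float samples (range -1.0..1.0) in dBFS."""
    if not samples:
        raise ValueError("samples must not be empty")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")  # digital silence
    return 10 * math.log10(mean_square)


def levels_are_consistent(recordings, max_spread_db=3.0):
    """Check that the RMS levels of all recordings lie within max_spread_db.

    The 3 dB default is an illustrative assumption, not a recommended value.
    """
    levels = [rms_dbfs(recording) for recording in recordings]
    return max(levels) - min(levels) <= max_spread_db


# Example: a full-scale sine wave and a copy at half the amplitude.
loud = [math.sin(2 * math.pi * i / 100) for i in range(1000)]
quiet = [0.5 * s for s in loud]
consistent = levels_are_consistent([loud, quiet])
```

Halving the amplitude lowers the level by about 6 dB, so these two recordings would fail the assumed 3 dB tolerance. A real pipeline would more likely read audio files and use a perceptual loudness measure, but the idea of comparing levels across a dataset is the same.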

Automatic Speech Recognition (ASR)

ASR systems, also called speech-to-text systems, convert spoken language into written text. These systems:

  • process audio recordings and identify elements of speech

  • take spoken content and convert it into written text

  • allow people to use voice assistants and interfaces that are controlled by voice

  • allow automated transcription of meetings, interviews, and other spoken content

ASR systems work better when trained on diverse voice data that covers a wide range of speakers, accents, and settings. To be effective, ASR systems need to be trained with data that reflects the full range of real-world speech, including different ages, genders, accents, speaking styles, and acoustic settings (for example, noisy traffic).
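One simple way to see whether a dataset reflects this range is to tally its metadata. The sketch below is a minimal example, assuming each clip is described by a dictionary with fields such as "accent"; the field names, the helper functions, and the 10% threshold are hypothetical and chosen only for illustration.

```python
from collections import Counter


def coverage_report(clips, field):
    """Return each value's share of clips for one metadata field."""
    counts = Counter(clip[field] for clip in clips)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def underrepresented(clips, field, min_share=0.1):
    """List values whose share of clips falls below min_share.

    The 10% default is an illustrative assumption, not a recommended target.
    """
    report = coverage_report(clips, field)
    return sorted(value for value, share in report.items() if share < min_share)


# Example with hypothetical metadata: eight clips with one accent label,
# and one clip each for two others.
clips = (
    [{"accent": "us"}] * 8
    + [{"accent": "in"}]
    + [{"accent": "ng"}]
)
gaps = underrepresented(clips, "accent", min_share=0.2)
```

The same tally can be repeated for other fields, such as age group, gender, or recording environment, to find where more data is needed.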
