5.5 TTS versus ASR-oriented recording

Voice data can feed different types of language technology. For TWB Voice, it helped to know whether the dataset was intended for building TTS or ASR models, because we could then adapt our recording instructions and validation criteria to that goal.

Text-to-Speech (TTS)

TTS models convert text into human-sounding speech. They need a balanced dataset with even, clear voices. Recordings must:

  • be neutral and have a consistent tone

  • use standard pronunciation and a normal speed of speaking

  • avoid emotional or dramatic delivery and irregular speech patterns

  • be done in a controlled space with no background noise, following strict quality instructions
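Some of these TTS criteria can be pre-screened automatically before a human validator listens. The following is a minimal sketch, not the TWB Voice pipeline: the function name `check_tts_clip`, the result fields, and every threshold (noise floor, words per second) are hypothetical and would need tuning per language and recording setup. It assumes the audio has already been decoded into floating-point samples between -1.0 and 1.0.

```python
import math
from dataclasses import dataclass


@dataclass
class TtsCheckResult:
    clipped: bool            # True if any sample reaches full scale
    noise_floor_db: float    # level of the quietest frame, in dBFS
    words_per_second: float  # rough speaking-rate estimate
    passed: bool


def rms_db(samples):
    """RMS level in dB relative to full scale (samples in [-1.0, 1.0])."""
    if not samples:
        return float("-inf")
    mean_sq = sum(s * s for s in samples) / len(samples)
    return 10 * math.log10(mean_sq) if mean_sq > 0 else float("-inf")


def check_tts_clip(samples, sample_rate, word_count,
                   max_noise_db=-50.0,  # hypothetical threshold
                   min_wps=1.5,         # hypothetical threshold
                   max_wps=3.5,         # hypothetical threshold
                   frame_ms=20):
    """Pre-screen one recording against simple TTS-style criteria."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples), frame_len)]

    # The quietest frame is used as a rough stand-in for background noise.
    noise_floor = min((rms_db(f) for f in frames), default=float("-inf"))

    # Samples at (or very near) full scale suggest clipping.
    clipped = any(abs(s) >= 0.999 for s in samples)

    # Speaking rate from the prompt's word count and the clip duration.
    duration = len(samples) / sample_rate
    wps = word_count / duration if duration > 0 else 0.0

    passed = (not clipped
              and noise_floor <= max_noise_db
              and min_wps <= wps <= max_wps)
    return TtsCheckResult(clipped, noise_floor, wps, passed)
```

A check like this only catches measurable problems such as noise, clipping, and unusual speed. Tone, pronunciation, and emotional delivery still need a human validator.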

Automatic Speech Recognition (ASR)

ASR models convert spoken language into written text. For ASR, recordings should:

  • be as natural and varied as possible, just like real-life conversations

  • include a wide range of accents and speaking speeds

  • include many types of speaker (age, gender, region)
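For ASR, the variety described above can be monitored while recordings come in. Here is a minimal sketch of a coverage report, assuming each recording carries speaker metadata as a dictionary. The field names (`age_group`, `gender`, `region`), the function name `coverage_report`, and the 10% minimum share are all hypothetical and not taken from TWB Voice.

```python
from collections import Counter


def coverage_report(recordings,
                    fields=("age_group", "gender", "region"),  # assumed field names
                    min_share=0.10):                           # hypothetical threshold
    """Summarise how recordings are spread across speaker categories."""
    total = len(recordings)
    report = {}
    for field in fields:
        # Recordings with no value for a field are grouped under "unknown".
        counts = Counter(r.get(field, "unknown") for r in recordings)
        report[field] = {
            value: {
                "count": n,
                "share": n / total,
                "underrepresented": n / total < min_share,
            }
            for value, n in counts.items()
        }
    return report


if __name__ == "__main__":
    sample = (
        [{"age_group": "18-30", "gender": "female", "region": "north"}] * 9
        + [{"age_group": "50+", "gender": "male", "region": "south"}]
    )
    for field, values in coverage_report(sample, min_share=0.2).items():
        for value, stats in values.items():
            flag = " (underrepresented)" if stats["underrepresented"] else ""
            print(f"{field}={value}: {stats['count']}{flag}")
```

The report only flags categories that appear at least once. A group with no recordings at all will not show up, so it helps to compare the output against the list of groups you planned to include.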

💡 Key takeaway

High-quality recordings and fair validation are what make a voice dataset usable.

Use clear guidelines and checklists. Make sure your recordings match the goals of your project.
