5.5 TTS versus ASR-oriented recording
Voice data can be used for different types of language technology. For TWB Voice, it helped to know from the start whether the dataset was meant for building TTS or ASR. We could then adapt our recording instructions and our validation criteria to that goal.
Text-to-Speech (TTS)
TTS models convert text into speech that sounds like a human. These models need a balanced dataset and the voices should be even and clear. Recordings must:
be neutral and have a consistent tone
use standard pronunciation and a normal speed of speaking
avoid emotional or dramatic delivery and irregular speech patterns
be done in a controlled space with no background noise, following strict quality instructions
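One of the criteria above, a normal speed of speaking, can be checked roughly during validation. The sketch below is only an illustration: the `Clip` structure, the field names, and the words-per-second limits are assumptions, not part of the TWB Voice tooling. It also assumes the language separates words with spaces.

```python
from dataclasses import dataclass

# Hypothetical limits in words per second; tune them per language and project.
MIN_WPS = 1.5
MAX_WPS = 3.5


@dataclass
class Clip:
    clip_id: str
    transcript: str
    duration_s: float


def speaking_rate(clip: Clip) -> float:
    """Return words per second for one clip."""
    if clip.duration_s <= 0:
        raise ValueError(f"clip {clip.clip_id} has a non-positive duration")
    return len(clip.transcript.split()) / clip.duration_s


def flag_rate_outliers(clips: list[Clip]) -> list[str]:
    """Return IDs of clips whose speaking rate falls outside the limits."""
    return [c.clip_id for c in clips if not MIN_WPS <= speaking_rate(c) <= MAX_WPS]


clips = [
    Clip("a1", "the weather is clear and calm today", 3.0),  # 7 words in 3 s
    Clip("a2", "please read this sentence slowly", 6.0),     # 5 words in 6 s
]
print(flag_rate_outliers(clips))  # the slow clip "a2" is flagged
```

A flagged clip is not necessarily a bad one. Treat the result as a prompt for a human validator to listen again.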
Automatic Speech Recognition (ASR)
ASR models convert spoken language into written text. For ASR, recordings should:
be as natural and varied as possible, just like real-life conversations
include a wide range of accents and speaking speeds
include many types of speaker (age, gender, region)
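To see whether an ASR dataset covers many types of speaker, you can summarize the speaker metadata. The sketch below is a minimal example under assumptions: the records, the field names (`age_group`, `gender`, `region`), and the `min_share` threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical metadata records; the field names are illustrative only.
speakers = [
    {"speaker_id": "s1", "age_group": "18-29", "gender": "female", "region": "north"},
    {"speaker_id": "s2", "age_group": "30-49", "gender": "male", "region": "north"},
    {"speaker_id": "s3", "age_group": "18-29", "gender": "female", "region": "south"},
    {"speaker_id": "s4", "age_group": "50+", "gender": "female", "region": "north"},
]


def coverage(records, field):
    """Return the share of speakers in each category of one metadata field."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def underrepresented(records, field, min_share):
    """Return categories whose share is below min_share (a hypothetical threshold)."""
    return sorted(v for v, share in coverage(records, field).items() if share < min_share)


for field in ("age_group", "gender", "region"):
    print(field, coverage(speakers, field))

print(underrepresented(speakers, "gender", 0.4))
```

This only reports categories that already appear in the metadata. A group with no speakers at all will not show up, so compare the result against the list of groups you planned to include.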
Key takeaway
Careful recording and fair validation give you voice datasets that are usable and high in quality.
Use clear guidelines and checklists. Make sure your recordings match the goals of your project.
