5.5 TTS versus ASR-oriented recording
Voice data can be used for different types of speech models. For us, it was helpful to understand the goal of the dataset (building TTS or ASR models). We could then adapt our recording instructions and our criteria for validation.
Text-to-Speech (TTS)
TTS models convert text into speech that sounds like a human. These models need a balanced dataset, and the voices should be even and clear. Recordings must:
be neutral and have a consistent tone
use standard pronunciation and a normal speed of speaking
avoid emotional or dramatic delivery and irregular speech patterns
be done in a controlled space with no background noise, following strict quality instructions
Automatic Speech Recognition (ASR)
ASR models convert spoken language into written text. For ASR, recordings should:
be as natural and varied as possible, just like real-life conversations
include a wide range of accents and speaking speeds
include many types of speaker (age, gender, region)
Key takeaway
Clean recordings and fair validation give you voice datasets that are usable for training.
Use clear guidelines and checklists. Make sure your recordings match the goals of your project.
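As an illustration only, the two sets of criteria above could be written down as a small checklist script. This is a minimal sketch, not the tooling used in this project: the `Recording` fields, the function names `tts_failures` and `asr_coverage`, and the idea that a human validator fills in the yes/no fields are all assumptions made for the example. TTS criteria are checked per recording, while ASR criteria are checked across the whole dataset, because variety is a property of the collection rather than of one clip.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Recording:
    # Hypothetical metadata; in this sketch a human validator fills in the booleans.
    speaker_id: str
    accent: str
    age_group: str
    gender: str
    region: str
    neutral_tone: bool
    standard_pronunciation: bool
    normal_speed: bool
    background_noise: bool


def tts_failures(rec: Recording) -> list[str]:
    """Return the TTS checklist items that this single recording fails."""
    checks = {
        "neutral, consistent tone": rec.neutral_tone,
        "standard pronunciation": rec.standard_pronunciation,
        "normal speaking speed": rec.normal_speed,
        "no background noise": not rec.background_noise,
    }
    return [name for name, passed in checks.items() if not passed]


def asr_coverage(recordings: list[Recording]) -> dict[str, Counter]:
    """Count how many recordings fall into each speaker category.

    For ASR the goal is variety, so the result is a dataset-level summary
    that shows which accents, age groups, genders, or regions are missing
    or under-represented.
    """
    attributes = ("accent", "age_group", "gender", "region")
    return {
        attr: Counter(getattr(rec, attr) for rec in recordings)
        for attr in attributes
    }
```

For a TTS project, a recording would be accepted only when `tts_failures` returns an empty list. For an ASR project, the counts from `asr_coverage` can guide which speakers to recruit next.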