5.5 TTS versus ASR-oriented recording

Voice recordings can be used for different types of speech models. It was helpful to understand the goal of the dataset first (building TTS or ASR models). We could then adapt our recording instructions and our validation criteria.

Text-to-Speech (TTS)

TTS models convert text into speech that sounds like a human. These models need a balanced dataset with even, clear voices. Recordings must:

  • be neutral and have a consistent tone

  • use standard pronunciation and a normal speed of speaking

  • avoid emotional or dramatic delivery and irregular speech patterns

  • be done in a controlled space with no background noise, following strict quality instructions
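
The TTS criteria above can be partly checked automatically before a human validator listens to a clip. The following is a minimal sketch, not a definitive implementation: the function name `check_tts_clip`, the thresholds, and the assumption that audio arrives as a list of floats in the range [-1, 1] are all hypothetical and should be tuned to your own project.

```python
import math

# Hypothetical thresholds; adjust them for your language and recording setup.
MAX_NOISE_FLOOR_DB = -50.0          # quietest frames should sit below this level (dBFS)
MAX_CLIPPED_RATIO = 0.001           # share of samples at or near full scale
WORDS_PER_SEC_RANGE = (1.5, 3.5)    # rough band for a "normal" speaking speed


def rms_dbfs(frame):
    """Root-mean-square level of a frame of floats in [-1, 1], in dBFS."""
    if not frame:
        return -math.inf
    mean_sq = sum(s * s for s in frame) / len(frame)
    return 10 * math.log10(mean_sq) if mean_sq > 0 else -math.inf


def check_tts_clip(samples, sample_rate, transcript, frame_ms=20):
    """Return a list of problems found; an empty list means the clip passes."""
    if not samples:
        return ["recording is empty"]
    problems = []

    # Background noise: use the quietest 10% of frames as the noise estimate.
    frame_len = max(1, sample_rate * frame_ms // 1000)
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    levels = sorted(rms_dbfs(f) for f in frames)
    noise_floor = max(levels[:max(1, len(levels) // 10)])
    if noise_floor > MAX_NOISE_FLOOR_DB:
        problems.append(f"background noise too high ({noise_floor:.1f} dBFS)")

    # Clipping: samples at full scale usually mean the input gain was too high.
    clipped = sum(1 for s in samples if abs(s) >= 0.999) / len(samples)
    if clipped > MAX_CLIPPED_RATIO:
        problems.append(f"clipping detected ({clipped:.2%} of samples)")

    # Speaking speed: words in the prompt divided by clip duration.
    duration = len(samples) / sample_rate
    rate = len(transcript.split()) / duration
    low, high = WORDS_PER_SEC_RANGE
    if not low <= rate <= high:
        problems.append(f"speaking rate out of range ({rate:.2f} words/s)")

    return problems
```

A check like this only covers noise, clipping, and speed. Neutral tone and standard pronunciation still need a human listener.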

Automatic Speech Recognition (ASR)

ASR models convert spoken language into written text. For ASR, recordings should:

  • be as natural and varied as possible, just like real-life conversations

  • include a wide range of accents and speaking speeds

  • include many types of speaker (age, gender, region)
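
For ASR, one way to keep an eye on speaker variety is to summarize the speaker metadata as recordings come in. The sketch below assumes each speaker is a dictionary with hypothetical `age_group`, `gender`, and `region` keys; the function name `coverage_report` and the `max_share` threshold are also illustrative assumptions.

```python
from collections import Counter


def coverage_report(speakers, fields=("age_group", "gender", "region"), max_share=0.6):
    """Summarize how speakers are spread across each metadata field.

    A field is flagged as skewed when a single value accounts for more
    than `max_share` of all speakers.
    """
    if not speakers:
        return {}
    report = {}
    for field in fields:
        # Speakers without the field are grouped under "unknown".
        counts = Counter(s.get(field, "unknown") for s in speakers)
        dominant, dominant_n = counts.most_common(1)[0]
        share = dominant_n / len(speakers)
        report[field] = {
            "counts": dict(counts),
            "dominant": dominant,
            "dominant_share": share,
            "skewed": share > max_share,
        }
    return report


# Example with made-up speakers.
speakers = [
    {"age_group": "18-29", "gender": "f", "region": "north"},
    {"age_group": "30-49", "gender": "f", "region": "south"},
    {"age_group": "50+", "gender": "f", "region": "east"},
    {"age_group": "18-29", "gender": "m", "region": "west"},
]
for field, summary in coverage_report(speakers).items():
    print(field, summary["dominant"], summary["skewed"])
```

A skewed field is a prompt to recruit more speakers from the under-represented groups, not a reason to reject recordings that already exist.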

Key takeaway

High-quality recordings and fair validation produce voice datasets that are usable for model building.

Use clear guidelines and checklists. Make sure your recordings match the goals of your project.
