# 5.5 TTS versus ASR-oriented recording

[Voice data](#user-content-fn-1)[^1] can be used for different types of [language technology](#user-content-fn-2)[^2]. For [TWB Voice](#user-content-fn-3)[^3], it was helpful to understand the goal of each dataset[^4] (building TTS[^5] or ASR[^6]), so that we could adapt our recording instructions and our validation criteria accordingly.

### Text-to-Speech (TTS)

TTS models[^7] convert text into speech that sounds like a human. These models need a balanced dataset with clear, consistent voices. Recordings must:

* be neutral and have a consistent tone
* use standard pronunciation and a normal speed of speaking
* avoid emotional or dramatic delivery and irregular speech patterns
* be made in a controlled space with no background noise, following strict quality guidelines

### Automatic Speech Recognition (ASR)

ASR models convert spoken language into written text. For ASR, recordings should:

* be as natural and varied as possible, just like real-life conversations
* include a wide range of accents, speaking speeds, and styles
* include many types of speakers (age, gender, region)

| 💡 <mark style="color:blue;">**Key takeaway**</mark> |
| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| <p>High-quality recordings and consistent validation will give you usable, high-quality voice datasets.</p><p></p><p>Use clear guidelines and checklists, and make sure your recordings match the goals of your project.</p> |
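To make the idea of a voice dataset concrete, here is a minimal sketch of what one dataset entry and a purpose-specific completeness check might look like. All field names (`audio_path`, `recording_environment`, and so on) are illustrative assumptions for this sketch, not the actual TWB Voice schema.

```python
def validate_entry(entry, purpose):
    """Return a sorted list of fields missing from a dataset entry.

    `purpose` is "tts" or "asr"; the required metadata differs because
    TTS needs controlled recording conditions, while ASR benefits from
    documented speaker diversity. Field names are illustrative only.
    """
    required = {"audio_path", "transcription", "language"}
    if purpose == "tts":
        # TTS needs consistent, quiet recordings, so we expect a note
        # on where and how the audio was captured.
        required |= {"recording_environment"}
    else:
        # ASR benefits from varied speakers, so demographic metadata
        # (age group, gender, accent) matters more.
        required |= {"speaker_age_group", "speaker_gender", "accent"}
    return sorted(required - entry.keys())

# One entry: an audio file paired with its transcription and metadata.
entry = {
    "audio_path": "recordings/clip_0001.wav",
    "transcription": "Good morning, how are you?",
    "language": "Hausa",
    "speaker_age_group": "25-34",
    "speaker_gender": "female",
    "accent": "Kano",
}

print(validate_entry(entry, "asr"))  # complete for ASR
print(validate_entry(entry, "tts"))  # missing the recording environment
```

This kind of lightweight check is one way a validation step can confirm that each recording carries the metadata the project's goal (TTS or ASR) actually needs.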

<br>

[^1]: **Voice data:** Audio recordings of human speech. These recordings capture the acoustic features of spoken language, such as pronunciation, speaking patterns, and rhythm.

[^2]: **Language Technology (LT):** Technologies that focus on human language, including both spoken and written language. They can process, understand, and generate language. Examples are the tools on your phone or computer that understand and generate words, like translation apps or voice assistants. They allow us to communicate and interact with our devices with language. When you talk to a virtual assistant like Siri or Alexa, or your phone suggests the next word in a message, that’s because of language technology. It makes technology more accessible and user-friendly.

[^3]: **TWB Voice:** A platform for collecting voice data, developed and owned by CLEAR Global. Users can make voice recordings to help with active data collection projects in TWB Voice by [signing up to the TWB Community](https://translatorswithoutborders.org/join-the-twb-community/). The main goal of TWB Voice is to help develop voice technology for speakers of marginalized languages, for example by creating the voice datasets that are needed to build language models for TTS and ASR.

[^4]: **Dataset:** A collection of information that has been organized for use. A **voice dataset** is a collection of voice recordings (paired with transcriptions) plus additional information (metadata), such as the gender and age of the person recording. This metadata documents how the dataset is constructed and helps avoid bias. Voice datasets are used in research and for training or improving voice models.

[^5]: **Text-to-Speech (TTS):** Technology that converts written text into spoken language. You can find TTS in accessibility tools like screen readers and in virtual assistants. It brings written texts to life through spoken words. If your phone reads out the messages you get, or an audiobook tells you your favorite story, this is because of TTS technology.

[^6]: **Automatic Speech Recognition (ASR) or Speech-to-Text (STT):** ASR converts spoken language into text. It can be used in voice assistants and transcription services, for example. You may hear both terms, ASR and STT, and they are almost the same, but STT can be a semi-manual process, while ASR is fully automated. ASR is like a smart listener that turns spoken words into written text on your device. While it does create the text automatically, it often needs human review to make sure everything is correct, since it can make mistakes, especially in languages that are less represented in the digital space.

[^7]: **Model:** A computer-based system that has made use of data to learn patterns. It can make predictions or generate language. In speech technology, models use voice data to learn to recognize speech (converting audio to text) or to create speech (converting text to audio).
