2.3 Read voice data versus spontaneous voice data
Two types of voice data are needed to develop effective speech technologies. Read voice data is used for the foundation. Spontaneous voice data helps to make the system more robust and sound more natural.
Read voice data
Read voice data consists of recordings of speakers who read out a prepared text. This type of data:
is highly controlled and structured
has clear pronunciation and a steady pace
follows standardized text formats
When building speech technologies for a new language, read voice data is often the first type of data that you would collect. It provides clean, predictable input for developing speech technology.
It's important to understand the difference between reading a text aloud and more natural communication. When they read aloud, people usually speak:
more formally
with fewer grammar errors
with more complete sentences
in a more monotone voice
without using words like "um" or "uh"
without restarting sentences
following the rules of written language
This means that read speech can sometimes sound a bit artificial. Technologies that only use read speech may not work well in real-life situations.
Spontaneous voice data
Spontaneous voice data consists of natural speech without a script. Speakers may not finish their sentences and they use regional expressions. They make constant changes to their speech, responding to feedback from listeners. This type of data:
contains natural speech patterns, such as hesitations and words like “um” or “uh”
includes varied speaking styles and speeds
may include dialects and everyday phrases that you wouldn’t find in written language
often contains overlapping speech, background noise, and interruptions
gives a better picture of real-world communication
Spontaneous voice data is more difficult to process than read voice data, but you must include it if you want to develop robust speech technologies that can handle real-world situations.
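One of the differences described above, the presence of words like "um" or "uh", can be measured directly in transcripts. The following is a minimal sketch, not a prescribed method: the function name `filler_rate`, the filler list, and both sample transcripts are assumptions made for illustration. A real project would need a filler list suited to its own language.

```python
import re

# Assumed filler words; the text above only names "um" and "uh".
FILLERS = {"um", "uh"}


def filler_rate(transcript: str) -> float:
    """Return the fraction of words in a transcript that are filler words."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    if not tokens:
        return 0.0
    return sum(token in FILLERS for token in tokens) / len(tokens)


# Hypothetical transcripts, invented for illustration only.
read_sample = "The weather will be sunny tomorrow."
spontaneous_sample = "Um, I think it's, uh, going to be sunny tomorrow."

print("read:", filler_rate(read_sample))
print("spontaneous:", filler_rate(spontaneous_sample))
```

A simple measure like this could help check whether a batch of recordings labelled as spontaneous actually contains natural hesitations, although it says nothing about dialect, overlapping speech, or background noise.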