2.3 Read voice data versus spontaneous voice data

Two types of voice data are needed to develop effective speech technologies. Read voice data is used for the foundation. Spontaneous voice data helps to make the system more robust and sound more natural.

Read voice data

Read voice data consists of recordings of speakers who read out a prepared text. This type of data:

  • is highly controlled and structured

  • has clear pronunciation and a steady pace

  • follows standardized text formats

When building speech technologies for a new language, read voice data is often the first type of data that you would collect. It provides clean, predictable input for developing speech technology.

It's important to understand the difference between reading a text aloud and more natural communication. When they read aloud, people usually speak:

  • more formally

  • with fewer grammar errors

  • with more complete sentences

  • in a more monotone voice

  • without using words like "um" or "uh"

  • without restarting sentences

  • following the rules of written language

This means that read speech can sometimes sound a bit artificial. Technologies that only use read speech may not work well in real-life situations.

Spontaneous voice data

Spontaneous voice data consists of natural, unscripted speech. Speakers may leave sentences unfinished, use regional expressions, and continually adjust what they say in response to feedback from listeners. This type of data:

  • contains natural speech patterns such as hesitations and filler words like “um” or “uh”

  • includes varied speaking styles and speeds

  • may include dialects and everyday phrases that you wouldn’t find in written language

  • often contains overlapping speech, background noise, and interruptions

  • gives a better picture of real-world communication

Spontaneous voice data is more difficult to process than read voice data, but you must include it if you want to develop robust speech technologies that can handle real-world situations.
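
One reason spontaneous speech takes more effort to process is that its transcripts contain hesitation markers such as "um" or "uh", which read speech usually lacks. As a minimal sketch, assuming plain-text English transcripts and a hypothetical filler list (FILLERS), the Python snippet below estimates how often such markers occur. The function name filler_rate and the sample sentences are illustrative only and do not come from any particular toolkit.

```python
import re

# Hypothetical filler list (an assumption for this sketch).
# A real project would define fillers separately for each language.
FILLERS = {"um", "uh", "er", "hmm"}


def filler_rate(transcript: str) -> float:
    """Return the number of filler words per 100 words (0.0 if empty)."""
    # Lowercase the text and keep simple word tokens only.
    words = re.findall(r"[a-z']+", transcript.lower())
    if not words:
        return 0.0
    fillers = sum(1 for word in words if word in FILLERS)
    return 100.0 * fillers / len(words)


# Illustrative samples, not taken from a real dataset.
read_sample = "The weather today is sunny with a light breeze."
spontaneous_sample = "So, um, the weather is, uh, kind of nice today, I guess."

print(filler_rate(read_sample))
print(filler_rate(spontaneous_sample))
```

A rate near zero is what you would typically expect from read recordings, while a noticeably higher rate suggests spontaneous speech. The filler list and the tokenization rule would both need to be adapted for each language.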
