# 2. What is voice data?

#### <mark style="color:blue;">Chapter 2 overview:</mark>

{% hint style="info" %}
You don’t need to be a technical expert for this chapter. It’s for anyone who wants to understand the basics of voice technology. It includes:

* an introduction to **voice data**—the basis for all speech technologies
* a look at the **main types of speech technology**, text-to-speech (TTS) and automatic speech recognition (ASR), and why some languages don’t have voice technology support
* the difference between **read speech** and **spontaneous speech**.
  {% endhint %}

[Voice data](#user-content-fn-1)[^1] consists of audio recordings of human speech, ideally paired with transcriptions of what the person said and relevant metadata[^2]. We need it to develop language technologies that can process, understand, or generate spoken language.

Each recording in a [voice data collection](#user-content-fn-3)[^3] helps to "train" machines to recognize or generate human speech. We collect voice recordings and their transcriptions as examples that algorithms use to build models[^4]. These models are computer-based systems that can recognize patterns in human speech. They analyze thousands or millions of speech samples to learn how speech sounds relate to words and phrases. Voice data from a wide range of speakers helps the models get better at recognizing or generating speech across different accents, speaking styles, and acoustic settings.
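
For readers who like to see things concretely, here is a minimal sketch of what one entry in a voice data collection might look like: a recording, its transcription, and some metadata. The field names, file paths, and accent labels are made up for illustration and do not follow any particular dataset format.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class VoiceRecord:
    """One hypothetical entry in a voice data collection."""
    audio_path: str      # where the recording is stored (illustrative path)
    transcription: str   # what the speaker said
    metadata: dict = field(default_factory=dict)  # speaker and recording details


def count_by(records, key):
    """Count how many records share each value of a metadata field."""
    return Counter(r.metadata.get(key, "unknown") for r in records)


# Three made-up records; real collections hold thousands or millions.
records = [
    VoiceRecord("clips/0001.wav", "good morning", {"accent": "coastal", "age": "20-29"}),
    VoiceRecord("clips/0002.wav", "how are you", {"accent": "inland", "age": "40-49"}),
    VoiceRecord("clips/0003.wav", "thank you", {"accent": "coastal"}),
]

accent_counts = count_by(records, "accent")
age_counts = count_by(records, "age")
```

Counting records by a metadata field, as `count_by` does here, is one simple way to check whether a collection covers a wide range of speakers.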

<br>

[^1]: **Voice data:** Audio recordings of human speech. These recordings capture the acoustic features of spoken language, such as pronunciation, speaking patterns, and rhythm.

[^2]: **Metadata:** Extra information that gives some background to a dataset or parts of a dataset. Examples are demographics of the speakers (age, gender, accent), recording conditions (microphone type, level of background noise), or language-related details (dialect, speed of speaking, emotional tone).

[^3]: **Voice data collection:** Gathering recordings of speech with their transcriptions in a systematic and ethical way. It also involves collecting demographic data about the speakers (age, gender, accent). For Automatic Speech Recognition, the collection should include a wide range of speakers. The voice data is used in research and for training or developing models for voice technology.

[^4]: **Model:** A computer-based system that has learned patterns from data. It can make predictions or generate language. In speech technology, models use voice data to learn to recognize speech (converting audio to text) or to create speech (converting text to audio).
