# 5.5 TTS versus ASR-oriented recording

[Voice data](#user-content-fn-1)[^1] can be used for different types of [language technology](#user-content-fn-2)[^2]. For [TWB Voice](#user-content-fn-3)[^3], it was helpful to understand the goal of each dataset[^4] (building TTS[^5] or ASR[^6]), so that we could adapt our recording instructions and our validation criteria accordingly.

### Text-to-Speech (TTS)

TTS models[^7] convert text into speech that sounds like a human. These models need a balanced dataset with clear, consistent voices. Recordings must:

* be neutral and have a consistent tone
* use standard pronunciation and a normal speed of speaking
* avoid emotional or dramatic delivery and irregular speech patterns
* be made in a controlled space with no background noise, following strict quality guidelines

### Automatic Speech Recognition (ASR)

ASR models convert spoken language into written text. For ASR, recordings should:

* be as natural and varied as possible, just like real-life conversations
* include a wide range of accents, speaking speeds, and styles
* include many types of speakers (age, gender, region)

| 💡 <mark style="color:blue;">**Key takeaway**</mark> |
| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| <p>High-quality recordings and consistent validation will give you usable, high-quality voice datasets.</p><p></p><p>Use clear guidelines and checklists, and make sure your recordings match the goals of your project.</p> |
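To make the idea of a voice dataset concrete, here is a minimal sketch of what one dataset entry and a purpose-specific completeness check might look like. All field names (`audio_path`, `recording_environment`, and so on) are illustrative assumptions for this sketch, not the actual TWB Voice schema.

```python
def validate_entry(entry, purpose):
    """Return a sorted list of fields missing from a dataset entry.

    `purpose` is "tts" or "asr"; the required metadata differs because
    TTS needs controlled recording conditions, while ASR benefits from
    documented speaker diversity. Field names are illustrative only.
    """
    required = {"audio_path", "transcription", "language"}
    if purpose == "tts":
        # TTS needs consistent, quiet recordings, so we expect a note
        # on where and how the audio was captured.
        required |= {"recording_environment"}
    else:
        # ASR benefits from varied speakers, so demographic metadata
        # (age group, gender, accent) matters more.
        required |= {"speaker_age_group", "speaker_gender", "accent"}
    return sorted(required - entry.keys())

# One entry: an audio file paired with its transcription and metadata.
entry = {
    "audio_path": "recordings/clip_0001.wav",
    "transcription": "Good morning, how are you?",
    "language": "Hausa",
    "speaker_age_group": "25-34",
    "speaker_gender": "female",
    "accent": "Kano",
}

print(validate_entry(entry, "asr"))  # complete for ASR
print(validate_entry(entry, "tts"))  # missing the recording environment
```

This kind of lightweight check is one way a validation step can confirm that each recording carries the metadata the project's goal (TTS or ASR) actually needs.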

<br>

[^1]: **Voice data:** Audio recordings of human speech. These recordings capture the acoustic features of spoken language, such as pronunciation, speaking patterns, and rhythm.

[^2]: **Language Technology (LT):** Technologies that focus on human language, including both spoken and written language. They can process, understand, and generate language. Examples are the tools on your phone or computer that understand and generate words, like translation apps or voice assistants. They allow us to communicate and interact with our devices with language. When you talk to a virtual assistant like Siri or Alexa, or your phone suggests the next word in a message, that’s because of language technology. It makes technology more accessible and user-friendly.

[^3]: **TWB Voice:** A platform for collecting voice data, developed and owned by CLEAR Global. Users can make voice recordings to help with active data collection projects in TWB Voice by [signing up to the TWB Community](https://translatorswithoutborders.org/join-the-twb-community/). The main goal of TWB Voice is to help develop voice technology for speakers of marginalized languages, for example by creating the voice datasets that are needed to build language models for TTS and ASR.

[^4]: **Dataset:** A collection of information that has been organized for use. A **voice dataset** is a collection of voice recordings (paired with transcriptions) plus additional information (metadata), such as the gender and age of the person recording. This metadata documents how the dataset is constructed and helps avoid bias. Voice datasets are used in research and for training or improving voice models.

[^5]: **Text-to-Speech (TTS):** Technology that converts written text into spoken language. You can find TTS in accessibility tools like screen readers and in virtual assistants. It brings written texts to life through spoken words. If your phone reads out the messages you get, or an audiobook tells you your favorite story, this is because of TTS technology.

[^6]: **Automatic Speech Recognition (ASR) or Speech-to-Text (STT):** ASR converts spoken language into text. It can be used in voice assistants and transcription services, for example. You may hear both terms, ASR and STT, and they are almost the same, but STT can be a semi-manual process, while ASR is fully automated. ASR is like a smart listener that turns spoken words into written text on your device. While it does create the text automatically, it often needs human review to make sure everything is correct, since it can make mistakes, especially in languages that are less represented in the digital space.

[^7]: **Model:** A computer-based system that has made use of data to learn patterns. It can make predictions or generate language. In speech technology, models use voice data to learn to recognize speech (converting audio to text) or to create speech (converting text to audio).
