# 1. Introduction

### Welcome to the playbook for voice data collection for low-resource languages!

This playbook will help you to plan and manage projects to collect [voice data](#user-content-fn-1)[^1] for [low-resource languages](#user-content-fn-2)[^2]. It is aimed at both new and experienced teams and covers the full process, from setting up the project to publishing your [dataset](#user-content-fn-3)[^3]. We draw on CLEAR Global’s experience with our data collection platform “[TWB Voice](#user-content-fn-4)[^4]”. The playbook also covers aspects of data collection that apply to organizations and communities who want to collect voice data through other platforms or initiatives.

Around four billion people lack access to voice technologies like speech recognition and conversational AI. This is because their languages don’t have the data available to build these tools. This playbook aims to help address this gap by outlining the key steps, challenges, and best practices for collecting voice data in low-resource languages in an effective and ethical way.

In this chapter, we introduce TWB Voice, CLEAR Global’s new platform for collecting voice data. We refer to TWB Voice throughout this playbook. We also explain key terms and concepts in [voice technology](#user-content-fn-5)[^5], and show you how to use the playbook.

| <mark style="color:blue;">**How CLEAR Global can help**</mark>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| <p>CLEAR Global’s mission is to help people get vital information and be heard, whatever language they speak. We help our partner organizations to listen to the communities they work with and communicate with them effectively.</p><p></p><p>Our tech-focused work helps organizations to find use cases where language technology could help users get more actively engaged and scale up communications efforts. We develop language AI solutions such as chatbots, machine translation, and speech solutions for low-resource languages. These are languages that don’t have enough data to create such language solutions. We also work with partners to help them collect voice data to build the resources for these technologies.</p><p></p><p>CLEAR Global’s user experience (UX) team can help with user research and UX design, and advise on applying human-centered design to tech interventions.</p><p></p><p>Our Language Services team can translate messages and documents into local languages, help with audio translations and pictures, train staff and volunteers, and give advice on two-way communication. We also work with partners to field test materials and make them easier to understand so they will have more impact. This work is backed up by research and language mapping to assess the communication needs of target populations.</p><p></p><p>For more information, go to our <a href="http://clearglobal.org">website</a> or contact us at <a href="mailto:info@clearglobal.org">info@clearglobal.org</a>.</p> |

[^1]: **Voice data:** Audio recordings of human speech. These recordings capture the acoustic features of spoken language, such as pronunciation, speaking patterns, and rhythm.

[^2]: **Low-resource language:** A language that has limited written or recorded materials and is rarely found in digital tools, data, or technology. This means that technologies like speech recognition or machine translation are not available for the language and are difficult to build. This may be due to a lack of written content, digital resources (like websites or videos), or support from organizations that develop language technology.

[^3]: **Dataset:** A collection of information that has been organized for use. A **voice dataset** is a collection of voice recordings paired with transcriptions and additional information (metadata), such as the gender and age of the person recording. The metadata documents how the dataset was constructed and helps to avoid bias. Voice datasets are used in research and for training or improving voice models.

[^4]: **TWB Voice:** A platform for collecting voice data, developed and owned by CLEAR Global. Users can contribute voice recordings to active data collection projects in TWB Voice by [signing up to the TWB Community](https://translatorswithoutborders.org/join-the-twb-community/). The main goal of TWB Voice is to help develop voice technology for speakers of marginalized languages, for example by creating the voice datasets needed to build language models for text-to-speech (TTS) and automatic speech recognition (ASR).

[^5]: **Voice technology:** Language technology that processes or generates spoken words.
