# 3. Setting up a project to collect voice data

#### <mark style="color:blue;">Chapter 3 overview:</mark>

{% hint style="info" %}
This chapter is for project coordinators, organizations, and linguists who want to set up a project to collect voice data. It explains how to:

* **define the scope of a project** in a new language or domain
* **design your project** and make a work plan
* **build your project team**.

You don’t need to be a technical expert to read this chapter. It is designed for anyone who wants to set up a project.
{% endhint %}

### Why set up a project to collect voice data?

Before you start a project to collect voice data, define your goals so that the data you collect will be useful. This is especially important in [low-resource language](#user-content-fn-1)[^1] projects. Because very little [voice data](#user-content-fn-2)[^2] is available for these languages, you need to avoid collecting data that is not useful or relevant.

Here are some typical questions to ask:

* What are the main goals of your project?
* Who will it help?
* What problem do you want to solve?
* How will the data be used?
* How will your project support and empower the language community involved?

The answers will help you to design your project. For example:

* If you plan to use the datasets[^3] to build tools like speech recognition, the data must meet technical requirements for language variant, quality, and metadata[^4].
* If you plan to use the datasets to build a speech solution for a specific use case, the data should match the context. For example, if you want to build a voice assistant for farmers, you may need to collect data in rural dialects, on topics relating to farming, and in real-world outdoor settings.

[^1]: **Low-resource language:** A language that has limited written or recorded materials and is rarely found in digital tools, data, or technology. This means that technologies like speech recognition or machine translation are not available for the language and are difficult to build. This may be due to a lack of written content, digital resources (like websites or videos), or support from organizations that develop language technology.

[^2]: **Voice data:** Audio recordings of human speech. These recordings capture the acoustic features of spoken language, such as pronunciation, speaking patterns, and rhythm.

[^3]: **Dataset:** A collection of information that has been organized for use. A **voice dataset** is a collection of voice recordings (paired with transcriptions) with additional information (metadata), such as the gender and age of the speaker. The metadata gives more information on how the dataset is constructed and helps to avoid bias. Voice datasets are used in research and for training or improving voice models.

[^4]: **Metadata:** Extra information that gives some background to a dataset or parts of a dataset. Examples are demographics of the speakers (age, gender, accent), recording conditions (microphone type, level of background noise), or language-related details (dialect, speed of speaking, emotional tone).