# 3.1 Designing your project

A project to collect [voice data](#user-content-fn-1)[^1] often involves the following stages:

{% stepper %}
{% step %}
**Defining the scope and setup**

Define your project’s goals, [target languages](#user-content-fn-2)[^2] and dialects. Plan how you will find and support contributors. Decide on tools and platforms. Set timelines.
{% endstep %}

{% step %}
**Preparing a collection of sentences (text corpus)**

If your project uses prompts, find or prepare enough text sentences to record on the data collection platform.
{% endstep %}

{% step %}
**Data collection and checking quality (validation)**

Find contributors whose voices you can record. Before you start collecting voice data, decide how you will divide up the prompts among contributors and set rules for recording (for example, recording duration and quality standards). Define your workflow for validating recordings.
{% endstep %}

{% step %}
**Data processing**

Download the recordings and get them ready for publication (for example, formatting, cleaning).
{% endstep %}

{% step %}
**Publication and sharing**

Prepare the dataset[^3] for release. Make sure it meets licensing and privacy requirements. Provide documentation for users.
{% endstep %}
{% endstepper %}

You can find more details on these steps in later chapters of this playbook.
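
As a rough illustration of the text-corpus stage, the sketch below estimates how many sentences count as "enough" for a given recording target. It is a planning aid under stated assumptions, not a method taken from this playbook: the function names are hypothetical, and the average recording length is a placeholder that you should replace with a figure measured in your own project.

```python
import math


def prompts_needed(target_hours: float,
                   avg_seconds_per_recording: float,
                   recordings_per_prompt: int = 1) -> int:
    """Estimate how many distinct prompts are needed to reach a target
    number of recorded hours (hypothetical helper for planning)."""
    if target_hours < 0 or avg_seconds_per_recording <= 0 or recordings_per_prompt < 1:
        raise ValueError("inputs must be positive")
    # Total recordings needed to fill the target duration.
    total_recordings = target_hours * 3600 / avg_seconds_per_recording
    # Each prompt can be recorded by several different contributors.
    return math.ceil(total_recordings / recordings_per_prompt)


def hours_covered(num_prompts: int,
                  avg_seconds_per_recording: float,
                  recordings_per_prompt: int = 1) -> float:
    """Estimate how many hours an existing set of prompts can yield."""
    return num_prompts * recordings_per_prompt * avg_seconds_per_recording / 3600
```

For example, with an assumed average of 5 seconds per recording and three contributors recording each prompt, `prompts_needed(50, 5.0, 3)` gives 12000 prompts for a 50-hour target, and `hours_covered(1500, 5.0, 3)` gives 6.25 hours. Running numbers like these early shows whether you need to create more prompts or allow more recordings per prompt.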

You will also need to build monitoring and evaluation into the whole project: track your progress, get feedback from contributors, and check data quality. After publication, evaluate how the data is used and what impact it has.

| <img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXedPK8NKs625txxlFmXTt0YZX3yKg9QPd118ymmfVwjXrd3W_hHNObPNJ5ff0HOWoio9NtqQOL0eIzTVH2B6jCBLB3A6npLptGqxqPrZDT19fK9BIX0ecJ3_9txEQzAhPKZ4JcP?key=LjOaNqlneHjM8MYR-1Jh9w" alt=":magnifying_glass:" data-size="line">  <mark style="color:blue;">**Case study: Defining the scope of CLEAR Global’s pilot project to collect voice data in Hausa, Kanuri, and Shuwa Arabic**</mark> |
| --- |
| <p>The first projects in <a data-footnote-ref href="#user-content-fn-4">TWB Voice</a> collected voice data in Hausa, Kanuri, and Shuwa Arabic. The aim was to build open-source voice datasets in these languages.</p><p></p><p>The project’s goals were:</p><ul><li>to increase the amount of voice data available in Hausa, Kanuri, and Shuwa Arabic in order to train <a data-footnote-ref href="#user-content-fn-5">Automatic Speech Recognition (ASR)</a> <a data-footnote-ref href="#user-content-fn-6">models</a>,</li><li>to aid development of <a data-footnote-ref href="#user-content-fn-7">voice technology</a> for specific use cases, such as Hajiya, a text-based chatbot that CLEAR Global developed to help communities affected by crises in northeast Nigeria,</li><li>to work with communities to respond to user feedback in the TWB Voice platform and grow networks of contributors.</li></ul><p>The project involved volunteer linguists from the <a data-footnote-ref href="#user-content-fn-8">TWB Community</a> in northeast Nigeria.</p><p></p><p><strong>Defining the scope of the pilot</strong></p><p></p><p>With a 13-month timeline, targets were set to collect:</p><p></p><ul><li><strong>50-100</strong> <strong>hours</strong> of ASR data per language</li><li><strong>10-20</strong> <strong>hours</strong> of single-speaker data for <a data-footnote-ref href="#user-content-fn-9">Text-to-Speech</a> in Hausa and Kanuri</li></ul><p></p><p>For the ASR datasets, the aim was to collect a wide variety of recordings and make the datasets openly available to others working to improve voice technology in these languages. The Hajiya chatbot was a useful example, but the priority here was to create datasets for broad use that included many dialects and speakers from different backgrounds. The speakers gave details of their dialect, age, gender, and education so that future users would understand the range of people included in the dataset.</p><p></p><p>In contrast, the TTS data collection was more focused, with the Hajiya chatbot as a specific example. During the chatbot’s development, <a data-footnote-ref href="#user-content-fn-10">community members</a> commented that its text-only design made it hard to use for people with poor reading and writing skills. They suggested the kind of voice that would make the chatbot seem friendly and easy to use. The project chose the name Hajiya because it represents a respected, middle-aged woman and role model in Yobe, Borno, and Adamawa states of Nigeria. We chose voice artists for the Hausa and Kanuri TTS datasets who would match this persona.</p><p></p><p><strong>Dealing with gaps in the body of available texts</strong></p><p></p><p>One major challenge in defining the scope of the project was finding enough suitable texts for prompts. We found some sentences for Hausa and Kanuri during the project setup stage. However, we found very few for Shuwa Arabic: only about 1,500 usable sentences with open licenses or owned by CLEAR Global. For all three languages, linguists created new prompts or changed existing ones so we had enough material relevant to the cultural context. Because the number of prompts was limited, we set up projects in TWB Voice so that up to three people could record each prompt.</p><p></p><p><strong>Designing the workflow for quality checking (validation)</strong></p><p></p><p>We had only a few hundred contributors but needed to be sure of good quality, so we designed a tiered system to check the quality of the recordings:</p><ul><li>Contributors could listen back and self-check recordings before submitting them.</li><li>If contributors had made at least one hour of voice recordings and had received an approval rate of 80% or higher, they were allowed to rate the recordings of others (tier 1).</li><li><a data-footnote-ref href="#user-content-fn-11">Language Leads</a> gained a “trusted linguist” status. This meant that we gave their <a data-footnote-ref href="#user-content-fn-12">ratings</a> higher weighting (tier 2).</li></ul> |

<br>
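
The tiered validation rules in the case study can be sketched in code. The one-hour and 80% thresholds come from the case study. Everything else is an assumption made for illustration: the names, the "tier 0" label for contributors who cannot yet rate others, the reading of "80% approval" as the share of a contributor's own recordings that were approved, and the numeric weights (the case study says only that Language Lead ratings were weighted higher, not by how much). This is not the TWB Voice implementation.

```python
from dataclasses import dataclass

MIN_RECORDED_HOURS = 1.0  # threshold stated in the case study
MIN_APPROVAL_RATE = 0.80  # threshold stated in the case study

# Hypothetical weights: the case study does not give actual values.
TIER_WEIGHTS = {1: 1.0, 2: 2.0}


@dataclass
class Contributor:
    recorded_hours: float
    approval_rate: float  # assumed: share of own recordings approved, 0.0 to 1.0
    is_language_lead: bool = False


def validation_tier(contributor: Contributor) -> int:
    """Return 2 for Language Leads, 1 for eligible raters, 0 otherwise."""
    if contributor.is_language_lead:
        return 2
    if (contributor.recorded_hours >= MIN_RECORDED_HOURS
            and contributor.approval_rate >= MIN_APPROVAL_RATE):
        return 1
    return 0


def weighted_approval(ratings: list[tuple[Contributor, bool]]) -> float:
    """Share of total rating weight that approves a recording.

    Ratings from tier 0 contributors carry no weight and are ignored.
    """
    total = 0.0
    approved = 0.0
    for rater, approves in ratings:
        weight = TIER_WEIGHTS.get(validation_tier(rater), 0.0)
        total += weight
        if approves:
            approved += weight
    if total == 0:
        raise ValueError("no ratings from eligible raters")
    return approved / total
```

A project would still need to choose the weighted share at which a recording passes. That decision rule is not described in the case study, so it is left out of the sketch.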

[^1]: **Voice data:** Audio recordings of human speech, usually paired with a transcription. These recordings capture the acoustic features of spoken language, such as pronunciation, speaking patterns, and rhythm.

[^2]: **Target language:** The language(s) for which you are collecting the voice data.

[^3]: **Dataset:** A collection of information that has been organized for use. A **voice dataset** is a collection of voice recordings (paired with transcriptions) with additional information (metadata), such as the gender and age of the person recording. The metadata shows how the dataset was constructed and helps to avoid bias. A voice dataset is used in research and for training or improving voice models.

[^4]: **TWB Voice:** A platform for collecting voice data. CLEAR Global developed it and owns it. Users can make voice recordings to help with active data collection projects in TWB Voice by [signing up to the TWB Community](https://translatorswithoutborders.org/join-the-twb-community/). The main goal of TWB Voice is to help develop voice technology for speakers of marginalized languages, for example by creating the voice datasets that are needed to build language models for TTS and ASR.

[^5]: **Automatic Speech Recognition (ASR) or Speech-to-Text (STT):** ASR converts spoken language into text. It can be used in voice assistants and transcription services, for example. You may hear both terms, ASR and STT, and they are almost the same, but STT can be a semi-manual process, while ASR is fully automated. ASR is like a smart listener that turns spoken words into written text on your device. Although it creates the text automatically, its output often needs human review, because it can make mistakes, especially in languages that are less used in the digital space.

[^6]: **Model:** A computer-based system that has made use of data to learn patterns. It can make predictions or generate language. In speech technology, models use voice data to learn to recognize speech (converting audio to text) or to create speech (converting text to audio).

[^7]: **Voice technology:** Language technology that processes or generates spoken words.

[^8]: **TWB Community:** A global community of over 100,000 linguists. They donate their time and skills to translate information for millions of people around the world. They help people to get the information they need and want.

[^9]: **Text-to-Speech (TTS):** Technology that converts written text into spoken language. It can also generate audio speech. You can find TTS in accessibility tools like screen readers and in virtual assistants. It brings written texts to life through spoken words. If your phone reads out the messages you get, or an audiobook tells you your favorite story, this is because of TTS technology.

[^10]: **Community members:** People at the center of a project who contribute by recording and validating data. In this context, community refers to their shared interest in contributing to data collection and the languages they speak. Community members are well positioned to provide feedback on linguistic aspects and offer input on the project's direction and needs.

[^11]: **Language Leads (LLs):** Language experts with lots of experience. They can do spot checks to ensure quality and can give feedback.

[^12]: **Rating:** An individual quality assessment by a contributor. They either approve or reject a piece of language data (recording in the case of voice data) and base their rating on the pass/fail guidelines. Several ratings may be needed to decide whether this data has passed or failed for quality.

