3.1 Designing your project

A project to collect voice data often involves the following stages:

1. Defining the scope and setup

Define your project’s goals, target languages and dialects. Plan how you will find and support contributors. Decide on tools and platforms. Set timelines.

2. Preparing a collection of sentences (text corpus)

If your project uses prompts, find or prepare enough sentences for contributors to read aloud on the data collection platform.

3. Data collection and checking quality (validation)

Find contributors whose voices you can record. Before you start collecting voice data, decide how you will divide the prompts among contributors and set rules for recording (for example, recording duration and quality standards). Define your workflow for validating recordings.

4. Data processing

Download the recordings and get them ready for publication (for example, formatting, cleaning).

5. Publication and sharing

Prepare the dataset for release. Make sure it meets licensing and privacy requirements. Provide documentation for users.

You can find more details on these steps in later chapters of this playbook.

You will also need to build monitoring and evaluation into the project from the start: track your progress, gather feedback from contributors, and check data quality. After you have published the data, evaluate how it is used and what impact it has.

🔍 Case study: Defining the scope of CLEAR Global’s pilot project to collect voice data in Hausa, Kanuri, and Shuwa Arabic

The first projects in TWB Voice collected voice data in Hausa, Kanuri, and Shuwa Arabic. The aim was to build open-source voice datasets in these languages.

The project’s goals were:

  • to increase the amount of voice data available in Hausa, Kanuri, and Shuwa Arabic in order to train Automatic Speech Recognition (ASR) models,

  • to support the development of voice technology for specific use cases, such as Hajiya, a text-based chatbot that CLEAR Global developed to help communities affected by crises in northeast Nigeria,

  • to work with communities to respond to user feedback in the TWB Voice platform and grow networks of contributors.

The project involved volunteer linguists from the TWB Community in northeast Nigeria.

Defining the scope of the pilot

The pilot had a 13-month timeline, with targets to collect:

  • 50-100 hours of ASR data per language

  • 10-20 hours of single-speaker data for Text-to-Speech (TTS) in Hausa and Kanuri

For the ASR datasets, the aim was to collect a wide variety of recordings and make the datasets openly available to others working to improve voice technology in these languages. The Hajiya chatbot was a useful example, but the priority here was to create datasets for broad use that included many dialects and speakers from different backgrounds. The speakers gave details of their dialect, age, gender, and education so that future users would understand the range of people included in the dataset.

In contrast, the TTS data collection was more focused, with the Hajiya chatbot as a specific use case. During the chatbot’s development, community members commented that its text-only design made it hard to use for people with limited reading and writing skills. They suggested the kind of voice that would make the chatbot seem friendly and easy to use. The project chose the name Hajiya because it represents a respected, middle-aged woman and role model in Yobe, Borno, and Adamawa states of Nigeria. For the Hausa and Kanuri TTS datasets, we chose voice artists who matched this persona.

Dealing with gaps in the body of available texts

One major challenge in defining the scope of the project was finding enough suitable texts for prompts. We found some sentences for Hausa and Kanuri during the project setup stage. However, we found very few for Shuwa Arabic: only about 1,500 usable sentences with open licenses or owned by CLEAR Global. For all three languages, linguists created new prompts or adapted existing ones so that we had enough material relevant to the cultural context. Because the number of prompts was limited, we set up projects in TWB Voice so that up to three people could record each prompt.
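As a rough illustration of that setup, the sketch below shows one way to pick a prompt for a contributor while capping each prompt at three recordings. The cap of three comes from the pilot; the function name, data structures, and assignment strategy are hypothetical and are not how the TWB Voice platform is implemented.

```python
MAX_RECORDINGS_PER_PROMPT = 3  # the pilot allowed up to three people to record each prompt

def next_prompt_for(contributor_id: str,
                    prompts: list[str],
                    recordings: dict[str, set[str]]) -> str | None:
    """Return a prompt this contributor has not recorded yet and that still has
    fewer than three recordings overall, or None if nothing is left.
    `recordings` maps each prompt to the set of contributor ids who recorded it."""
    for prompt in prompts:
        recorded_by = recordings.get(prompt, set())
        if contributor_id not in recorded_by and len(recorded_by) < MAX_RECORDINGS_PER_PROMPT:
            return prompt
    return None

# Example: prompt "p1" already has three recordings, so contributor "c4" is given "p2".
recordings = {"p1": {"c1", "c2", "c3"}}
assert next_prompt_for("c4", ["p1", "p2"], recordings) == "p2"
```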

Designing the workflow for quality checking (validation)

We had only a few hundred contributors but needed to be confident in the quality of the data, so we designed a tiered system to check the recordings (see the sketch after this list):

  • Contributors could listen back to and self-check their recordings before submitting them.

  • Contributors who had made at least one hour of voice recordings and had an approval rate of 80% or higher were allowed to rate other contributors’ recordings (tier 1).

  • Language Leads were given “trusted linguist” status, which meant we gave their ratings a higher weighting (tier 2).
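As a minimal sketch, the rules above could be expressed roughly as follows. The one-hour and 80% thresholds and the two tiers come from the project; the exact weights, the 50% acceptance cut-off, and the data structures are illustrative assumptions rather than the actual TWB Voice implementation.

```python
from dataclasses import dataclass

# Assumed weights: the project only states that trusted linguists' ratings
# carry a higher weighting, not the exact numbers used.
TIER_WEIGHTS = {1: 1.0, 2: 2.0}

@dataclass
class Contributor:
    hours_recorded: float   # total hours of voice recordings submitted
    approval_rate: float    # share of the contributor's recordings approved (0-1)
    is_language_lead: bool  # Language Leads become "trusted linguists"

def reviewer_tier(c: Contributor) -> int | None:
    """Return the tier at which a contributor may rate others, or None if not yet eligible."""
    if c.is_language_lead:
        return 2  # tier 2: trusted linguist, ratings carry a higher weight
    if c.hours_recorded >= 1.0 and c.approval_rate >= 0.80:
        return 1  # tier 1: may rate other contributors' recordings
    return None

def recording_is_approved(ratings: list[tuple[int, bool]], cutoff: float = 0.5) -> bool:
    """Weighted vote over (tier, approved?) ratings; the 0.5 cut-off is an assumption."""
    total = sum(TIER_WEIGHTS[tier] for tier, _ in ratings)
    approved = sum(TIER_WEIGHTS[tier] for tier, ok in ratings if ok)
    return total > 0 and approved / total >= cutoff

# Example: one tier-1 approval, one tier-1 rejection, one trusted-linguist approval.
assert recording_is_approved([(1, True), (1, False), (2, True)])  # 3.0 of 4.0 weighted votes approve
```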
