1.2 TWB Voice: CLEAR Global’s platform for voice data collection
TWB Voice is a platform for collecting voice data. It was set up by CLEAR Global in 2025 to collect and validate voice recordings from many contributors. These recordings help build voice technologies such as Automatic Speech Recognition (ASR) and Text-to-Speech (TTS).
You can access TWB Voice at twbvoice.org. You will need to be a member of the Translators without Borders Community or sign up for an account. You can contribute to active TWB Voice projects in your language.
The aim of TWB Voice is to:
support ethical, community-led voice data collection
create high-quality, diverse datasets that will help to develop voice technologies that reflect how people really speak
help people to access tools and information in their own language
On the platform, users can:
record voice clips
rate the voice recordings of others
transcribe audio and rate the transcriptions of others
TWB Voice builds on more than 10 years of experience working with the Translators without Borders Community, a global network of more than 100,000 volunteer linguists. TWB Voice helps this community contribute their voices to the development of voice technologies in their own languages.
Interested in working with us or finding out more about TWB Voice?
We are looking for partners and voice data contributors to help TWB Voice to grow. We want to add more languages and improve the platform.
Get in touch to find out how we can work together.
Data collection pilot in Hausa, Kanuri, and Shuwa Arabic
In TWB Voice’s first data collection pilot, we collected conversational audio in Hausa, Kanuri, and Shuwa Arabic. CLEAR Global is already working with linguists and local partners in northeastern Nigeria to provide language services, do research, and develop voice technologies. The pilot aimed to create 50–100 hours of speech data per language. In practice, we collected:
~68 hours of speech data in Hausa, of which ~58 hours are for Automatic Speech Recognition and 10 hours for Text-to-Speech
~62 hours of speech data in Kanuri, of which ~52 hours are for Automatic Speech Recognition and 10 hours for Text-to-Speech
~15 hours of speech data in Shuwa Arabic for Automatic Speech Recognition
This data is available as an open-source dataset for Automatic Speech Recognition and as samples for Text-to-Speech.
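If you want to check the pilot figures against the 50–100 hour target yourself, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the hour values are the approximate figures reported above, the names `PILOT_HOURS`, `TARGET_MIN`, `TARGET_MAX`, and `summarize` are made up for this example, and the Text-to-Speech value of 0 for Shuwa Arabic is an assumption based on no Text-to-Speech hours being reported for that language.

```python
# Approximate hours reported for the pilot, split by intended use.
# The 0 for Shuwa Arabic Text-to-Speech is an assumption: the pilot
# summary reports only Automatic Speech Recognition data for it.
PILOT_HOURS = {
    "Hausa": {"asr": 58, "tts": 10},
    "Kanuri": {"asr": 52, "tts": 10},
    "Shuwa Arabic": {"asr": 15, "tts": 0},
}

# The pilot aimed for 50-100 hours of speech data per language.
TARGET_MIN, TARGET_MAX = 50, 100


def summarize(hours_by_language):
    """Return total hours per language and whether the total is in the target range."""
    summary = {}
    for language, hours in hours_by_language.items():
        total = hours["asr"] + hours["tts"]
        summary[language] = {
            "total": total,
            "in_target_range": TARGET_MIN <= total <= TARGET_MAX,
        }
    return summary


if __name__ == "__main__":
    for language, result in summarize(PILOT_HOURS).items():
        status = "within target" if result["in_target_range"] else "outside target"
        print(f"{language}: ~{result['total']} hours ({status})")
```

With these figures, Hausa and Kanuri fall within the target range, while Shuwa Arabic falls below it.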
CLEAR Global and partners also used this data to develop:
Automatic Speech Recognition models for Hausa and Kanuri
Text-to-Speech models for Hausa and Kanuri.
You can find the datasets and models published as part of this pilot in this repository on CLEAR Global’s Hugging Face page.
This pilot is a key case study and we refer to it throughout the playbook.