4.1 Where can I find sentences for my language/domain?

Understanding domains and content sources

What do we mean by "domain" when collecting voice data?

Domain means the subject area or context in which the language is used. For example, healthcare, farming, and legal services are all different domains. They each have their own terminology and language patterns. If you focus on the relevant domain, you can be sure that:

  1. the resulting voice technology will be able to handle terminology from your use case and will be accurate,

  2. the vocabulary used reflects how people actually speak in real-life situations,

  3. the models will be well suited to the intended use, so their impact will be stronger.

Collecting data in a domain-specific way is especially important for languages that are not widely spoken, because resources for these languages may be limited.

Digital versus non-digital sources

  • Digital sources: websites, social media, existing collections of digital texts, and content that has already been digitized

  • Non-digital sources: printed materials, books, leaflets, radio transcripts, community knowledge that you could digitize

It is far more efficient to use digitized text, but there are not many digital sources for languages that are not widely spoken. You may need to convert non-digital sources if you want to record authentic use of language in your community. This will mean extra work.

Repurposing existing content

One of the most efficient methods is to repurpose content that you've already developed or have access to:

  • Previous projects: past publications, translations, or localization projects often contain sentences in your language that have been checked for quality

  • Communications materials: materials created for community outreach, for education, or for information campaigns

  • Content generated by users: questions, answers, and feedback that you have collected through your services (check permissions)

Online sources

There are various sources of sentences you could use on the internet. Availability and permissions to use them vary a lot by language and domain:

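  • Wikipedia and other wikis: Articles in your language can provide informative sentences with a good structure. Wikipedia text is published under a Creative Commons Attribution-ShareAlike (CC BY-SA) license, so you’re usually allowed to use content from there as prompts, provided you credit the source and follow the license terms.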

  • News websites: Local news outlets often publish content in regional languages. But most of these websites are copyrighted and you can’t use content without their permission. In such cases, you could contact the publisher to explain your project and ask for permission. There are also some exceptions. For example, VOA News publishes their content in the “public domain”.

  • Government websites: Official communications, records of discussions in parliament, health information, and public service announcements.

  • Religious texts: Materials that have been widely translated, like scriptures, are available in many languages. But the language in these texts is often old-fashioned or formal. It may not be the same as the way people speak nowadays. You will also need to use more modern sources if you want to reflect everyday language use.

  • Common Voice: Mozilla's Common Voice project has collections of sentences for many different languages.

  • Language learning resources: Textbooks and language courses contain practical, everyday sentences.

If you use online sources, always check the licensing terms and say where the content came from.

Community-based collection

For languages with limited written resources, it is very helpful to work directly with community members or linguists:

  • Community workshops: Bring together speakers to generate sentences about their daily lives and local topics

  • Storytelling sessions: Record and transcribe traditional stories (remember that transcription means extra work). You can then process these recordings to increase the amount of voice data you have.

  • Surveys and interviews: Ask community members about specific topics that are relevant to your domain.

  • Expert consultation: Work with linguists or experts in your domain to generate terminology and phrases that relate to your domain.

Remember that generating new content means extra work. But you may need to do this if your language has only limited resources.

Domain-specific resources

If your domain is specialized, there are some targeted resources you can use:

  • Healthcare: medical pamphlets, public health announcements, transcripts of conversations between doctors and patients (get permission and take out any personal information)

  • Agriculture: farming guides, weather advice, instructions on crop management

  • Education: textbooks, lesson plans, transcripts of educational videos

  • Legal/administrative: public legal documents, forms, guidelines on procedures

  • Technology: user manuals, scripts for tech support, education materials for technology subjects

Make sure your collection is well-balanced

Try to get a wide variety in your sentence collection:

  • Varying length: include short, medium, and longer sentences. Keep in mind limitations on recording time. Aim for sentences that take no more than 15-20 seconds to read aloud.

  • Sentence types: statements, questions, exclamations, and commands

  • Grammar structure: different tenses, types of sentence, and levels of complexity

  • Range of different speakers: content from different age groups, genders, and backgrounds

  • Variety of situations: formal and informal language, different social settings, sounds that people make (like “Oh!” and “Aha” in English)
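A quick script can help you check the balance of a sentence list before recording. The sketch below is only an illustration: it guesses the sentence type from the final punctuation mark and estimates reading time from the word count. The reading speed (`WORDS_PER_SECOND`) is an assumed placeholder, not a measured value, and the function names are hypothetical. Punctuation-based typing will not work for every writing system, and it cannot tell statements from commands.

```python
from collections import Counter

# Assumption: average reading speed in words per second. Measure this for
# your own language and speakers; 2.5 is only a placeholder.
WORDS_PER_SECOND = 2.5
MAX_SECONDS = 20  # upper limit suggested in the list above


def sentence_type(sentence: str) -> str:
    """Guess the sentence type from its final punctuation mark."""
    last = sentence.strip()[-1:]
    return {"?": "question", "!": "exclamation"}.get(last, "statement_or_command")


def estimated_seconds(sentence: str) -> float:
    """Rough reading time, based on word count and the assumed speed."""
    return len(sentence.split()) / WORDS_PER_SECOND


def summarize(sentences):
    """Return a count per sentence type and the sentences that look too long."""
    types = Counter(sentence_type(s) for s in sentences)
    too_long = [s for s in sentences if estimated_seconds(s) > MAX_SECONDS]
    return types, too_long


if __name__ == "__main__":
    sample = ["Where is the clinic?", "Oh!", "Water the crops in the morning."]
    types, too_long = summarize(sample)
    print(dict(types))
    print(len(too_long), "sentence(s) over the time limit")
```

If one type dominates the counts, or many sentences exceed the limit, that is a sign to collect more varied material or split long sentences.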

When organizing your collection, use a clear system:

  • Create spreadsheets with separate columns for original sentences and normalized versions (explained later) and notes, where necessary

  • Make a note of the sources and authors of your sentences so you can be sure to credit them

  • Track metadata like domain categories, sentence length, and completion status

  • Set clear targets for how many sentences you need in each category
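If you prefer to keep this tracking sheet as a CSV file that any spreadsheet program can open, the points above could be sketched as follows. The column names, the status value, and the sample sentence are all made up for illustration; adapt them to your own project.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class SentenceRecord:
    # Column names are illustrative, not a required schema.
    original: str
    normalized: str
    source: str
    author: str
    domain: str
    status: str = "to_record"
    notes: str = ""


def write_sheet(records, out):
    """Write records as CSV with a header row, adding a word-count column."""
    columns = [f.name for f in fields(SentenceRecord)] + ["word_count"]
    writer = csv.DictWriter(out, fieldnames=columns)
    writer.writeheader()
    for rec in records:
        row = asdict(rec)
        row["word_count"] = len(rec.normalized.split())
        writer.writerow(row)


def count_by_domain(records):
    """Count sentences per domain, to compare against your targets."""
    counts = {}
    for rec in records:
        counts[rec.domain] = counts.get(rec.domain, 0) + 1
    return counts


if __name__ == "__main__":
    records = [
        SentenceRecord("Take 2 tablets daily.", "Take two tablets daily.",
                       "example leaflet", "unknown", "healthcare"),
    ]
    targets = {"healthcare": 500}  # hypothetical target per category
    buffer = io.StringIO()  # replace with open("sentences.csv", "w", newline="")
    write_sheet(records, buffer)
    print(buffer.getvalue())
    for domain, target in targets.items():
        print(domain, "remaining:", target - count_by_domain(records).get(domain, 0))
```

Keeping the source and author next to each sentence makes it easy to credit contributors later and to remove a whole source if its permissions change.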

Key tips: How many sentences do I need?

The ideal number of sentences depends on the size, resources, and goals of your project. For reference:

  • 1,000 sentences with an average length of 7 seconds will give you about 2 hours of recorded speech.

  • For ASR development, the smallest usable dataset would be at least 10 hours (roughly 10,000-20,000 sentences). More complex models may need 100+ hours.

  • TTS systems can work at a basic level with fewer sentences (5,000-10,000) if they're recorded by a single professional speaker in controlled conditions.
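The arithmetic behind these estimates is simple enough to script, which makes it easy to plan with your own numbers. The sketch below assumes each sentence is recorded once; the function names are hypothetical.

```python
import math

SECONDS_PER_HOUR = 3600


def estimate_hours(num_sentences: int, avg_seconds: float) -> float:
    """Hours of speech if each sentence is recorded once."""
    return num_sentences * avg_seconds / SECONDS_PER_HOUR


def sentences_needed(target_hours: float, avg_seconds: float) -> int:
    """Smallest number of sentences that reaches target_hours."""
    return math.ceil(target_hours * SECONDS_PER_HOUR / avg_seconds)


if __name__ == "__main__":
    print(round(estimate_hours(1000, 7), 2))  # 1.94, i.e. about 2 hours
    print(sentences_needed(10, 7))            # 5143
```

The result depends heavily on the average sentence length. At 7 seconds per sentence, 10 hours takes a little over 5,000 sentences, while a range of 10,000-20,000 sentences for 10 hours corresponds to much shorter recordings of roughly 2 to 4 seconds each.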

If you have access to plenty of sources of digital text data, collect as many sentences as you can. If you are collecting texts manually, there will be practical limitations so you will need to plan carefully.

Ideally, each sentence should be recorded once by a single speaker. But if you only have a limited collection of sentences, you can get multiple speakers to record the same sentences. This approach is helpful when you are having problems creating a wide variety of content but you still need a lot of voice data. It depends on having a large number of contributors.
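When several speakers record the same sentences, it helps to plan who reads what so that no one records the same sentence twice. One possible approach, shown here only as a sketch, is a simple round-robin assignment. The function name and parameters are hypothetical, and speaker names are assumed to be unique.

```python
def assign_sentences(sentences, speakers, recordings_per_sentence=1):
    """Give each sentence to `recordings_per_sentence` different speakers."""
    if recordings_per_sentence > len(speakers):
        raise ValueError("need at least as many speakers as recordings per sentence")
    plan = {speaker: [] for speaker in speakers}
    slot = 0
    for sentence in sentences:
        # Consecutive slots map to different speakers, so one sentence
        # never goes to the same speaker twice.
        for _ in range(recordings_per_sentence):
            plan[speakers[slot % len(speakers)]].append(sentence)
            slot += 1
    return plan


if __name__ == "__main__":
    plan = assign_sentences(["a", "b", "c"], ["s1", "s2", "s3"],
                            recordings_per_sentence=2)
    for speaker, items in plan.items():
        print(speaker, items)
```

With three sentences, three speakers, and two recordings per sentence, each speaker ends up with two sentences and the total number of recordings is six.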
