2.2 Data imbalance & low-resource languages

There is a serious imbalance in the voice data available for different languages. This is a big problem for the development of speech technology:

High-resource languages

With languages like English, Mandarin, Spanish, and French, there are thousands of hours of voice data available. This results in very accurate speech technologies.

Mid-resource languages

With languages that have moderate populations of speakers or that are important in economic terms, there may be some commercial speech technology support. However, they still don’t have the amount of voice data that is available for high-resource languages. Examples include Czech, Hindi, Polish, Tamil, Thai and Swahili.

Low-resource languages

For most of the world's 7,000+ languages, there is very little or no voice data available to develop technology. This is true even though millions of people speak these languages across the world. Kannada, Sindhi, and Tajik, for example, are “under-resourced” languages. Gondi, Khasi, and Santhali are “no-resource” languages as there is almost no data available for them.

The result of this imbalance is a divide in access: speakers of low-resource languages cannot make use of voice technologies. Many of the languages spoken in areas affected by disaster or conflict are low-resource languages.

Some of the results of this imbalance:

  • communities are not able to use digital services

  • people cannot access key information during crises

  • there is very little documentation of languages in danger of dying out

  • global inequality is made worse
