2.2 Data imbalance & low-resource languages

The amount of voice data available varies enormously from one language to another. This imbalance is a major obstacle to the development of speech technology. Languages fall roughly into three groups:

High-resource languages

For languages such as English, Mandarin, Spanish, and French, thousands of hours of voice data are available. As a result, speech technologies for these languages are very accurate.

Mid-resource languages

Languages that have moderate numbers of speakers, or that are economically important, may have some commercial speech technology support. However, far less voice data is available for them than for high-resource languages. Examples include Czech, Hindi, Polish, Tamil, Thai, and Swahili.

Low-resource languages

For most of the world's 7,000+ languages, there is very little or no voice data available for developing technology. This is true even though millions of people around the world speak these languages. Kannada, Sindhi, and Tajik, for example, are “under-resourced” languages. Gondi, Khasi, and Santhali are “no-resource” languages because almost no data is available for them.

The result of this imbalance is a "digital language divide". Speakers of low-resource languages cannot make use of voice technologies. Many of the languages spoken in areas affected by disaster or conflict are low-resource languages.

The consequences include:

communities are not able to use digital services
people cannot access key information during crises
there is very little documentation of languages in danger of dying out
global inequality is made worse


