8.1 Sharing data on Hugging Face and platform access

What is Hugging Face?

Hugging Face is a popular online platform where researchers and developers share AI models, datasets, and applications. It’s like a library for artificial intelligence resources that many people can use. The platform makes it easy to find, download, and use machine learning models and datasets. These are created by organizations and individuals worldwide.

Accessing TWB Voice datasets

You can find our datasets, models and demos on the CLEAR Global organization page: huggingface.co/CLEAR-Global

We publish each dataset with detailed metadata, including the number of people taking part, total hours of recordings, and license information. We usually release our datasets under Creative Commons or OpenRAIL licenses. These licenses allow broad use. They also require that people are credited properly, and they provide guidelines on use.

We publish each model with information on its architecture and on how it performed on a test partition of the data we publish.

You can download the data through the web interface. You can also use the Hugging Face datasets Python library to access the data programmatically in your projects.
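As a rough illustration of programmatic access, here is a minimal sketch using the load_dataset function from the datasets library. The dataset name "example-dataset" and the "train" split are placeholders we made up for this example. Check the dataset page for the real name and the splits it offers.

```python
from itertools import islice

ORG = "CLEAR-Global"  # organization name from the page address above


def dataset_repo_id(name: str, org: str = ORG) -> str:
    """Build the '<organization>/<dataset>' identifier used on the Hub."""
    return f"{org}/{name}"


def stream_examples(repo_id: str, split: str = "train", limit: int = 3):
    """Yield the first `limit` examples without downloading the full dataset.

    Assumption: the dataset has a split called "train".
    """
    # Imported here so the helper above works even if the library is missing.
    from datasets import load_dataset  # install with: pip install datasets

    dataset = load_dataset(repo_id, split=split, streaming=True)
    yield from islice(dataset, limit)


# Example call (needs network access; "example-dataset" is a hypothetical name):
# for example in stream_examples(dataset_repo_id("example-dataset")):
#     print(sorted(example))  # column names of each example
```

Streaming reads examples one at a time, which is useful for a first look at large audio datasets. Depending on your version of the datasets library, decoding audio columns may need extra packages.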

Important limitations and things to consider

The models we share on Hugging Face are mostly research demonstrations and evaluation tools. They have been tested on specific datasets and may not work well in all real-world settings. These models are not guaranteed for production use. You would need to test them thoroughly before using them in any active humanitarian technology system.

When using our models, consider the following:

Performance may vary a lot with different accents, recording conditions, or speaking styles.
Models are trained on limited data, so they may not represent the full range of speakers in a language community.
For any practical application, the models will need regular testing and adjustment.

We suggest that users see these models as starting points for further development. They are not complete solutions that could be put to immediate use in critical humanitarian settings.
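One way to approach the regular testing mentioned above is to measure errors separately for each accent or recording condition. The sketch below assumes you are testing a speech recognition model and uses word error rate as the measure. Both are our assumptions for this example, as are the group labels. Your own evaluation may need a different measure.

```python
from collections import defaultdict


def word_errors(reference: str, hypothesis: str):
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples):
    """Compute word error rate per group.

    `samples` is an iterable of (group, reference, hypothesis) tuples,
    where `group` is a label you choose, such as an accent or a
    recording condition.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, reference, hypothesis in samples:
        err, count = word_errors(reference, hypothesis)
        errors[group] += err
        words[group] += count
    # Groups with no reference words are skipped to avoid dividing by zero.
    return {g: errors[g] / words[g] for g in words if words[g] > 0}
```

A large gap between groups suggests the model needs more data or adjustment for the groups with the higher error rate.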

Gated datasets

Some TWB Voice datasets are configured as "gated" on Hugging Face. This means you need to log in to the platform and request access before you can download them. We choose to gate some datasets so we can keep track of who is using our data. It also helps us make sure the data is used responsibly. This helps us understand the impact of our work. It also keeps us accountable to the communities who record their voices for us. When you request access to a gated dataset, you'll need to give some basic information about your intended use. Access is then usually granted automatically.
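Once access has been granted on the dataset page, programmatic downloads also need to prove who you are. Here is a minimal sketch, assuming you have created an access token in your Hugging Face account settings and stored it in the HF_TOKEN environment variable. The function names and the "train" split are our own placeholders, and the token argument of load_dataset is available in recent versions of the datasets library.

```python
import os


def get_hf_token(env=os.environ) -> str:
    """Read the Hugging Face access token from the HF_TOKEN variable."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Create an access token in your "
            "Hugging Face account settings and export it first."
        )
    return token


def load_gated_dataset(repo_id: str, split: str = "train"):
    """Open a gated dataset in streaming mode using your access token.

    Assumption: access to `repo_id` was already requested and granted
    through the dataset page on the website.
    """
    # Imported here so get_hf_token works even if the library is missing.
    from datasets import load_dataset  # install with: pip install datasets

    return load_dataset(repo_id, split=split, streaming=True,
                        token=get_hf_token())
```

Keeping the token in an environment variable means it does not end up in your scripts or notebooks.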


