
# 8. Access to datasets and models

#### <mark style="color:blue;">Chapter 8 overview:</mark>

{% hint style="info" %}
This section is for researchers, developers, humanitarian technology practitioners, and program managers who need to access and use TWB Voice datasets and models.

We cover:

* how to use Hugging Face as our main platform for sharing data
* how to understand different types of licensing
* how to access open and gated datasets
* how to work with pre-trained speech AI models.

Most of this section requires little technical expertise. It focuses on practical guidance to help you find and download resources. If you plan to use these datasets and models in your own projects, some basic programming knowledge will help, particularly for accessing data programmatically through Python libraries.
{% endhint %}

[TWB Voice](#user-content-fn-1)[^1] creates speech datasets[^2] and AI models[^3] to support the wider research and humanitarian technology communities. In this section, we explain how to access and use these resources through our main data sharing platform, [Hugging Face](#user-content-fn-4)[^4].

[^1]: **TWB Voice:** A platform for collecting voice data, developed and owned by CLEAR Global. Users can contribute voice recordings to active data collection projects in TWB Voice by [signing up to the TWB Community](https://translatorswithoutborders.org/join-the-twb-community/). The main goal of TWB Voice is to help develop voice technology for speakers of marginalized languages, for example by creating the voice datasets needed to build language models for text-to-speech (TTS) and automatic speech recognition (ASR).

[^2]: **Dataset:** A collection of information that has been organized for use. A **voice dataset** is a collection of voice recordings paired with transcriptions and with additional information (metadata), such as the gender and age of the person recording. The metadata shows how the dataset is constructed and helps to avoid bias. Voice datasets are used in research and for training or improving voice models.

[^3]: **Model:** A computer-based system that has learned patterns from data. It can make predictions or generate language. In speech technology, models use voice data to learn to recognize speech (converting audio to text) or to create speech (converting text to audio).

[^4]: **Hugging Face:** An online platform where researchers and developers share AI models, datasets, and applications.