7.4 Downloading data for publishing: criteria for recordings and fields included

Criteria to include a recording

There are two approaches to publishing data. In the first approach, only recordings that have:

  • gained approval through the review process

  • passed quality checks for content and technical issues

  • got the right consent

…can be included in datasets for the public to download.

The second approach is to publish all recordings that have the right consent, split into “approved”, “rejected”, and “pending review” subsets.

For the current project, we have followed the second approach.
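The two approaches can be sketched as two small filters. This is a minimal illustration only, not the project's actual pipeline: the `Recording` class, its field names, and both function names are assumptions made up for the example.

```python
from dataclasses import dataclass

# Hypothetical record shape; the field names are assumptions for illustration.
@dataclass
class Recording:
    recording_id: str
    has_consent: bool
    review_status: str  # "approved", "rejected", or "pending review"
    passed_quality_checks: bool

SUBSETS = ("approved", "rejected", "pending review")


def approved_only(recordings):
    """First approach: publish only approved, quality-checked, consented recordings."""
    return [
        rec
        for rec in recordings
        if rec.has_consent
        and rec.passed_quality_checks
        and rec.review_status == "approved"
    ]


def split_for_publication(recordings):
    """Second approach: publish every consented recording, grouped by review status."""
    subsets = {name: [] for name in SUBSETS}
    for rec in recordings:
        if not rec.has_consent:
            continue  # without consent a recording is never published
        if rec.review_status not in subsets:
            raise ValueError(f"unknown review status: {rec.review_status!r}")
        subsets[rec.review_status].append(rec)
    return subsets
```

In both sketches consent is the one condition that is never relaxed; the second approach only changes what happens to consented recordings that were rejected or are still waiting for review.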

Published metadata fields

Each recording includes anonymized metadata that lists: language, language variant, year of birth, country, gender, and education level.

Public releases do not contain anything that could identify a particular person (names, emails, usernames).
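One way to enforce this is an allowlist: copy only the published fields and drop everything else, so a newly added internal field can never leak by accident. The sketch below assumes records are plain dictionaries; the key names and the `to_public_metadata` function are hypothetical and not taken from the project's code.

```python
# Hypothetical key names for the six published metadata fields.
PUBLIC_FIELDS = (
    "language",
    "language_variant",
    "year_of_birth",
    "country",
    "gender",
    "education_level",
)


def to_public_metadata(record: dict) -> dict:
    """Return only the allowlisted fields; missing fields are set to None."""
    return {field: record.get(field) for field in PUBLIC_FIELDS}


# Example: identifying keys such as "name" or "email" are not copied.
internal_record = {
    "name": "Example Person",
    "email": "person@example.org",
    "language": "Swahili",
    "country": "KE",
    "year_of_birth": 1990,
}
public_record = to_public_metadata(internal_record)
```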

Dataset access and licensing

  • All public datasets are hosted in gated repositories on Hugging Face. Gating ensures that downloaders accept the Dataset Access Terms before they get access. It also allows us to track downloads.

  • We act on deletion requests promptly. We notify people who have downloaded the data when a deletion affects a shared dataset.

  • We release datasets and models under a custom license: downloaders must first accept the Dataset Access Terms, after which the dataset is provided under the terms of CC BY-NC 4.0. This allows non-commercial use with proper attribution.
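The deletion step above can be illustrated with a small check of which shared datasets a deletion touches and who should be told. This is a sketch under stated assumptions: the data structures (a mapping from release name to recording IDs, and a mapping from release name to downloaders) and both function names are hypothetical.

```python
def affected_releases(deleted_ids, releases):
    """Map each release that contains a deleted recording to the IDs it must drop.

    `releases` is a hypothetical mapping: release name -> iterable of recording IDs.
    """
    deleted = set(deleted_ids)
    affected = {}
    for name, ids in releases.items():
        hit = set(ids) & deleted
        if hit:
            affected[name] = sorted(hit)
    return affected


def downloaders_to_notify(affected, downloaders_by_release):
    """Collect everyone who downloaded at least one affected release."""
    to_notify = set()
    for release in affected:
        to_notify.update(downloaders_by_release.get(release, ()))
    return sorted(to_notify)
```

Releases that contain none of the deleted recordings are left out of the result, so their downloaders are not contacted unnecessarily.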
