7.4 Downloading data for publishing: criteria for recordings and fields included
Criteria to include a recording
There are two approaches to publishing data. In the first approach, a recording can be included in datasets for public download only if it meets all of the following criteria:
it has gained approval through the review process
it has passed quality checks for content and technical issues
it has the appropriate consent
The second approach is to publish all recordings that have the appropriate consent, split into three subsets: “approved”, “rejected”, and “pending review”.
For the current project, we have followed the second approach.
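As an illustration only, the second approach could be sketched in Python as follows. The field names `has_consent` and `review_status`, the function `split_for_publishing`, and the sample records are hypothetical. They are assumptions made for this sketch, not the project's actual pipeline.

```python
# Subset names as given in the text; how they are stored internally is an assumption.
SUBSETS = ("approved", "rejected", "pending review")


def split_for_publishing(recordings):
    """Group consented recordings into subsets by review status.

    Recordings without consent are never published, whatever their status.
    """
    subsets = {name: [] for name in SUBSETS}
    for rec in recordings:
        # Consent is the only hard requirement under the second approach.
        if not rec["has_consent"]:
            continue
        status = rec["review_status"]
        if status not in subsets:
            raise ValueError(f"unknown review status: {status!r}")
        subsets[status].append(rec)
    return subsets


# Hypothetical sample records, for illustration only.
recordings = [
    {"id": "rec-1", "has_consent": True, "review_status": "approved"},
    {"id": "rec-2", "has_consent": True, "review_status": "rejected"},
    {"id": "rec-3", "has_consent": True, "review_status": "pending review"},
    {"id": "rec-4", "has_consent": False, "review_status": "approved"},
]

subsets = split_for_publishing(recordings)
for name in SUBSETS:
    print(name, [rec["id"] for rec in subsets[name]])
```

In this sketch, `rec-4` is left out of every subset because it lacks consent, even though it was approved in review.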
Published metadata fields
Each recording includes anonymized metadata that lists: language, language variant, year of birth, country, gender, and education level.
Dataset access and licensing
All public datasets are hosted in gated repositories on Hugging Face. Gating ensures that users accept the Dataset Access Terms before they get access, and it allows us to track downloads.
We handle deletion requests promptly, and we notify people who have downloaded the data when a deletion affects a shared dataset.
We release datasets and models under a custom license: downloaders must first accept the Dataset Access Terms, after which the dataset is provided under the terms of CC BY-NC 4.0. This allows non-commercial use with proper attribution.