7.1 Data storage

Data flow

Currently, all users have to first sign up to TWB Platform. This is a platform that TWB developed for our community of linguists to use. They use it to take on translation jobs and for other language services too. Users have to accept the terms and conditions and privacy policy.

Once they have agreed and signed up to TWB Platform, contributors can also log into TWB Voice. They can use the same login credentials (email and password).

When they log in for the first time, they will need to accept some extra terms. These apply to the collection of voice data.

New users will also have to provide some further information. This information is relevant to the collection of voice data: gender, year of birth, education level, and language variant.

All of the data we collect is stored securely across multiple CLEAR Global databases.

Once they have accepted the terms and provided this information, users can start doing TWB Voice tasks. The tasks they can do will depend on their level of access.
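The steps above can be pictured as a simple checklist that a contributor completes before tasks become available. The sketch below is only an illustration: the `Contributor` class, its field names, and the `missing_steps` helper are assumptions made for this example, not the actual TWB Voice implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

# Information requested at first login, as listed in this section.
REQUIRED_FIELDS = ("gender", "year_of_birth", "education_level", "language_variant")


@dataclass
class Contributor:
    """Hypothetical view of a contributor's onboarding state."""
    accepted_platform_terms: bool = False  # TWB Platform terms and privacy policy
    accepted_voice_terms: bool = False     # extra terms for voice data collection
    gender: Optional[str] = None
    year_of_birth: Optional[int] = None
    education_level: Optional[str] = None
    language_variant: Optional[str] = None


def missing_steps(contributor: Contributor) -> List[str]:
    """Return the onboarding steps that are still open, in order."""
    steps = []
    if not contributor.accepted_platform_terms:
        steps.append("accept TWB Platform terms and privacy policy")
    if not contributor.accepted_voice_terms:
        steps.append("accept voice data terms")
    for field_name in REQUIRED_FIELDS:
        if getattr(contributor, field_name) is None:
            steps.append(f"provide {field_name}")
    return steps


def can_start_tasks(contributor: Contributor) -> bool:
    """A contributor can start tasks only when no step is missing."""
    return not missing_steps(contributor)
```

Which tasks are then offered would still depend on the user's level of access, which this sketch does not model.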

Data segregation

  • We store personal information of users (name, email, gender, year of birth, education level, and language variant) separately from the user recordings. This means there is no risk that people could find out the identity of the speakers.

  • We only add the user metadata (gender, year of birth, education level, and language variant) to the recording when we export the data. We never share the user’s name or email in the published dataset.

  • We collect recordings for Automatic Speech Recognition (ASR) models through specific workflows. We store them in separate datasets from recordings for Text to Speech (TTS) models. We do this because we don’t publish TTS datasets fully, but use them internally to train models. We publish only a partial set of the TTS recordings, within the ASR dataset, so that users can remain anonymous.
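To illustrate the segregation described above, here is a minimal sketch of an export step that keeps profiles and recordings in separate stores and joins them only at export time. The record shapes (`UserProfile`, `Recording`), the `user_id` link, and the `export_rows` function are hypothetical names chosen for this example. The real schema and export process are not described in this section.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class UserProfile:
    """Hypothetical profile record, kept apart from the recordings."""
    user_id: str
    name: str
    email: str
    gender: str
    year_of_birth: int
    education_level: str
    language_variant: str


@dataclass(frozen=True)
class Recording:
    """Hypothetical recording record; it holds no personal details."""
    recording_id: str
    user_id: str
    audio_path: str


# The only profile fields attached to a recording at export time.
EXPORT_FIELDS = ("gender", "year_of_birth", "education_level", "language_variant")


def export_rows(recordings: Iterable[Recording],
                profiles: Dict[str, UserProfile]) -> List[dict]:
    """Build export rows with demographic metadata but no name, email, or user ID."""
    rows = []
    for rec in recordings:
        profile = profiles[rec.user_id]
        row = {"recording_id": rec.recording_id, "audio_path": rec.audio_path}
        for field_name in EXPORT_FIELDS:
            row[field_name] = getattr(profile, field_name)
        rows.append(row)
    return rows
```

Using an explicit allow-list (`EXPORT_FIELDS`) rather than removing name and email from a full profile means a field added to the profile later is not exported by accident.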

Systems for storing and securing data

We store all voice recordings and metadata on secure servers. CLEAR Global manages and approves these servers. Our infrastructure ensures:

  • user-based access control, so access depends on the role of the user (e.g. reviewers, admins)

  • strong password and device policies across all accounts

  • data encryption so that unauthorized persons cannot access or change the data

  • automated and encrypted backups on a regular schedule

  • server hardening and monitoring (firewall, operating system patches, minimal access configurations)

  • event logging to detect any unusual activity or misuse
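As a rough illustration of the role-based access control mentioned in the list above, the sketch below maps each role to the actions it may perform and denies everything else. The reviewer and admin roles come from this section. The "contributor" role name, the action names, and the mapping itself are assumptions for the example, not the actual CLEAR Global configuration.

```python
# Hypothetical role-to-permission mapping; the real roles and actions may differ.
ROLE_PERMISSIONS = {
    "contributor": frozenset({"record"}),
    "reviewer": frozenset({"record", "review"}),
    "admin": frozenset({"record", "review", "export", "manage_users"}),
}


def is_allowed(role: str, action: str) -> bool:
    """Return True only if the role explicitly grants the action (deny by default)."""
    return action in ROLE_PERMISSIONS.get(role, frozenset())


if __name__ == "__main__":
    print(is_allowed("reviewer", "review"))  # reviewers can review recordings
    print(is_allowed("reviewer", "export"))  # but cannot export data in this sketch
```

Unknown roles get an empty permission set, so a missing or misspelled role never gains access.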
