1.3 Glossary of key terms

In this section, we explain the most common terms and concepts that we use in voice data collection and voice technology, to help you navigate and understand this playbook.

Artificial Intelligence (AI): Computer systems that can perform tasks that normally require human intelligence. Simply put: AI allows computers to do things that usually only humans can do.

Automatic Speech Recognition (ASR) or Speech-to-Text (STT): ASR converts spoken language into text. It is used in voice assistants and transcription services, for example. You may hear both terms, ASR and STT, and they are almost the same, but STT can be a semi-manual process, while ASR is fully automated. ASR is like a smart listener that turns spoken words into written text on your device. Although it creates the text automatically, it often needs some human input to make sure everything is correct, because it can make mistakes, especially in languages that are less used in the digital space.

Community members: People at the center of a project who contribute by recording and validating data. In this context, community refers to their shared interest in contributing to data collection and the languages they speak. Community members are well positioned to provide feedback on linguistic aspects and offer input on the project's direction and needs.

Consent form: A document that voice contributors are presented with before they start recording. It explains how their data will be collected and used. If they agree to the terms and sign the form, they can begin contributing. The consent form helps ensure that contributors understand any risks, know their rights, and can make an informed choice about sharing their voice data.

Contributor profiles: Categories of people who are suitable for making recordings when you are collecting voice data. A category may be based on native language, bilingual ability or level of linguistic expertise.

Corpus: A collection of text or speech data that you can use to train language technology models.

Creative Commons (CC) licenses: A set of standard copyright licenses that let the creator of a piece of work choose which rights they want to keep and which they will give up.

Data controller: An individual or organization that is responsible for processing personal data. They decide why and how they will process the data. The General Data Protection Regulation states that data controllers have to make sure that data processing is lawful, transparent, and secure. They also have to respect the rights of the people whose data they process.

Data incentive: Providing mobile internet data to contributors, both to motivate them and to ensure they have the connectivity they need to share their voice recordings over the internet.

Data segregation: Storing different types of data in separate systems or datasets. This keeps clear boundaries between their various uses, ensures that only the right people can access the data, and makes it possible to respect the choices of the people who provided it.

Dataset: A collection of information that has been organized for use. A voice dataset is a collection of voice recordings (paired with transcriptions) and additional information (metadata), such as the gender and age of the person recording, which shows how the dataset is constructed and helps to avoid bias. It is for use in research and for training or improving voice models.

Digital language divide: The large gap in access to technology between people who speak high-resource languages and those who speak low-resource languages. It means that many people cannot enjoy the benefits of digitalization.

Domain: The subject area, topic or setting in which we use language. Examples are healthcare, farming, or education. Each domain has its own special terms and language patterns.

Free-form prompts: Open-ended questions or instructions that result in natural, free speech.

The General Data Protection Regulation (GDPR): A European Union law that deals with personal data. It tells us how to collect, process, store, and share data. It makes sure people have control over their personal data. It also ensures that organizations handle such data in a transparent, secure, and lawful way.

Hugging Face: An online platform where researchers and developers share AI models, datasets, and applications.

Language Leads (LLs): Experienced language experts. They can do spot checks to ensure quality and can give feedback.

Language Technology (LT): Technologies that focus on human language, including both spoken and written language. They can process, understand, and generate language. Examples are the tools on your phone or computer that understand and generate words, like translation apps or voice assistants. They allow us to communicate and interact with our devices using language. When you record a message and the device transforms it into text, or your phone suggests the next word in a message, that’s language technology at work. It makes digital tools more accessible, interactive, and sometimes more efficient.

Leaderboard: A table view that displays the performance of all contributors and allows them to compare how well they are doing against others in real time.

Low-resource language: A language that has limited written or recorded materials and is rarely found in digital tools, data, or technology. This means that technologies like speech recognition or machine translation are often not available for the language and are difficult to build. This may be due to a lack of written content, digital resources (like websites or videos), or support from organizations that develop language technology.

Metadata: Extra information that gives some background to a dataset or parts of a dataset. Examples are demographics of the speakers (age, gender, accent), recording conditions (microphone type, level of background noise), or language-related details (dialect, speed of speaking, emotional tone).
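To make this concrete, the metadata for a single recording could be stored as a simple record like the sketch below. The field names and values are illustrative only, not a fixed or standard schema:

```python
# Hypothetical metadata record for one recording in a voice dataset.
# Field names and values are illustrative, not a standard schema.
clip_metadata = {
    "filename": "clip_001.wav",
    "speaker": {"age_range": "25-34", "gender": "female", "accent": "urban"},
    "recording": {"microphone": "smartphone", "background_noise": "low"},
    "language": {"dialect": "standard", "speaking_rate": "normal"},
}

# Metadata lets you filter or group recordings, e.g. keep only low-noise clips.
is_low_noise = clip_metadata["recording"]["background_noise"] == "low"
print(is_low_noise)  # → True
```

Structuring metadata this way makes it easy to check how balanced a dataset is across age, gender, or recording conditions.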

Middle-resource language: Languages with medium-sized populations of speakers, or that are moderately important in economic terms. They may have more resources, such as documents and an online presence, to support commercial speech technology, but still don’t have enough voice datasets, or a wide enough range of them, to build quality solutions.

Model: A computer-based system that has made use of data to learn patterns. It can make predictions or generate language. In speech technology, models use voice data to learn to recognize speech (converting audio to text) or to create speech (converting text to audio).

Normalization: Putting text into a standard format that is suitable for speech applications (for example, writing numbers out in words and expanding abbreviations).
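A minimal sketch of what normalization can look like in practice, using a few invented example rules (real pipelines are language-specific and much more thorough):

```python
import re

# Minimal text-normalization sketch with hypothetical rules:
# expand a few abbreviations and write small numbers out in words,
# so the text is suitable for speech applications.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}
NUMBER_WORDS = {"1": "one", "2": "two", "3": "three", "4": "four", "5": "five"}

def normalize(text: str) -> str:
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Replace standalone digits 1-5 with their word form.
    return re.sub(r"\b([1-5])\b", lambda m: NUMBER_WORDS[m.group(1)], text)

print(normalize("Dr. Ada lives at 3 Main St."))
# → Doctor Ada lives at three Main Street
```

A real normalizer would also handle dates, currencies, punctuation, and larger numbers, with rules that differ per language.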

Open and Responsible AI License (OpenRAIL): A licensing framework that allows wide use but also provides responsible guidelines for the use of AI.

Optical Character Recognition (OCR): Technology that converts pictures of written text (like scanned documents) into text that a computer can read. This makes it possible to take content from printed materials and process it digitally.

Programmatically: Accessing data through code and programming interfaces rather than by manually downloading it.
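For example, instead of manually downloading and opening files one by one, a short script can read a dataset's metadata and filter it in code. The dataset layout below is invented for illustration:

```python
import csv
import io

# Hypothetical metadata file for a voice dataset, held in memory here
# so the example is self-contained; in practice it would be a real CSV.
metadata_csv = io.StringIO(
    "filename,language,duration_s\n"
    "clip_001.wav,hausa,4.2\n"
    "clip_002.wav,kanuri,3.1\n"
    "clip_003.wav,hausa,5.0\n"
)

# Programmatic access: read and filter the records in code.
rows = list(csv.DictReader(metadata_csv))
hausa_clips = [r["filename"] for r in rows if r["language"] == "hausa"]
print(hausa_clips)  # → ['clip_001.wav', 'clip_003.wav']
```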

Rating: An individual quality assessment by a contributor. They either approve or reject a piece of language data (recording in the case of voice data) and base their rating on the pass/fail guidelines. Several ratings may be needed to decide whether this data has passed or failed for quality.

Read prompt: A sentence that contributors read aloud from a written prompt. These sentences will give you controlled samples of speech.

Recognition program: A system of rewards (points, certificates, or money) to motivate contributors and acknowledge the importance of their efforts.

Spontaneous voice data: Recordings of natural speech with no script. These will contain hesitations, restarts, varied speaking patterns, and everyday phrases. They sound more like the way people actually talk.

Spot checking: Random checking of recordings/ratings for quality to make sure they are consistent.

Target language: The language(s) for which you are collecting the voice data.

Test partition: A separate portion of data that is used to find out how well a model is working.
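The idea can be sketched in a few lines: hold out part of a (hypothetical) list of recordings, train on the rest, and evaluate only on the held-out part, so the score reflects performance on data the model has never seen.

```python
import random

# Sketch: split a hypothetical list of 100 recordings into train/test partitions.
recordings = [f"clip_{i:03d}.wav" for i in range(100)]

rng = random.Random(42)  # fixed seed so the split is reproducible
rng.shuffle(recordings)

split = int(len(recordings) * 0.9)  # 90% for training, 10% for testing
train, test = recordings[:split], recordings[split:]

print(len(train), len(test))  # → 90 10
```

Keeping the test partition strictly separate from the training data is what makes the evaluation trustworthy.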

Text-to-Speech (TTS): Technology that converts written text into spoken language. It can also generate audio speech. You can find TTS in accessibility tools like screen readers and in virtual assistants. It brings written texts to life through spoken words. If your phone reads out the messages you get, or an audiobook tells you your favorite story, this is because of TTS technology.

Training: The process of teaching a computer-based model to recognize patterns by showing it large amounts of data as examples.

Transcribe: Converting spoken words into written text.

Translators without Borders (TWB): A part of CLEAR Global, a non-profit organization that helps people to get information and have a voice, whatever language they speak.

TWB Community: A global community of over 100,000 linguists. They donate their time and skills to translate information for millions of people around the world. They help people to get the information they need and want.

TWB Voice: A platform for collecting voice data, developed and owned by CLEAR Global. By signing up to the TWB Community, users can make voice recordings for active data collection projects in TWB Voice. The main goal of TWB Voice is to help develop voice technology for speakers of marginalized languages, for example by creating the voice datasets needed to build language models for TTS and ASR.

Validation by rating: Listening to audio recordings to check them against set criteria, for example an exact match to a specific text, no pauses, and clear speech. Users with the right permissions may then decide whether a recording is accepted, rejected, or needs further validation.

Voice data: Audio recordings of human speech. These recordings capture the acoustic features of spoken language, such as pronunciation, speaking patterns, and rhythm.

Voice data collection: Gathering recordings of speech, with their transcriptions, in a systematic and ethical way. It also involves collecting demographic data (age, gender, accent), and for Automatic Speech Recognition it should include a range of speakers. The voice data is used in research and for training or developing voice language models.

Voice technology: Language technology that processes or generates spoken words.
