4.4 Reviewing the sentences
Automated text processing is powerful but not perfect. Even the most sophisticated systems can produce sentences that are not suitable for voice recording. This is why having the sentences reviewed by a human is a key step in building a high-quality collection of sentences.
The importance of human review
With automated extraction, you often get sentences that look okay at first glance but have subtle issues. For example, a sentence might be complete in terms of grammar but may include content that is culturally sensitive. Or it might use vocabulary from a different variant of the language. We need human judgement to identify and correct these issues.
Setting up a review process
A clear and simple approach is to put the sentences into a spreadsheet with headed columns for the review process. For example:
| Original Sentence | Revised Sentence | Normalized Version | Comments | Status |
|---|---|---|---|---|
| "The temp. was 32°C yesterday." | "The temperature was 32 degrees Celsius yesterday." | "The temperature was thirty-two degrees Celsius yesterday." | Expanded abbreviation and symbols | Approved |
| "Do NOT touch—dangerous!" | "Do not touch. It is dangerous." | "Do not touch. It is dangerous." | Fixed formatting and capitalization | Approved |
| "@everyone joins the meeting tmrw" | | | Contains social media formatting, informal | Rejected |
This structure allows reviewers to track changes, add normalized versions where needed, and document their decisions. When several people are working on a project, you can use a cloud-based spreadsheet so that multiple reviewers can work on it at the same time.
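As a rough illustration, a review sheet with these columns could be generated from a list of extracted sentences and then imported into a spreadsheet tool. This is a minimal sketch, not a prescribed workflow: the helper name `build_review_sheet` and the initial "Pending" status are assumptions made for the example.

```python
import csv
import io

# Column names follow the example table above.
COLUMNS = [
    "Original Sentence",
    "Revised Sentence",
    "Normalized Version",
    "Comments",
    "Status",
]

def build_review_sheet(sentences):
    """Return CSV text with one row per sentence and empty review columns.

    Hypothetical helper: "Pending" is an assumed starting status,
    not a value defined by the guide.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for sentence in sentences:
        # Columns not given here are written as empty cells.
        writer.writerow({"Original Sentence": sentence, "Status": "Pending"})
    return buffer.getvalue()

sheet = build_review_sheet([
    "The temp. was 32°C yesterday.",
    "@everyone joins the meeting tmrw",
])
print(sheet)
```

The resulting text can be saved as a `.csv` file and opened in any spreadsheet application, where reviewers fill in the remaining columns.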
One tip to make this step more efficient is to skip any sentence that would take a long time to revise and correct. Of course, this only works if you have plenty of sentences and can afford to lose some of them.
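Once the review is done, the approved sentences can be collected while skipped, rejected, and unreviewed rows are left out. The sketch below assumes the sheet has been exported as CSV with the column names from the example above. The function name `approved_sentences` and the "Skipped" status are hypothetical.

```python
import csv
import io

def approved_sentences(csv_text):
    """Return the final text of every row whose Status is Approved."""
    result = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Status"].strip().lower() != "approved":
            continue  # rejected, skipped, or not yet reviewed
        # Prefer the normalized version, then the revision, then the original.
        text = (
            row["Normalized Version"]
            or row["Revised Sentence"]
            or row["Original Sentence"]
        )
        result.append(text.strip())
    return result

# Small inline export; "Skipped" is an assumed status for sentences set aside.
example = """Original Sentence,Revised Sentence,Normalized Version,Comments,Status
The temp. was 32°C yesterday.,The temperature was 32 degrees Celsius yesterday.,The temperature was thirty-two degrees Celsius yesterday.,Expanded abbreviation and symbols,Approved
@everyone joins the meeting tmrw,,,"Contains social media formatting, informal",Rejected
A sentence that needs heavy rework,,,Too slow to fix,Skipped
"""

print(approved_sentences(example))
```

Only the first row is kept here. The skipped row is dropped in the same way as the rejected one.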
What to look for during review
When reviewing sentences, look at the following:
Language integrity: Make sure the sentence is in the target language and dialect. Be mindful, however, of the difference between natural code-mixing and unintended language mixing. Many languages include loanwords or code-switching as a part of everyday speech. For example, in Hindi, English words like "mobile," "computer," or "internet" are commonly used and accepted. Similarly, "sugar" is often used to mean diabetes in many South Asian languages.
Cultural fit: Check that content is culturally sensitive and fitting for voice contributors to read aloud. This means avoiding sentences that contain offensive language, references that are not appropriate for the culture in question, or content that may be traumatic.
Speakability: Read the sentence aloud to make sure it flows naturally. Some sentences may be grammatically correct but awkward or difficult to read aloud.
Consistency: Make sure that formatting, terminology, and style are the same across the sentence set. If you are not consistent, it can confuse both voice contributors and the models trained on this data.
Information that allows people to be identified: Remove or change any sentences that contain personal data such as real names, phone numbers, addresses, or ID numbers. For example, "Contact Dr. Sarah Johnson at +1-555-123-4567 for an appointment" should be changed to: "Contact the doctor at the clinic for an appointment." This protects privacy and ensures that your voice dataset doesn’t contain any sensitive information that could cause a problem if you publish it. Be very careful with medical, financial, or administrative texts, which often contain personal details.
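Some of these checks can be partly pre-screened before a human looks at the sentences. The sketch below flags sentences that probably need attention, using a few regular expressions. The patterns and the name `flag_sentence` are illustrative assumptions, not a complete or language-independent detector. In particular, they cannot recognize personal names, cultural issues, or awkward phrasing, so they only help reviewers prioritize and do not replace the review itself.

```python
import re

# Hypothetical heuristic patterns; adjust them for your language and data.
CHECKS = {
    "phone number": re.compile(r"\+?\d[\d\s().-]{6,}\d"),
    "email address": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "social media formatting": re.compile(r"(?<!\w)[@#]\w+"),
    "digits or symbols to normalize": re.compile(r"[\d%°$€£]"),
}

def flag_sentence(sentence):
    """Return the names of all checks that match the sentence."""
    return [name for name, pattern in CHECKS.items() if pattern.search(sentence)]

samples = [
    "Contact Dr. Sarah Johnson at +1-555-123-4567 for an appointment",
    "@everyone joins the meeting tmrw",
    "Contact the doctor at the clinic for an appointment.",
]
for sample in samples:
    print(sample, "->", flag_sentence(sample))
```

A sentence with no flags still needs to be read by a reviewer. The flags are only a hint about where to look first.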