# 4.4 Reviewing the sentences

Automated text processing is powerful but not perfect. Even the most sophisticated systems can produce sentences that are not suitable for voice recording. This is why having the sentences reviewed by a human is a key step in building a high-quality collection of sentences.

### The importance of human review

With automated extraction, you often get sentences that look okay at first glance but have subtle issues. For example, a sentence might be complete in terms of grammar but may include content that is culturally sensitive. Or it might use vocabulary from a different variant of the language. We need human judgement to identify and correct these issues.

### Setting up a review process

A clear and simple approach is to put the sentences into a spreadsheet with headed columns for the review process. For example:

| Original Sentence                  | Revised Sentence                                    | Normalized Version                                          | Comments                                   | Status   |
| ---------------------------------- | --------------------------------------------------- | ----------------------------------------------------------- | ------------------------------------------ | -------- |
| "The temp. was 32°C yesterday."    | "The temperature was 32 degrees Celsius yesterday." | "The temperature was thirty-two degrees Celsius yesterday." | Expanded abbreviation and symbols          | Approved |
| "Do NOT touch—dangerous!"          | "Do not touch. It is dangerous."                    | "Do not touch. It is dangerous."                            | Fixed formatting and capitalization        | Approved |
| "@everyone joins the meeting tmrw" |                                                     |                                                             | Contains social media formatting, informal | Rejected |

This structure allows reviewers to track changes, add normalized versions if needed, and document their decisions. When several people are working on a project, use a cloud-based spreadsheet so that multiple reviewers can work on the sentences at the same time.
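
If the sentences come from an automated extraction step, the review sheet can be generated rather than typed by hand. The following is a minimal sketch using Python's standard `csv` module. The column names come from the table above; the function name `write_review_sheet` and the choice of CSV as the exchange format are assumptions made for illustration, not part of the playbook.

```python
import csv

# Column headings taken from the example review table.
COLUMNS = [
    "Original Sentence",
    "Revised Sentence",
    "Normalized Version",
    "Comments",
    "Status",
]


def write_review_sheet(sentences, path):
    """Write one row per sentence, leaving the review columns empty."""
    # "utf-8-sig" adds a byte-order mark, which helps some spreadsheet
    # programs detect the encoding of non-Latin scripts.
    with open(path, "w", newline="", encoding="utf-8-sig") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, restval="")
        writer.writeheader()
        for sentence in sentences:
            # Missing columns are filled with restval (an empty string).
            writer.writerow({"Original Sentence": sentence})
```

Calling `write_review_sheet(sentences, "review_sheet.csv")` with a list of extracted sentences should produce a file that can be imported into a cloud-based spreadsheet for the reviewers to fill in.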

One tip to make this step more efficient is to skip any sentence that would take a long time to revise and correct. This only works if you have plenty of sentences and can afford to lose some of them.
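
Skipping in this way means that only rows a reviewer has explicitly approved move on to recording. As a rough sketch, assuming the review sheet has been loaded as a list of dictionaries keyed by the column headings in the table above (for example with `csv.DictReader`), the selection could look like this. The function name `select_for_recording` is hypothetical, and so is the rule of preferring the normalized version over the revised one.

```python
def select_for_recording(rows):
    """Return the text to record for each approved row; skip everything else."""
    selected = []
    for row in rows:
        status = (row.get("Status") or "").strip().lower()
        if status != "approved":
            continue  # rejected, skipped, or not yet reviewed
        # Prefer the most processed version that the reviewer filled in.
        for column in ("Normalized Version", "Revised Sentence", "Original Sentence"):
            text = (row.get(column) or "").strip()
            if text:
                selected.append(text)
                break
    return selected
```

Rows with an empty `Status` are dropped along with rejected ones, so a skipped sentence costs nothing beyond the sentence itself.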

### What to look for during review

When reviewing sentences, look at the following:

**Language integrity:** Make sure the sentence is in the target language[^1] and dialect. But be mindful of the difference between natural code-mixing and unintended language mixing. Many languages include loanwords or code-switching as a part of everyday speech. For example, in Hindi, English words like "mobile," "computer," or "internet" are commonly used and accepted. Similarly, "sugar" is often used to mean diabetes in many South Asian languages.

**Cultural fit:** Check that content is culturally sensitive and fitting for voice contributors to read aloud. This means avoiding sentences that contain offensive language, references that are not appropriate for the culture in question, or content that may be traumatic.

**Speakability:** Read the sentence aloud to make sure it flows naturally. Some sentences may be correct in grammar terms but awkward and difficult to read out.

**Consistency:** Make sure that formatting, terminology, and style are the same across the sentence set. Inconsistency can confuse both voice contributors and the models trained on this data.

**Information that allows people to be identified:** Remove or change any sentences that contain personal data such as real names, phone numbers, addresses, or ID numbers. For example, "Contact Dr. Sarah Johnson at +1-555-123-4567 for an appointment" should be changed to: "Contact the doctor at the clinic for an appointment." This protects privacy and ensures that your voice dataset[^2] doesn’t contain any sensitive information that could cause a problem if you publish it. Be very careful with medical, financial, or administrative texts that often contain personal details.
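
A simple pattern check can help reviewers find candidate sentences faster, although it cannot replace human judgement. The sketch below flags sentences that may contain a phone number, an email address, or a title followed by a capitalized name. The patterns and the function name `flag_possible_pii` are illustrative assumptions; they are tuned to the English example above and would need adapting to the target language, script, and local number formats. Expect both false positives and missed cases.

```python
import re

# Illustrative patterns only; they will miss some personal data and
# flag some harmless sentences, so a human must check every result.
PII_PATTERNS = {
    "phone number": re.compile(r"\+?\d[\d\s().-]{6,}\d"),
    "email address": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "title followed by a name": re.compile(r"\b(?:Dr|Mr|Mrs|Ms|Prof)\.?\s+[A-Z]\w+"),
}


def flag_possible_pii(sentence):
    """Return the labels of every pattern that matches the sentence."""
    return [
        label
        for label, pattern in PII_PATTERNS.items()
        if pattern.search(sentence)
    ]
```

For the example sentence above, this returns `['phone number', 'title followed by a name']`, while the rewritten version returns an empty list. The labels could be copied into the Comments column so that the reviewer knows why a sentence was flagged.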


[^1]: **Target language:** The language(s) for which you are collecting the voice data.

[^2]: **Dataset:** A collection of information that has been organized for use. A **voice dataset** is a collection of voice recordings (paired with transcriptions) plus additional information (metadata), such as the gender and age of the person recording. The metadata describes how the dataset is constructed and helps to avoid bias. A voice dataset is used in research and for training or improving voice models.

