# 4.3 Automatic and manual processing of sentences

This section explains how to turn raw text documents into a high-quality set of sentences that are ready to record and suitable for training[^1] models[^2], using both automatic and manual techniques.

### Understanding text processing pipelines

A text processing pipeline typically follows these stages:

1. **Document collection:** Gathering source materials in various formats (PDF, HTML, Word, plain text)
2. **Text extraction:** Converting documents to plain text (may involve OCR for scanned materials)
3. **Sentence segmentation:** Breaking text into individual sentences
4. **Filtering:** Removing unsuitable sentences based on predefined criteria
5. **Normalization:** Converting all text to a standardized format for speech applications
6. **Final review:** Manual checking and correction of the processed sentences

For a practical example of a text processing project, you can explore this open-source [Aranese text corpus creation project](https://github.com/CollectivaT-dev/aranese-text-corpus/tree/main). This repository contains a Python notebook that implements the entire pipeline for Occitan Aranese, a [low-resource language](#user-content-fn-3)[^3] spoken in Spain.

If you are ready to explore each step in detail, continue reading below. If you would rather keep getting familiar with the more general aspects of data collection, you can skip to [4.4 Reviewing the sentences](/4.-guidelines-for-sentence-and-prompt-collection/4.4-reviewing-the-sentences.md). You can always come back here later!

Let's explore each stage in more detail.

### Text extraction techniques

Different document formats require different extraction approaches:

* **Plain text files (.txt):** These can be processed directly
* **Structured documents (Word, HTML):** Use libraries like BeautifulSoup (for HTML) or docx2txt (for Word) to extract text while preserving structure
* **PDFs:** Use libraries like PyPDF2 or pdfplumber to extract text
* **Scanned documents:** Require [Optical Character Recognition](#user-content-fn-4)[^4] (OCR)
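
As a rough illustration of extracting text from a structured document, here is a minimal sketch for HTML. It uses Python's built-in `html.parser` as a dependency-free stand-in for BeautifulSoup; the class name `TextExtractor`, the helper `html_to_text`, and the choice of block-level tags are assumptions made for this example, not part of any project mentioned here.

```python
from html.parser import HTMLParser

# Tags whose contents are never visible text (assumed list for this sketch)
SKIPPED_TAGS = {"script", "style"}
# Tags after which a line break is inserted (assumed list for this sketch)
BLOCK_TAGS = {"p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collect visible text, one block-level element per line."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "br":
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        raw = "".join(self._chunks)
        # Collapse whitespace inside each line and drop empty lines
        lines = [" ".join(line.split()) for line in raw.splitlines()]
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Real web pages usually also contain menus, footers, and other boilerplate, so a dedicated library or some manual cleanup is likely still needed.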

The code below uses the Tesseract OCR engine with the Occitan language model ("oci") to convert image files extracted from PDFs into text. For low-resource languages where no OCR model exists, you might need to use a model from a closely related language or train a custom model.

```python
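from PIL import Image  # provided by the Pillow package
import pytesseract

# Extract text from PDF images using OCR
def ocr_image_to_text(image_path):
    image = Image.open(image_path)
    # "oci" selects the Occitan model, which must be installed for Tesseract
    text = pytesseract.image_to_string(image, lang="oci")
    return text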
```

It's important to note that OCR output often contains errors, especially for low-resource languages or documents with poor image quality. These errors can include character misrecognition, incorrect word breaks, and formatting issues. This is why the Aranese corpus[^5] project implements a post-OCR correction step using a large language model.

For your language, you can implement OCR correction through:

1. **Rule-based corrections:** Create a set of rules to fix common OCR errors in your language, such as frequently misrecognized characters or typical formatting issues.
2. **Dictionary-based corrections:** Use a dictionary of words in your language to detect and correct non-existent words that might be OCR errors.
3. **LLM-assisted corrections:** As shown in the Aranese project, you can use LLMs with clear instructions to clean up OCR output. This works particularly well for languages the LLM has been trained on. You can improve results by giving the LLM specific information about your language's orthography, such as its character set or a vocabulary list.
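
The first two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rules in `OCR_RULES` are hypothetical examples of common OCR confusions, the function names are invented for this sketch, and `vocabulary` stands for whatever word list you have for your language.

```python
import difflib
import re

# Hypothetical rule set: each entry is (compiled pattern, replacement)
OCR_RULES = [
    (re.compile(r"(\w)-\n(\w)"), r"\1\2"),       # rejoin words hyphenated across line breaks
    (re.compile(r"(?<=[a-z])1(?=[a-z])"), "l"),  # digit 1 inside a lowercase word becomes letter l
    (re.compile(r"[ \t]+"), " "),                # collapse runs of spaces and tabs
]


def apply_ocr_rules(text):
    """Rule-based correction: apply every rule in order."""
    for pattern, replacement in OCR_RULES:
        text = pattern.sub(replacement, text)
    return text


def find_unknown_words(text, vocabulary):
    """Dictionary-based detection: list words that are not in the vocabulary."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return sorted({word for word in words if word not in vocabulary})


def suggest_correction(word, vocabulary):
    """Suggest the closest vocabulary word, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(word, sorted(vocabulary), n=1, cutoff=0.8)
    return matches[0] if matches else None
```

Unknown words are not always errors (names and loanwords are common), so it is safer to send suggestions to a human reviewer than to apply them automatically.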

### Sentence segmentation

Accurate sentence segmentation is essential for voice recording, as recordings are typically organized sentence by sentence. Punctuation marks like periods, question marks, and exclamation points usually indicate sentence boundaries.

Basic segmentation can be accomplished using regular expressions or natural language processing libraries. For example:

```python
import re

# Simple segmentation using regular expressions
def segment_sentences(text):
    # Split at whitespace that follows a period, exclamation mark, or question mark.
    # The lookbehind keeps the punctuation attached to its sentence.
    sentences = re.split(r'(?<=[.!?])\s+', text)
    # Clean up each sentence
    sentences = [s.strip() for s in sentences if s.strip()]
    return sentences
```

For more sophisticated segmentation, specialized libraries are available:

```python
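from nltk.tokenize import sent_tokenize

# More advanced segmentation using NLTK.
# Requires NLTK's Punkt tokenizer data, e.g. nltk.download("punkt")
# (recent NLTK releases use "punkt_tab" instead).
# `text` is a string holding the extracted document text.
sentences = sent_tokenize(text, language='english')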
```

These libraries work well for high-resource languages and for many languages written in the Latin script. For less-resourced languages, you might need to develop custom rules based on your language's punctuation patterns.
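
One common custom rule is to avoid splitting after abbreviations. The sketch below shows one possible way to do this; the `ABBREVIATIONS` set is a hypothetical English example, and you would replace it (and possibly the punctuation characters in the pattern) with those of your language.

```python
import re

# Hypothetical abbreviation list; replace with the abbreviations of your language
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "st."}


def segment_with_abbreviations(text):
    """Split after ., ! or ? unless the preceding word is a known abbreviation."""
    # Candidate boundaries: whitespace that follows sentence-final punctuation
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = []
    buffer = ""
    for piece in pieces:
        if not piece:
            continue
        buffer = f"{buffer} {piece}".strip()
        last_word = buffer.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # not a real boundary, keep accumulating
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences
```

A limitation of this approach is that an abbreviation that really does end a sentence will be merged with the sentence that follows, so the output still benefits from a manual check.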

### Filtering sentences

Not all extracted sentences will be suitable for voice recording. It's essential to filter sentences based on several criteria:

1. **Length:** Very short sentences may not provide enough context, while very long sentences can be difficult to record
2. **Character set:** Sentences containing symbols or characters outside your language's alphabet should be filtered out
3. **Content:** Sentences with sensitive content, code snippets, tabular data, or excessive technical jargon should be removed
4. **Duplicates:** Duplicate sentences should be removed to maximize diversity
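
The length, character set, and duplicate criteria can be sketched as a single filter function. The values below are assumptions for illustration: the 5 to 400 character range mirrors the example prompt later in this section, and `ALLOWED_CHARS` is a placeholder pattern for basic Latin letters and punctuation that you would replace with your language's alphabet. Content filtering is not covered here, as it usually needs word lists or manual review.

```python
import re

# Assumed values for illustration; adjust them to your language and project
MIN_CHARS = 5
MAX_CHARS = 400
ALLOWED_CHARS = re.compile(r"[a-zA-Z .,;:!?'\-]+")  # placeholder alphabet


def filter_sentences(sentences):
    """Keep sentences that pass the length, character set, and duplicate checks."""
    seen = set()
    kept = []
    for sentence in sentences:
        sentence = sentence.strip()
        # Length check
        if not MIN_CHARS <= len(sentence) <= MAX_CHARS:
            continue
        # Character set check: every character must be in the allowed set
        if not ALLOWED_CHARS.fullmatch(sentence):
            continue
        # Duplicate check, ignoring differences in letter case
        key = sentence.lower()
        if key in seen:
            continue
        seen.add(key)
        kept.append(sentence)
    return kept
```

Note that the character set check also rejects sentences containing digits, which would otherwise need to be normalized before recording.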

### Using AI to assist with script development

Modern large language models (LLMs) can help write and refine text processing scripts. For example, you could prompt an LLM with:

```markup
I have a directory of text files in [LANGUAGE] that I need to process 
for a voice corpus. Write a Python script that:
1. Extracts sentences from each file
2. Keeps only sentences that are between 5 and 400 characters
3. Removes sentences with non-[LANGUAGE] characters
4. Removes duplicate sentences
5. Outputs the results to a single file with one sentence per line

Include error handling and progress reporting.
```

This prompt can be improved by:

1. Specifying the exact alphabet of your language
2. Providing examples of "good" and "bad" sentences
3. Including any special rules for your language's punctuation
4. Describing any format-specific handling (PDFs, HTML, etc.)
5. Mentioning any domain[^6]-specific filtering (e.g., excluding medical terminology)

### Sentence normalization

Normalization is the process of converting text to a standard format suitable for speech applications. This is particularly important for ASR ([Automatic Speech Recognition](#user-content-fn-7)[^7]) and TTS (Text-to-Speech)[^8] models, which need consistent representations of words.

Key normalization tasks include:

1. **Expanding abbreviations:** e.g. converting "Dr." to "Doctor", "Mr." to "Mister", etc.
2. **Converting numbers:** e.g. writing out "123" as "one hundred twenty-three"
3. **Handling dates and times:** e.g. writing out "08/12/2024" as "August twelfth, two thousand twenty-four" (assuming a month/day/year format)
4. **Standardizing symbols:** e.g. converting "&" to "and", "@" to "at", etc.
5. **Removing unnecessary punctuation:** e.g. removing hyphens, slashes, and other symbols that wouldn't be spoken

For complex normalization, you might need language-specific rules and dictionaries.
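
To make the first, second, and fourth tasks concrete, here is a minimal sketch for English. Everything in it is an illustrative assumption: the lookup tables are tiny examples, `number_to_words` only handles integers from 0 to 999, and dates, decimals, and larger numbers are left untouched.

```python
import re

# Tiny example lookup tables (assumptions for this sketch)
ABBREVIATIONS = {"Dr.": "Doctor", "Mr.": "Mister"}
SYMBOLS = {"&": "and", "@": "at"}

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer from 0 to 999 in English."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def normalize(sentence):
    """Expand abbreviations and symbols, then spell out short numbers."""
    for abbreviation, expansion in ABBREVIATIONS.items():
        sentence = sentence.replace(abbreviation, expansion)
    for symbol, word in SYMBOLS.items():
        sentence = sentence.replace(symbol, f" {word} ")
    # Spell out standalone numbers of one to three digits
    sentence = re.sub(r"\b\d{1,3}\b",
                      lambda match: number_to_words(int(match.group())),
                      sentence)
    # Collapse any extra spaces introduced above
    return " ".join(sentence.split())
```

For example, `normalize("Dr. Lee & Mr. Kim paid 123 dollars")` returns `"Doctor Lee and Mister Kim paid one hundred twenty-three dollars"`. Number and date wording differs a lot between languages, so the tables and rules need to be written for each language separately.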

### Practical considerations for large projects

When working with large text collections:

1. **Processing time:** OCR and complex text processing can be time-consuming. For large datasets[^9], consider splitting the work across multiple CPU cores (parallel processing).
2. **Storage management:** Maintain organized directories for raw text, processed text, and final sentences.
3. **Quality control:** Implement random sampling to check the quality of automatically processed sentences.
4. **Iterative refinement:** Start with a small sample, refine your process, then apply to the full dataset.
5. **Documentation:** Keep detailed records of all processing steps and decisions, especially custom rules.
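
The quality control step can be as simple as drawing a reproducible random sample for a reviewer to read. This is a minimal sketch; the function name, the default sample size of 100, and the fixed seed are assumptions chosen for illustration.

```python
import random


def sample_for_review(sentences, sample_size=100, seed=42):
    """Return a reproducible random sample of sentences for manual review."""
    if len(sentences) <= sample_size:
        return list(sentences)
    # A fixed seed gives the same sample on every run, which helps when
    # several reviewers need to look at the same sentences
    rng = random.Random(seed)
    return rng.sample(sentences, sample_size)
```

If the reviewer finds many problems in the sample, that is a signal to go back and refine the filtering or normalization rules before processing the full dataset.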

[^1]: **Training**: Teaching a computer-based model to recognize patterns. To do this, you need to show it large amounts of data as examples.

[^2]: **Model:** A computer-based system that has made use of data to learn patterns. It can make predictions or generate language. In speech technology, models use voice data to learn to recognize speech (converting audio to text) or to create speech (converting text to audio).

[^3]: **Low-resource language:** A language that has limited tools, data, or technology. This means that things like speech recognition, machine translation, or other language technologies are often not available. This may be because there is not enough written material or digital resources, such as websites or videos in this language, or because there is not enough support from organizations that could make language technology.

[^4]: **Optical Character Recognition (OCR):** Technology that converts pictures of written text (like scanned documents) into text that a computer can read. This makes it possible to take content from printed materials and process it digitally.

[^5]: **Corpus:** A collection of text or speech data that you can use to train language technology.

[^6]: **Domain:** The subject area or setting in which we use language. Examples are healthcare, farming, or education. Each domain has its own special terms and language patterns.

[^7]: **Automatic Speech Recognition (ASR) or Speech-to-Text (STT):** ASR converts spoken language into text. It can be used in voice assistants and transcription services, for example. You may hear both terms, ASR and STT, and they are almost the same. But STT can be a semi-manual process, while ASR is fully automated. ASR is like a smart listener that turns spoken words into written text on your device. While it does create the text automatically, it often needs some human input to make sure everything is correct. It can make mistakes, especially in languages that are less used in the digital space.

[^8]: **Text-to-Speech (TTS):** Technology that converts written text into spoken language. It can also generate audio speech. You can find TTS in accessibility tools like screen readers and in virtual assistants. It brings written texts to life through spoken words. If your phone reads out the messages you get, or an audiobook tells you your favorite story, this is because of TTS technology.

[^9]: **Dataset:** A collection of information that has been organized for use. A **voice dataset** is a collection of voice recordings (paired with transcriptions) with additional information (metadata), such as the gender and age of the person recording, that shows how the dataset is constructed and helps to avoid bias. It is for use in research and for training or improving voice models.

