4.3 Automatic and manual processing of sentences
This section explains how to transform raw text documents into a high-quality sentence set that is ready to record, using both automatic and manual techniques.
Understanding text processing pipelines
A text processing pipeline typically follows these stages:
Document collection: Gathering source materials in various formats (PDF, HTML, Word, plain text)
Text extraction: Converting documents to plain text (may involve OCR for scanned materials)
Sentence segmentation: Breaking text into individual sentences
Filtering: Removing unsuitable sentences based on predefined criteria
Normalization: Converting all text to a standardized format for speech applications
Final review: Manual checking and correction of the processed sentences
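As a rough illustration, these stages can be chained into a single driver function. The helper names below (extract_text, filter_sentences, normalize_sentence) are placeholders for the concrete steps sketched later in this section, not functions from the Aranese project.
# Minimal sketch of the full pipeline for one document
def process_document(path):
    raw_text = extract_text(path)                 # text extraction (possibly OCR)
    sentences = segment_sentences(raw_text)       # sentence segmentation
    sentences = filter_sentences(sentences)       # drop unsuitable sentences
    sentences = [normalize_sentence(s) for s in sentences]  # normalization
    return sentences                              # ready for final manual review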
For a practical example of a text processing project, you can explore this open-source Aranese text corpus creation project. This repository contains a Python notebook that implements the entire pipeline for Occitan Aranese, a language spoken in Spain.
If you are curious or ready to explore the details of each step, continue reading below. If you would rather keep getting familiar with other, more general aspects of data collection, you can skip to 4.4 Reviewing the sentences. You can always come back here later!
Let's explore each stage in more detail.
Text extraction techniques
Different document formats require different extraction approaches:
Plain text files (.txt): These can be processed directly
Structured documents (Word, HTML): Use libraries like BeautifulSoup (for HTML) or docx2txt (for Word) to extract text while preserving structure
PDFs: Use libraries like PyPDF2 or pdfplumber to extract text (see the sketch after this list)
Scanned documents: Require optical character recognition (OCR)
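For PDFs that already contain a text layer, extraction can be quite short. The sketch below uses pdfplumber; the function name and structure are illustrative rather than taken from the Aranese project.
import pdfplumber

# Extract the embedded text layer from a PDF, page by page (no OCR involved)
def extract_pdf_text(pdf_path):
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            pages.append(page.extract_text() or "")  # extract_text() returns None for empty pages
    return "\n".join(pages)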
The OCR code below uses the Tesseract OCR engine with the Occitan language model ("oci") to convert image files extracted from PDFs into text. For low-resource languages where OCR models may not exist, you might need to use a model from a closely related language or train a custom model.
import pytesseract
from PIL import Image

# Extract text from PDF images using OCR
def ocr_image_to_text(image_path):
    # Open the page image and run Tesseract with the Occitan language model
    image = Image.open(image_path)
    text = pytesseract.image_to_string(image, lang="oci")
    return text
It's important to note that OCR output often contains errors, especially for low-resource languages or documents with poor image quality. These errors can include character misrecognition, incorrect word breaks, and formatting issues. This is why the Aranese project implements a post-OCR correction step using a large language model.
For your language, you can implement OCR correction through:
Rule-based corrections: Create a set of rules to fix common OCR errors in your language, such as frequently misrecognized characters or typical formatting issues (a small sketch follows this list).
Dictionary-based corrections: Use a dictionary of words in your language to detect and correct non-existent words that might be OCR errors.
LLM-assisted corrections: As shown in the Aranese project, you can use LLMs with clear instructions to clean up OCR output. This works particularly well for languages the LLM has been trained on. You can enhance results by providing the LLM with specific rules about your language's orthography like the character set of your language or a vocabulary list.
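To illustrate the first two approaches, the sketch below applies a small table of character-level fixes and flags words that are missing from a word list. The specific rules and the word list are invented for the example; adapt them to the errors you actually see in your OCR output.
import re

# Illustrative character fixes for frequent OCR confusions (invented examples)
OCR_FIXES = {"0": "o", "1": "l"}

def rule_based_correction(text):
    # Only replace digits that appear between letters, where they are almost
    # certainly misread characters rather than real numbers
    for wrong, right in OCR_FIXES.items():
        text = re.sub(rf"(?<=[^\W\d_]){wrong}(?=[^\W\d_])", right, text)
    return text

def flag_unknown_words(text, word_list):
    # Dictionary-based check: words absent from the word list are candidates
    # for manual review or automatic correction
    words = re.findall(r"[^\W\d_]+", text)
    return [w for w in words if w.lower() not in word_list]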
Sentence segmentation
Accurate sentence segmentation is essential for voice recording, as recordings are typically organized sentence by sentence. Punctuation marks like periods, question marks, and exclamation points usually indicate sentence boundaries.
Basic segmentation can be accomplished using regular expressions or natural language processing libraries. For example:
import re
# Simple segmentation using regular expressions
def segment_sentences(text):
    # Split after a period, exclamation mark, or question mark followed by whitespace,
    # keeping the punctuation attached to its sentence
    sentences = re.split(r'(?<=[.!?])\s+', text)
    # Clean up each sentence and drop empty strings
    sentences = [s.strip() for s in sentences if s.strip()]
    return sentences
For more sophisticated segmentation, specialized libraries are available:
import nltk
from nltk.tokenize import sent_tokenize

# More advanced segmentation using NLTK (requires the Punkt sentence models)
nltk.download('punkt')
sentences = sent_tokenize(text, language='english')
While these libraries work well for high-resource and/or Latin-script languages, for less-resourced languages you might need to develop custom rules based on your language's punctuation patterns, as in the sketch below.
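A common custom rule is to protect known abbreviations before splitting so that their periods are not treated as sentence boundaries. The abbreviation list below is purely illustrative.
import re

# Abbreviations that should not end a sentence (illustrative; replace with
# the abbreviations actually used in your language's texts)
ABBREVIATIONS = ["Sr.", "Sra.", "Dr."]

def segment_with_custom_rules(text):
    # Temporarily mask the period in each abbreviation so it does not
    # trigger a sentence split
    for abbr in ABBREVIATIONS:
        text = text.replace(abbr, abbr.replace(".", "<DOT>"))
    sentences = re.split(r'(?<=[.!?])\s+', text)
    # Restore the masked periods and clean up
    return [s.replace("<DOT>", ".").strip() for s in sentences if s.strip()]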
Filtering sentences
Not all extracted sentences will be suitable for voice recording. It's essential to filter sentences based on several criteria:
Length: Very short sentences may not provide enough context, while very long sentences can be difficult to record
Character set: Sentences containing symbols or characters outside your language's alphabet should be filtered out
Content: Sentences with sensitive content, code snippets, tabular data, or excessive technical jargon
Duplicates: Remove duplicate sentences to maximize diversity
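A simple filter that combines these criteria could look like the sketch below. The length limits and allowed-character set are placeholders; adjust them to your language, and note that content filtering (sensitive material, jargon) usually still needs manual review.
import re

# Placeholder limits and character set; adapt these to your language
MIN_LEN, MAX_LEN = 5, 400
ALLOWED_CHARS = re.compile(r"^[A-Za-zÀ-ÿ0-9\s.,;:!?'\"()\-]+$")

def filter_sentences(sentences):
    kept, seen = [], set()
    for s in sentences:
        if not (MIN_LEN <= len(s) <= MAX_LEN):  # length check
            continue
        if not ALLOWED_CHARS.match(s):          # character-set check
            continue
        if s in seen:                           # duplicate check
            continue
        seen.add(s)
        kept.append(s)
    return kept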
Using AI to assist with script development
Modern large language models (LLMs) can help write and refine text processing scripts. For example, you could prompt an LLM with:
I have a directory of text files in [LANGUAGE] that I need to process
for a voice corpus. Write a Python script that:
1. Extracts sentences from each file
2. Keeps only sentences that are between 5 and 400 characters
3. Removes sentences with non-[LANGUAGE] characters
4. Removes duplicate sentences
5. Outputs the results to a single file with one sentence per line
Include error handling and progress reporting.
This prompt can be improved by:
Specifying the exact alphabet of your language
Providing examples of "good" and "bad" sentences
Including any special rules for your language's punctuation
Describing any format-specific handling (PDFs, HTML, etc.)
Mentioning any domain-specific filtering (e.g., excluding medical terminology)
Sentence normalization
Normalization is the process of converting text to a standard format suitable for speech applications. This is particularly important for ASR (automatic speech recognition) and TTS (text-to-speech) models, which need consistent representations of words.
Key normalization tasks include:
Expanding abbreviations, e.g. converting "Dr." to "Doctor", "Mr." to "Mister", etc.
Converting numbers, e.g. writing out "123" as "one hundred twenty-three"
Handling dates and times, e.g. formatting "08/12/2024" as "August twelfth, two thousand twenty-four"
Standardizing symbols, e.g. converting "&" to "and", "@" to "at", etc.
Removing unnecessary punctuation e.g. removing hyphens, slashes, and other symbols that wouldn't be spoken
For complex normalization, you might need language-specific rules and dictionaries.
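A minimal rule-based sketch for two of these tasks is shown below; the abbreviation and symbol tables are illustrative and would need to be replaced with entries for your language. Number and date expansion typically needs language-specific rules or a dedicated library.
import re

# Illustrative expansion tables; replace them with entries for your language
ABBREVIATIONS = {"Dr.": "Doctor", "Mr.": "Mister"}
SYMBOLS = {"&": "and", "@": "at"}

def normalize_sentence(sentence):
    # Expand abbreviations
    for abbr, full in ABBREVIATIONS.items():
        sentence = sentence.replace(abbr, full)
    # Replace symbols with their spoken form
    for sym, word in SYMBOLS.items():
        sentence = sentence.replace(sym, f" {word} ")
    # Collapse any double spaces introduced by the replacements
    return re.sub(r"\s+", " ", sentence).strip()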
Practical considerations for large projects
When working with large text collections:
Processing time: OCR and complex text processing can be time-consuming. Consider dividing the work across multiple CPU cores (parallel processing) for large collections; see the sketch after this list.
Storage management: Maintain organized directories for raw text, processed text, and final sentences.
Quality control: Implement random sampling to check the quality of automatically processed sentences.
Iterative refinement: Start with a small sample, refine your process, then apply to the full dataset.
Documentation: Keep detailed records of all processing steps and decisions, especially custom rules.
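As a rough sketch of the parallel-processing suggestion above, Python's built-in concurrent.futures module can spread per-file work across CPU cores. The process_document function here stands in for whatever per-file processing you use.
from concurrent.futures import ProcessPoolExecutor

# Run a per-file processing function (e.g. OCR + segmentation) on many files in parallel
def process_all(paths, process_document, workers=4):
    results = []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for path, sentences in zip(paths, pool.map(process_document, paths)):
            print(f"processed {path}: {len(sentences)} sentences")  # simple progress reporting
            results.extend(sentences)
    return results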