4.2 License issues
You need a good understanding of licensing when you collect sentences from existing sources. If you use content without proper permission, you may face legal issues. There may also be limits on how you can use or share your voice data.
Types of license
These are the main categories of content license:
Closed/proprietary licenses: These restrict rights to use the content. You will normally have to get explicit permission to use it. Examples include copyrighted news articles, books, and most commercial content.
Open licenses: These allow various forms of reuse, with different levels of freedom. The most common open licenses are Creative Commons licenses.
Public domain: This is content with no copyright restrictions. Anyone can freely use it without permission.
Understanding Creative Commons licenses
Creative Commons (CC) licenses are a standardized way for content creators to give permission for their work to be used. Here are the most common types of CC license:
CC BY (Attribution): You can use, change, and redistribute the content, even for commercial use. But you must credit the original creator.
CC BY-SA (Attribution-ShareAlike): Similar to CC BY, but derivative works based on the original content also have to use the same license.
CC BY-NC (Attribution-NonCommercial): Allows reuse if you credit the creator, but commercial use is not allowed.
CC BY-ND (Attribution-NoDerivatives): Allows redistribution if you credit the creator, but you are not allowed to make changes.
CC0 (Public Domain Dedication): Creator gives up all rights and places the work in the public domain.
For more detailed information about Creative Commons licenses, go to the Creative Commons website.
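The license descriptions above can be summarized as a small lookup table. The following is a minimal sketch, not legal advice: the class `LicenseTerms`, the table `CC_LICENSES`, and the helper `licenses_allowing` are hypothetical names chosen for illustration, and the table simplifies each license to four yes/no properties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseTerms:
    """Simplified summary of what a license allows (not legal advice)."""
    attribution_required: bool
    commercial_use: bool
    modifications: bool
    share_alike: bool


# Hypothetical lookup table summarizing the license descriptions above.
CC_LICENSES = {
    "CC BY": LicenseTerms(True, True, True, False),
    "CC BY-SA": LicenseTerms(True, True, True, True),
    "CC BY-NC": LicenseTerms(True, False, True, False),
    "CC BY-ND": LicenseTerms(True, True, False, False),
    "CC0": LicenseTerms(False, True, True, False),
}


def licenses_allowing(commercial=False, modifications=False):
    """Return the license names that permit the requested kinds of use."""
    return [
        name
        for name, terms in CC_LICENSES.items()
        if (terms.commercial_use or not commercial)
        and (terms.modifications or not modifications)
    ]


# Licenses that allow both commercial use and changes to the content.
print(licenses_allowing(commercial=True, modifications=True))
# ['CC BY', 'CC BY-SA', 'CC0']
```

A table like this is only a quick filter. Always read the full license text before you use the content.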
Where to find license information
You can find the license details in various places depending on the source:
Websites: Look for "Terms of Service," "Terms of Use," or "Legal" links. These are often at the bottom of web pages. News sites, blogs, and websites of organizations usually have this information in their footer.
Datasets: Check the documentation or README files that came with the dataset. Popular platforms like Hugging Face, Kaggle, or GitHub ask for license information for datasets that you upload.
Published materials: Books, articles, and other published content often have copyright notices on their title pages or in the first section.
Open data portals: Government data, academic platforms, and open data projects usually have clear licensing information on their download pages.
If you can't find the license information, assume the content is copyrighted and that you will need permission to use it.
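If you keep a list of candidate sources, the "assume copyrighted" rule can be applied automatically. This is a minimal sketch under stated assumptions: the record format, the set `OPEN_LICENSES`, the function `triage_sources`, and the source names are all hypothetical, and you would adjust the accepted licenses to your own project.

```python
# Assumed set of license labels this hypothetical project accepts.
OPEN_LICENSES = {"CC0", "CC BY", "CC BY-SA", "Public domain"}


def triage_sources(sources):
    """Split source records into usable ones and ones needing permission.

    Each record is a dict with a "name" key and an optional "license" key.
    A missing, empty, or unrecognized license is treated as copyrighted.
    """
    usable, needs_permission = [], []
    for source in sources:
        license_name = (source.get("license") or "").strip()
        if license_name in OPEN_LICENSES:
            usable.append(source["name"])
        else:
            needs_permission.append(source["name"])
    return usable, needs_permission


# Hypothetical example records.
sources = [
    {"name": "gov-portal-texts", "license": "CC BY"},
    {"name": "local-news-site", "license": None},
    {"name": "folk-tales-blog"},
]
usable, needs_permission = triage_sources(sources)
print(usable)            # ['gov-portal-texts']
print(needs_permission)  # ['local-news-site', 'folk-tales-blog']
```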
When licenses don’t give you enough freedom
If you find some useful content, but the licensing is restrictive:
Contact the rights holder: Many content owners are happy to allow use of their content if it is for a non-commercial or research project. Give them a clear explanation of your project. Say how you'll use their content and how it will help the community. Tell them whether you intend to republish sentences, and about your plans to build voice technologies.
Document permissions: If you get permission, keep a written record of all agreements.
Inspiration is free!: If you don’t get permission, look for similar content with licenses that allow more freedom. Or create your own content. You are not allowed to copy copyrighted sentences, but you can get inspiration from them!
Consult legal experts: For larger projects or if you have to deal with commercial organizations, you may need to get legal advice.
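One possible way to keep a written record of permissions is a small structured log stored next to the signed letters or emails. The sketch below is only an illustration: the class `PermissionRecord`, its fields, the helper `records_to_json`, and the example values are hypothetical and not a required format.

```python
import json
from dataclasses import dataclass, asdict
from datetime import date


@dataclass
class PermissionRecord:
    """One written agreement with a rights holder (hypothetical fields)."""
    rights_holder: str
    work_title: str
    granted_on: str       # ISO date string
    permitted_uses: list  # e.g. ["extract sentences", "publish sentences"]
    credit_line: str      # how the rights holder asked to be credited
    evidence: str         # where the signed letter or email is stored


def records_to_json(records):
    """Serialize permission records to a JSON string, keeping non-ASCII text."""
    return json.dumps([asdict(r) for r in records], ensure_ascii=False, indent=2)


# Hypothetical example record.
record = PermissionRecord(
    rights_holder="Example Author",
    work_title="Example Book",
    granted_on=date(2024, 5, 1).isoformat(),
    permitted_uses=["extract sentences", "publish sentences"],
    credit_line="Example Author, 'Example Book'",
    evidence="agreements/example-author.pdf",
)
print(records_to_json([record]))
```

You could save the resulting text to a file in your project folder so that the agreements stay with the data.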
Case Study: CLEAR Global’s collection of books in Kanuri
Kanuri is a language spoken by around 10 million people across Nigeria, Niger, Chad, and Cameroon. Although so many people speak this language, there is very little digital content in Kanuri. Text resources are very limited, both online and in digital format.
When CLEAR Global set out to build voice technologies for Kanuri, we faced a major challenge: there was almost no digital text data available to work with. But we knew that several authors had written books in Kanuri. These books could be important sources for sentence collection.
We reached out to these authors through our Kanuri linguist and organized face-to-face meetings. In these meetings, we explained our project in simple terms: we needed their written material to develop AI applications that could speak Kanuri. We made it clear that we were interested in the language in their books rather than the stories.
"We don’t want to republish your books as literature," we explained. "We want to take the individual sentences and use them to teach computers how to speak Kanuri." We used simple language to help the authors understand how their work could help in the development of voice technologies for Kanuri.
We suggested a fair arrangement. Authors would share their books in digital format (Word files when available). We would take sentences from them, and in return, pay them a symbolic fee per word. We also promised to credit them when publishing the data.
This approach brought great results. With the authors' permission, we created the first openly available text corpus (collection of sentences) in Kanuri. It was published on Hugging Face with full credit to the original authors and their works. With this collection of texts, we could then create the first Kanuri voice corpus (collection of voice data). This made it possible for the first time for Kanuri speakers to access digital technologies in their language.
The reason this approach was successful was the human connection. We met face-to-face, we explained our goals in simple language, we offered fair payment, and we made sure the authors were credited. This built trust with the Kanuri authors. We could adapt this model for other languages with limited digital resources if there are printed materials available. This shows that, even when there is very little digital content, working with people in creative ways can give us access to useful language data.
Licensing your own collections of sentences
The sentences you collect are important resources for language technology development. Consider the following when you license your own collections:
Choose a suitable license: CC BY or CC BY-SA licenses are common for datasets intended for broad use. They make sure people are credited but allow flexibility.
Document clearly: Include a LICENSE.txt file with your dataset and add license information to the dataset metadata. Platforms like Hugging Face and GitHub allow you to add these automatically when you create your dataset.
Consider your goals:
For maximum impact and reuse, choose licenses like CC BY or CC0 that allow a lot of freedom.
To make sure new versions of the content stay available for use, use CC BY-SA.
If you are worried about commercial use, consider CC BY-NC.
Respect the rights of contributors: If your sentences come from contributors, make sure you have their permission to license the content. You can also credit them by name, but only if you have their permission and they don’t want to stay anonymous. For TTS projects, it’s very important that you don’t give away the contributors’ identities. This helps prevent misuse of the TTS models made with their data.
Include proper credits: When you publish your dataset, credit all sources that contributed to your collection.
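As a rough illustration of the documentation and credit steps above, the sketch below builds the text of a README file with a license field and a credits section. It assumes the hosting platform reads the license from a YAML block at the top of README.md, as Hugging Face dataset cards do. The function name `build_dataset_card`, the identifier `cc-by-4.0`, and the example names are illustrative assumptions, so check the platform's own list of license identifiers.

```python
def build_dataset_card(title, license_id, credits):
    """Return README text with a license field and a credits section.

    `credits` should list only sources and contributors who agreed to be named.
    """
    lines = [
        "---",
        f"license: {license_id}",  # assumed metadata key and identifier
        "---",
        "",
        f"# {title}",
        "",
        "## Credits",
        "",
    ]
    lines += [f"- {credit}" for credit in credits]
    return "\n".join(lines) + "\n"


# Hypothetical example values.
card = build_dataset_card(
    title="Example sentence collection",
    license_id="cc-by-4.0",
    credits=["Author A, 'Example Book One'", "Author B, 'Example Book Two'"],
)
print(card)
```

You could write the returned text to README.md and place the full license text, copied from the Creative Commons website, in LICENSE.txt.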
Publishing your collections of sentences with clear licensing helps other developers. It also helps to grow the amount of resources for your language. This is especially important for languages with limited digital resources.
Remember that licensing choices can impact how widely your data can be used to develop voice technologies. Licenses that allow more freedom generally result in greater use and impact.