# 4.2 License issues

You need a good understanding of licensing before you collect sentences from existing sources. If you use content without proper permission, you may run into legal problems. The license of your sources may also limit how you can use or share your voice datasets[^1].

### Types of license

These are the main categories of content license:

* **Closed/proprietary licenses:** These restrict rights to use the content. You will normally have to get explicit permission to use it. Examples include copyrighted news articles, books, and most commercial content.
* **Open licenses:** These allow various forms of reuse, with different levels of freedom. The most common open licenses are [Creative Commons (CC)](#user-content-fn-2)[^2] licenses.
* [**Public domain:**](#user-content-fn-3)[^3] This is content with no copyright restrictions. Anyone can freely use it without permission.

### Understanding Creative Commons licenses

Creative Commons (CC) licenses are a standardized way for content creators to give permission for their work to be used. Here are the most common types of CC license:

* **CC BY** (Attribution): You can use, change, and redistribute the content, even for commercial use. But you must credit the original creator.
* **CC BY-SA** (Attribution-ShareAlike): Similar to CC BY, but derivative works based on the original content also have to use the same license.
* **CC BY-NC** (Attribution-NonCommercial): Allows reuse if you credit the creator, but commercial use is not allowed.
* **CC BY-ND** (Attribution-NoDerivatives): Allows redistribution if you credit the creator, but you are not allowed to make changes.
* **CC0** (Public Domain Dedication): Creator gives up all rights and places the work in the public domain.

For more detailed information about Creative Commons licenses, go to the [Creative Commons website](https://creativecommons.org/licenses/).
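When you compare several sources, it can help to write the terms of each license down as data. The sketch below encodes the five license types listed above as a small lookup table. The names `LicenseTerms`, `CC_TERMS`, and `fits_project` are hypothetical, and the values are a simplified summary rather than legal advice. In particular, it does not decide whether splitting a text into sentences counts as "making changes" under a NoDerivatives license.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseTerms:
    attribution: bool  # you must credit the original creator
    commercial: bool   # commercial use is allowed
    derivatives: bool  # you may change the content
    share_alike: bool  # derivative works must use the same license


# Simplified summary of the license types described above (illustrative only).
CC_TERMS = {
    "CC BY":    LicenseTerms(attribution=True, commercial=True, derivatives=True, share_alike=False),
    "CC BY-SA": LicenseTerms(attribution=True, commercial=True, derivatives=True, share_alike=True),
    "CC BY-NC": LicenseTerms(attribution=True, commercial=False, derivatives=True, share_alike=False),
    "CC BY-ND": LicenseTerms(attribution=True, commercial=True, derivatives=False, share_alike=False),
    "CC0":      LicenseTerms(attribution=False, commercial=True, derivatives=True, share_alike=False),
}


def fits_project(license_name: str, commercial: bool, modifies: bool) -> bool:
    """Return True if the license terms do not rule out the planned use."""
    terms = CC_TERMS[license_name]
    if commercial and not terms.commercial:
        return False
    if modifies and not terms.derivatives:
        return False
    return True


# Example: a commercial project cannot rely on CC BY-NC content.
print(fits_project("CC BY-NC", commercial=True, modifies=False))
```

A table like this is only a first filter. Always read the actual license text of each source before you use it.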

### Where to find license information

You can find the license details in various places depending on the source:

* **Websites:** Look for "Terms of Service," "Terms of Use," or "Legal" links. These are often at the bottom of web pages. News sites, blogs, and websites of organizations usually have this information in their footer.
* **Datasets:** Check the documentation, README files, or metadata[^4] that came with the dataset. Popular platforms like [Hugging Face](#user-content-fn-5)[^5], Kaggle, or GitHub ask for license information for datasets that you upload.
* **Published materials:** Books, articles, and other published content often have copyright notices on their title pages or in the first section.
* **Open data portals:** Government data, academic platforms, and open data projects usually have clear licensing information on their download pages.

If you can't find the license information, it’s best to assume the content is copyrighted and you will need permission.
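The "assume it is copyrighted" rule can also be applied when you keep a list of candidate sources. The following is a minimal sketch, assuming each source is a dictionary with a `name` and an optional `license` field. The field names, the `OPEN_LICENSES` set, and the `triage` function are hypothetical choices for illustration.

```python
# License labels treated as open in this sketch (an assumption, adjust as needed).
OPEN_LICENSES = {"CC BY", "CC BY-SA", "CC BY-NC", "CC BY-ND", "CC0", "Public domain"}


def triage(sources):
    """Split sources into openly licensed ones and ones that need permission.

    A source with a missing, empty, or unrecognized license is treated as
    copyrighted, so it lands in the "needs permission" list.
    """
    open_licensed, needs_permission = [], []
    for source in sources:
        license_name = (source.get("license") or "").strip()
        if license_name in OPEN_LICENSES:
            open_licensed.append(source["name"])
        else:
            needs_permission.append(source["name"])
    return open_licensed, needs_permission


# Hypothetical example sources.
candidates = [
    {"name": "government health leaflets", "license": "CC BY"},
    {"name": "local news site", "license": None},
    {"name": "community blog"},
]
print(triage(candidates))
```

An open license does not mean every use is allowed. Sources in the first list still need their terms checked against your project.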

### When licenses don’t give you enough freedom

If you find some useful content, but the licensing is restrictive:

1. **Contact the rights holder:** Many content owners are happy to allow use of their content if it is for a non-commercial or research project. Give them a clear explanation of your project: how you will use their content, whether you intend to republish sentences, what voice technologies you plan to build, and how the project will help the community.
2. **Document permissions:** If you get permission, keep a written record of all agreements.
3. **Inspiration is free:** If you don’t get permission, look for similar content under a license that allows more freedom, or create your own content. You are not allowed to take copyrighted sentences, but you can get inspiration from them!
4. **Consult legal experts:** For larger projects or if you have to deal with commercial organizations, you may need to get legal advice.

| <img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcg5NtNYlXvCfn0rvwJsLMtBRethQDyY91wSQ4A0x7HgV0_xvPbTMOyxTT86XhR1expXnNUPK9G94V6MnyhaAgELsOb2rxooMKX2eUYyAQyWQV2UiGu9I82Vizm8ZzHK9HDZyPK?key=LjOaNqlneHjM8MYR-1Jh9w" alt=":magnifying_glass:" data-size="line">  <mark style="color:blue;">**Case Study: CLEAR Global’s collection of books in Kanuri**</mark> |
| --- |
| <p>Kanuri is a language spoken by around 10 million people across Nigeria, Niger, Chad, and Cameroon. Although so many people speak this language, there is very little digital content in Kanuri. Text resources are very limited, both online and in digital format.</p><p></p><p>When CLEAR Global set out to build <a data-footnote-ref href="#user-content-fn-6">voice technology</a> for Kanuri, we faced a major challenge: there was almost no digital text data available to work with. But we knew that several authors had written books in Kanuri. These books could be important sources for sentence collection.</p><p></p><p>We reached out to these authors through our Kanuri linguist and organized face-to-face meetings. In these meetings, we explained our project in simple terms: we needed their written material to develop AI applications that could speak Kanuri. We made it clear that we were interested in the language in their books rather than the stories.</p><p></p><p>"We don’t want to republish your books as literature," we explained. "We want to take the individual sentences and use them to teach computers how to speak Kanuri." We used simple language to help the authors to understand how their work could help in the development of <a data-footnote-ref href="#user-content-fn-7">language technology.</a></p><p></p><p>We suggested a fair arrangement. Authors would share their books in digital format (Word files when available). We would take sentences from them, and in return, pay them a symbolic fee per word. We also promised to credit them when publishing the data.</p><p></p><p>This idea brought great results. With the authors' permission, we created the first openly available text <a data-footnote-ref href="#user-content-fn-8">corpus</a> (collection of sentences) in Kanuri. It was <a href="https://huggingface.co/datasets/CLEAR-Global/kanuri-books-corpus">published on Hugging Face</a> with full credit to the original authors and their works. With this collection of texts, we could then create the first Kanuri voice corpus (collection of voice data). This made it possible for the first time for Kanuri speakers to access digital technologies in their language.</p><p></p><p>The reason this approach was successful was the human connection. We met face-to-face, we explained our goals in simple language, we offered fair payment, and we made sure the authors were credited. This built trust with the Kanuri authors. We could adapt this model for other languages with limited digital resources if there are printed materials available. This shows that, even when there is very little digital content, working with people in creative ways can give us access to useful language data.</p> |

### Licensing your own collections of sentences

The sentences you collect are important resources for language technology development. Consider the following when you license your own collections:

1. **Choose a suitable license:** CC BY or CC BY-SA licenses are common for datasets intended for broad use. They make sure people are credited but allow flexibility.
2. **Document clearly:** Include a LICENSE.txt file with your dataset and add license information to dataset metadata. Platforms like Hugging Face and GitHub allow you to add these automatically during the creation of your dataset.
3. **Consider your goals:**
   * For maximum impact and reuse, choose licenses like CC BY or CC0 that allow a lot of freedom
   * To make sure new versions of the content stay openly available, use CC BY-SA
   * If you are worried about commercial use, consider CC BY-NC

4. **Respect the rights of contributors:** If your sentences come from [community members](#user-content-fn-9)[^9], make sure you have their permission to license the content. You can also credit them by name, but only if they have given permission and do not want to stay anonymous. For text-to-speech (TTS) projects, it’s very important that you don’t give away the contributors’ identities. This helps to prevent misuse of the TTS models made with their data.
5. **Include proper credits:** When you publish your dataset, credit all sources that contributed to your collection.
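The documentation and credit steps above can be sketched as a short script. This is a minimal example, assuming the dataset lives in a local folder and that the hosting platform reads a `license:` field from YAML front matter at the top of `README.md` (this is the form Hugging Face dataset cards use, but check your platform's documentation). The function name `write_license_files`, the folder name, and the credit entries are hypothetical.

```python
from pathlib import Path


def write_license_files(dataset_dir, license_id, license_name, credits):
    """Write LICENSE.txt and a README.md with license metadata and credits."""
    root = Path(dataset_dir)
    root.mkdir(parents=True, exist_ok=True)

    # Short notice only. For a real release, copy the full legal text
    # of the license from the Creative Commons website.
    (root / "LICENSE.txt").write_text(
        f"This dataset is released under the {license_name} license.\n",
        encoding="utf-8",
    )

    credit_lines = "\n".join(f"- {credit}" for credit in credits)
    readme = (
        "---\n"
        f"license: {license_id}\n"  # machine-readable license metadata
        "---\n\n"
        "# Sentence collection\n\n"
        f"License: {license_name}\n\n"
        "## Credits\n\n"
        f"{credit_lines}\n"
    )
    (root / "README.md").write_text(readme, encoding="utf-8")
    return root


# Hypothetical usage: only list contributors who agreed to be named.
write_license_files(
    "my-sentence-corpus",
    license_id="cc-by-4.0",
    license_name="CC BY 4.0",
    credits=["Author A, 'Book title' (used with permission)"],
)
```

Keeping the license in both a human-readable file and machine-readable metadata means that people browsing the dataset and tools indexing it see the same terms.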

Publishing your collections of sentences with clear licensing does more than help other developers. It also helps to grow the resources available for your language. This is especially important for [low-resource languages](#user-content-fn-10)[^10].

Remember that licensing choices can impact how widely your data can be used to develop voice technologies. Licenses that allow more freedom generally result in greater use and impact.

[^1]: **Dataset:** A collection of information that has been organized for use. A **voice dataset** is a collection of voice recordings (paired with transcriptions) with additional information (metadata), such as the gender and age of the person recording. The metadata shows how the dataset is constructed and helps to avoid bias. A voice dataset is used in research and for training or improving voice models.

[^2]: **Creative Commons (CC) licenses:** A set of standard copyright licenses that let the creator of a piece of work choose which rights they want to keep and which they will give up.

[^3]: **Domain:** The subject area or setting in which we use language. Examples are healthcare, farming, or education. Each domain has its own special terms and language patterns.

[^4]: **Metadata:** Extra information that gives some background to a dataset or parts of a dataset. Examples are demographics of the speakers (age, gender, accent), recording conditions (microphone type, level of background noise), or language-related details (dialect, speed of speaking, emotional tone).

[^5]: **Hugging Face:** An online platform where researchers and developers share AI models, datasets, and applications.

[^6]: **Voice technology:** Language technology that processes or generates spoken words.

[^7]: **Language Technology (LT):** Technologies that focus on human language, including both spoken and written language. They can process, understand, and generate language. Examples are the tools on your phone or computer that understand and generate words, like translation apps or voice assistants. They allow us to communicate and interact with our devices with language. When you talk to a virtual assistant like Siri or Alexa, or your phone suggests the next word in a message, that’s because of language technology. It makes technology more accessible and user-friendly.

[^8]: **Corpus:** A collection of text or speech data that you can use to train language technology.

[^9]: **Community members:** People at the center of a project who record and validate data. They also give feedback to the team and to other members.

[^10]: **Low-resource language:** A language that has limited written or recorded materials and is rarely found in digital tools, data, or technology. This means that technologies like speech recognition or machine translation are not available for the language and are difficult to build. This may be due to a lack of written content, digital resources (like websites or videos), or support from organizations that develop language technology.

