CoquiTTS is a Python text-to-speech synthesis library. It uses cutting-edge models to transform any text into natural-sounding speech. CoquiTTS can be used to create audio content, improve accessibility, and add voice interactivity to your applications. In this article, you will learn how to install and use CoquiTTS in Python.
Speech synthesis technology has advanced significantly in recent years, driven by progress in artificial intelligence and machine learning. These advances have enabled the generation of increasingly natural-sounding speech. The technology can benefit a wide range of applications, but it is especially important for low-resource languages struggling to preserve their linguistic heritage.
The Coqui AI team created CoquiTTS, an open-source text-to-speech library for Python. The software is designed with the needs of low-resource languages in mind, making it an effective tool for language preservation and revitalization efforts around the world.
CoquiTTS: A Powerful Python Text-to-Speech Synthesis Tool
CoquiTTS is a Python text-to-speech synthesis library that uses neural networks to generate speech from text. One of its core architectures is Tacotron 2, a deep neural network devised by Google researchers for speech synthesis. CoquiTTS builds on Tacotron 2 with faster and more efficient inference, as well as improved accessibility for Python developers and users.
One of CoquiTTS's main advantages is its accuracy. Its neural networks are trained on large corpora of speech data, allowing it to generate speech that sounds more natural than many competing speech synthesis programs. Furthermore, CoquiTTS is highly customizable, allowing users to tailor parameters such as speaking rate, voice pitch, and volume to their specific requirements.
CoquiTTS is also faster than many other speech synthesis tools. It can generate speech in real time, making it well suited for voice assistants, text-to-speech systems, and interactive voice response (IVR) systems. This performance is achieved with neural vocoding, in which a compact neural network converts the model's intermediate acoustic features into a waveform, resulting in faster and more efficient synthesis.
Empowering Low-Resource Languages with CoquiTTS
Speech synthesis technology has the potential to be useful for a wide range of applications, but it is especially important for low-resource languages. Due to globalization, urbanization, and the dominance of more widely spoken languages, these languages frequently face challenges in preserving and maintaining their linguistic heritage.
CoquiTTS provides an effective way to address these challenges by supporting language preservation and revitalization efforts for low-resource languages. It can be used to develop speech synthesizers for such languages, allowing speakers to access information and communicate with others more easily. CoquiTTS can also be used to build speech interfaces for mobile devices, smart speakers, and home appliances, making technology more accessible to speakers of low-resource languages.
CoquiTTS has already been applied successfully to a number of languages. For Kinyarwanda, a Bantu language spoken in Rwanda and neighboring countries that has struggled to preserve its linguistic heritage, CoquiTTS was used to build a speech synthesizer. The Kinyarwanda Speech Synthesis Project gathered Kinyarwanda speech samples, trained the neural network used by CoquiTTS, and produced a high-quality speech synthesizer that can help Kinyarwanda speakers in a range of applications.
Another successful CoquiTTS deployment is in Ayapaneco, an indigenous Mexican language on the verge of extinction. The Coqui AI team worked with Ayapaneco language advocates to create a speech synthesizer using CoquiTTS, improving the language's visibility and accessibility to a wider audience.
To use CoquiTTS in Python, you can follow these steps:
Installing CoquiTTS using pip:
pip install coqui-tts
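Optionally, you can install into a fresh virtual environment first so that CoquiTTS and its dependencies stay isolated from the rest of your system (standard Python tooling, nothing specific to CoquiTTS):
python3 -m venv coqui-env
source coqui-env/bin/activate
Then run the pip command above inside the activated environment.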
If you plan to code or train models, clone TTS and install it locally.
git clone https://github.com/coqui-ai/TTS
cd TTS
pip install -e .[all,dev,notebooks]  # Select the relevant extras
If you are on Ubuntu (Debian), you can also run the following commands to install the system dependencies and the package:
$ make system-deps # intended to be used on Ubuntu (Debian). Let us know if you have a different OS.
$ make install
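Whichever installation method you use, a quick way to check that it worked is to import the library and list a few of the available pretrained models (a simple sanity check, not a required step):
python3 -c "from TTS.api import TTS; print(TTS.list_models()[:3])"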
Docker Image
You can also try TTS without installing it by using the Docker image. Run the following commands to list the available models or start a demo server:
docker run --rm -it -p 5002:5002 --entrypoint /bin/bash ghcr.io/coqui-ai/tts-cpu
python3 TTS/server/server.py --list_models #To get the list of available models
python3 TTS/server/server.py --model_name tts_models/en/vctk/vits # To start a server
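Once the server is running, the demo interface should be reachable at http://localhost:5002 in your browser (the port published by the docker run command above).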
Synthesizing speech with TTS
from TTS.api import TTS
# Running a multi-speaker and multi-lingual model
# List available 🐸TTS models and choose the first one
model_name = TTS.list_models()[0]
# Init TTS
tts = TTS(model_name)
# Run TTS
# ❗ Since this model is multi-speaker and multi-lingual, we must set the target speaker and the language
# Text to speech with a numpy output
wav = tts.tts("This is a test! This is also a test!!", speaker=tts.speakers[0], language=tts.languages[0])
# Text to speech to a file
tts.tts_to_file(text="Hello world!", speaker=tts.speakers[0], language=tts.languages[0], file_path="output.wav")
# Running a single speaker model
# Init TTS with the target model name
tts = TTS(model_name="tts_models/de/thorsten/tacotron2-DDC", progress_bar=False, gpu=False)
# Run TTS
tts.tts_to_file(text="Ich bin eine Testnachricht.", file_path=OUTPUT_PATH)
# Example voice cloning with YourTTS in English, French and Portuguese:
tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts", progress_bar=False, gpu=True)
tts.tts_to_file("This is voice cloning.", speaker_wav="my/cloning/audio.wav", language="en", file_path="output.wav")
tts.tts_to_file("C'est le clonage de la voix.", speaker_wav="my/cloning/audio.wav", language="fr-fr", file_path="output.wav")
tts.tts_to_file("Isso é clonagem de voz.", speaker_wav="my/cloning/audio.wav", language="pt-br", file_path="output.wav")
# Example voice conversion converting speaker of the `source_wav` to the speaker of the `target_wav`
tts = TTS(model_name="voice_conversion_models/multilingual/vctk/freevc24", progress_bar=False, gpu=True)
tts.voice_conversion_to_file(source_wav="my/source.wav", target_wav="my/target.wav", file_path="output.wav")
# Example voice cloning by a single speaker TTS model combining with the voice conversion model. This way, you can
# clone voices by using any model in 🐸TTS.
tts = TTS("tts_models/de/thorsten/tacotron2-DDC")
tts.tts_with_vc_to_file(
"Wie sage ich auf Italienisch, dass ich dich liebe?",
speaker_wav="target/speaker.wav",
file_path="ouptut.wav"
)
# Example text to speech using [🐸Coqui Studio](https://coqui.ai) models. You can use all of your available speakers in the studio.
# [🐸Coqui Studio](https://coqui.ai) API token is required. You can get it from the [account page](https://coqui.ai/account).
# You should set the `COQUI_STUDIO_TOKEN` environment variable to use the API token.
# If you have a valid API token set you will see the studio speakers as separate models in the list.
# The name format is coqui_studio/en/<studio_speaker_name>/coqui_studio
models = TTS().list_models()
# Init TTS with the target studio speaker
tts = TTS(model_name="coqui_studio/en/Torcull Diarmuid/coqui_studio", progress_bar=False, gpu=False)
# Run TTS
tts.tts_to_file(text="This is a test.", file_path=OUTPUT_PATH)
# Run TTS with emotion and speed control
tts.tts_to_file(text="This is a test.", file_path=OUTPUT_PATH, emotion="Happy", speed=1.5)
Command line tts
Single Speaker Models
- List provided models:
$ tts --list_models
- Get model info (for both tts_models and vocoder_models):
Query by type/name: model_info_by_name takes the model name exactly as it appears in the output of --list_models.
$ tts --model_info_by_name "<model_type>/<language>/<dataset>/<model_name>"
For example:
$ tts --model_info_by_name tts_models/tr/common-voice/glow-tts
$ tts --model_info_by_name vocoder_models/en/ljspeech/hifigan_v2
Query by type/idx: the model_query_idx is the corresponding index from the --list_models output.
$ tts --model_info_by_idx "<model_type>/<model_query_idx>"
For example:
$ tts --model_info_by_idx tts_models/3
- Run TTS with default models:
$ tts --text "Text for TTS" --out_path output/path/speech.wav
- Run a TTS model with its default vocoder model:
$ tts --text "Text for TTS" --model_name "<model_type>/<language>/<dataset>/<model_name>" --out_path output/path/speech.wav
For example:
$ tts --text "Text for TTS" --model_name "tts_models/en/ljspeech/glow-tts" --out_path output/path/speech.wav
- Run with specific TTS and vocoder models from the list:
$ tts --text "Text for TTS" --model_name "<model_type>/<language>/<dataset>/<model_name>" --vocoder_name "<model_type>/<language>/<dataset>/<model_name>" --out_path output/path/speech.wav
For example:
$ tts --text "Text for TTS" --model_name "tts_models/en/ljspeech/glow-tts" --vocoder_name "vocoder_models/en/ljspeech/univnet" --out_path output/path/speech.wav
- Run your own TTS model (Using Griffin-Lim Vocoder):
$ tts --text "Text for TTS" --model_path path/to/model.pth --config_path path/to/config.json --out_path output/path/speech.wav
- Run your own TTS and Vocoder models:
$ tts --text "Text for TTS" --model_path path/to/model.pth --config_path path/to/config.json --out_path output/path/speech.wav
--vocoder_path path/to/vocoder.pth --vocoder_config_path path/to/vocoder_config.json
Multi-speaker Models
- List the available speakers and choose a <speaker_id> among them:
$ tts --model_name "<language>/<dataset>/<model_name>" --list_speaker_idxs
- Run the multi-speaker TTS model with the target speaker ID:
$ tts --text "Text for TTS." --out_path output/path/speech.wav --model_name "<language>/<dataset>/<model_name>" --speaker_idx <speaker_id>
- Run your own multi-speaker TTS model:
$ tts --text "Text for TTS" --out_path output/path/speech.wav --model_path path/to/model.pth --config_path path/to/config.json --speakers_file_path path/to/speaker.json --speaker_idx <speaker_id>
This article introduced CoquiTTS and showed how to install and use it for text-to-speech in Python. We hope you found it helpful. Please feel free to share your thoughts and feedback in the comment section below.