Commit ce979507 authored by Liisa Rätsep's avatar Liisa Rätsep

multispeaker synthesis

parent 08f63b41
# Estonian text-to-speech
Scripts for Estonian multispeaker speech synthesis from text file input.
Speech synthesis was developed in collaboration with the [Estonian Language Institute](http://portaal.eki.ee/).
Estonian text-to-speech can also be used via our [web demo](https://www.neurokone.ee). The components
to run the same models via API can be found [here](https://github.com/TartuNLP/text-to-speech-api)
and [here](https://github.com/TartuNLP/text-to-speech-worker).
## Requirements and installation
These instructions have been tested on Ubuntu. The code runs on both CPUs and GPUs, but synthesis is considerably
faster on a GPU.
- Make sure the following prerequisites are installed:
  - [CUDA](https://developer.nvidia.com/cuda-downloads) if you use a GPU
  - [Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html)
  - GNU Compiler Collection (run `sudo apt install build-essential`)
- Clone this repository with submodules:
```commandline
git clone --recurse-submodules https://koodivaramu.eesti.ee/tartunlp/text-to-speech
```
- Create and activate a Conda environment. Replace the environment file in the command below with `environments/environment.gpu.yml` if you want to use a GPU (a quick sanity check is shown after this list).
```commandline
cd text-to-speech
conda env create -f environments/environment.yml
conda activate transformer-tts
python -c 'import nltk; nltk.download("punkt")'
```
- Download our [TransformerTTS model](https://github.com/TartuNLP/TransformerTTS/releases/tag/v1.1.0-beta.2) and
  place it inside the `models/` directory.
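
After installation, a quick sanity check can confirm that TensorFlow detects the GPU (when the GPU environment is used) and that the NLTK `punkt` data was downloaded. This is an optional sketch, not a script shipped with the repository:

```python
# Optional check; assumes the "transformer-tts" Conda environment is active.
import nltk
import tensorflow as tf

# An empty list means TensorFlow will fall back to the CPU.
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))

# Raises LookupError if the punkt tokenizer data is missing.
nltk.data.find("tokenizers/punkt")
print("NLTK punkt data found.")
```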
## Usage
A text file can be synthesized with the following command. Currently the script can only read plain text files and
saves the output in `.wav` format.
```commandline
python synthesizer.py --model models/albert --vocoder models/hifigan/vctk test.txt test.wav
```
More information about using the script is available with the `--help` flag:
```
synthesizer.py [-h] --model MODEL --vocoder VOCODER [--speed SPEED] [--speaker-id SPEAKER_ID] input output
positional arguments:
  input                    Input text file to synthesize.
  output                   Output .wav file path.
optional arguments:
  -h, --help               show this help message and exit
  --model MODEL            The directory of the TTS model weights (must contain a .hdf5 and config.yaml file)
  --vocoder VOCODER        The directory that contains the vocoder model.
  --speed SPEED            Output speed multiplier.
  --speaker-id SPEAKER_ID  Speaker ID for multispeaker models.
```
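
The `Synthesizer` class in `synthesizer.py` can also be used directly from Python. A minimal sketch, assuming the same model and vocoder directories as in the command above:

```python
import numpy as np
from scipy.io import wavfile

# Run from the repository root so that the bundled submodules resolve on sys.path.
from synthesizer import Synthesizer

# Paths follow the CLI example above; adjust them to wherever the models were unpacked.
synthesizer = Synthesizer(tts_model_path="models/albert",
                          vocoder_path="models/hifigan/vctk")

waveform = synthesizer.synthesize("Tere! See on kõnesünteesi näide.",
                                  speed=1, speaker_id=0)

# The command-line script writes 22050 Hz 16-bit PCM, so the same is done here.
wavfile.write("example.wav", 22050, waveform.astype(np.int16))
```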
@@ -4,35 +4,37 @@ Scripts for Estonian multispeaker speech synthesis from text file input.
Speech synthesis was developed in collaboration with the [Estonian Language Institute](http://portaal.eki.ee/).
Estonian text-to-speech can also be used via our [web demo](https://www.neurokone.ee). The components
to run the same models via API can be found [here](https://github.com/TartuNLP/text-to-speech-api)
and [here](https://github.com/TartuNLP/text-to-speech-worker).
## Requirements and installation
The following installation instructions have been tested on Ubuntu. The code is both CPU and GPU compatible, but
synthesis is considerably faster with GPUs.
- Make sure you have the following prerequisites installed:
  - [CUDA](https://developer.nvidia.com/cuda-downloads) if you use a GPU
  - Conda (see https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html)
  - GNU Compiler Collection (run `sudo apt install build-essential`)
- Clone with submodules
```commandline
git clone --recurse-submodules https://koodivaramu.eesti.ee/tartunlp/text-to-speech
```
- Create and activate a Conda environment with all dependencies. Use `environments/environment.gpu.yml` instead if you
use a GPU.
```commandline
cd text-to-speech
conda env create -f environments/environment.yml
conda activate transformer-tts
python -c 'import nltk; nltk.download("punkt")'
```
- Download our [TransformerTTS model](https://github.com/TartuNLP/TransformerTTS/releases/tag/v1.1.0-beta.2) and
  place it inside the `models/` directory.
## Usage
@@ -40,22 +42,23 @@ python -c 'import nltk; nltk.download("punkt")'
A file can be synthesized with the following command. Currently, only plain text files (utf-8) are supported and the
audio is saved in `.wav` format.
```commandline
python synthesizer.py --model models/albert --vocoder models/hifigan/vctk test.txt test.wav
```
More info about script usage can be found with the `--help` flag:
```
synthesizer.py [-h] --model MODEL --vocoder VOCODER [--speed SPEED] [--speaker-id SPEAKER_ID] input output
positional arguments:
  input                    Input text file to synthesize.
  output                   Output .wav file path.
optional arguments:
  -h, --help               show this help message and exit
  --model MODEL            The directory of the TTS model weights (must contain a .hdf5 and config.yaml file)
  --vocoder VOCODER        The directory that contains the vocoder model.
  --speed SPEED            Output speed multiplier.
  --speaker-id SPEAKER_ID  Speaker ID for multispeaker models.
```
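
For multispeaker models, one simple way to compare voices is to call the script once per speaker ID. A sketch using only the documented flags; the ID range and the example paths are placeholders, not values guaranteed by the repository:

```python
import subprocess

# Hypothetical ID range; valid IDs depend on the multispeaker model you downloaded.
for speaker_id in range(3):
    subprocess.run(
        ["python", "synthesizer.py",
         "--model", "models/albert",
         "--vocoder", "models/hifigan/vctk",
         "--speaker-id", str(speaker_id),
         "test.txt", f"speaker_{speaker_id}.wav"],
        check=True,
    )
```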
Subproject commit 4a624d1054c4da34e3544b87480872e3243845d6
Subproject commit 52eb984cf1eb3b6a745e573cf437bf648e5af025
### Transformer-TTS required configuration
wav_directory: ''
metadata_path: ''
log_directory: ''
train_data_directory: ''
data_config: 'TransformerTTS/config/data_config_est.yaml'
aligner_config: 'TransformerTTS/config/aligner_config.yaml'
tts_config: 'TransformerTTS/config/tts_config_est.yaml'
data_name: ''
speakers:
  albert:
    config_path: config.yaml
    checkpoint_path: models/tts/albert
    vocoder_path: models/hifigan/vctk
  kalev:
    config_path: config.yaml
    checkpoint_path: models/tts/kalev
    vocoder_path: models/hifigan/vctk
  kylli:
    config_path: config.yaml
    checkpoint_path: models/tts/kylli
    vocoder_path: models/hifigan/ljspeech
  mari:
    config_path: config.yaml
    checkpoint_path: models/tts/mari
    vocoder_path: models/hifigan/ljspeech
  meelis:
    config_path: config.yaml
    checkpoint_path: models/tts/meelis
    vocoder_path: models/hifigan/vctk
  vesta:
    config_path: config.yaml
    checkpoint_path: models/tts/vesta
    vocoder_path: models/hifigan/vctk
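
For reference, the speaker table above belongs to the older `--config`/`--speaker` interface that this commit replaces: it maps a speaker name to its checkpoint and vocoder directories. A sketch of how such a file could be read with PyYAML, mirroring the removed code in `synthesizer.py` (the new script no longer does this):

```python
import yaml
from yaml.loader import SafeLoader

# Look up the paths configured for one speaker, e.g. "albert".
with open("config.yaml", "r", encoding="utf-8") as f:
    speaker_cfg = yaml.load(f, Loader=SafeLoader)["speakers"]["albert"]

print(speaker_cfg["checkpoint_path"])  # models/tts/albert
print(speaker_cfg["vocoder_path"])     # models/hifigan/vctk
```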
@@ -8,6 +8,7 @@ dependencies:
- python==3.7.10
- matplotlib==3.2.2
- librosa==0.7.1
- numba==0.48
- numpy==1.17.4
- ruamel.yaml==0.16.6
- tensorflow-gpu=2.2.0
@@ -7,6 +7,7 @@ dependencies:
- python==3.7.10
- matplotlib==3.2.2
- librosa==0.7.1
- numba==0.48
- numpy==1.17.4
- ruamel.yaml==0.16.6
- tensorflow=2.2.0
@@ -5,32 +5,31 @@ import re
import numpy as np
from scipy.io import wavfile
from tqdm import tqdm
import yaml
from yaml.loader import SafeLoader
from nltk import sent_tokenize
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
sys.path.append(f'{os.path.dirname(os.path.realpath(__file__))}/TransformerTTS')
from TransformerTTS.utils.config_manager import Config
from TransformerTTS.model.models import ForwardTransformer
from vocoding.predictors import HiFiGANPredictor
from tts_preprocess_et.convert import convert_sentence
class Synthesizer:
    def __init__(self, tts_model_path: str, vocoder_path: str):
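        # 10000 zero samples (~0.45 s at the 22050 Hz output rate used below) serve as a pause between sentences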
        self.silence = np.zeros(10000, dtype=np.int16)
        self.model = ForwardTransformer.load_model(tts_model_path)
        self.vocoder = HiFiGANPredictor.from_folder(vocoder_path)
        print("Transformer-TTS initialized.")
    def synthesize(self, text: str, speed: float = 1, speaker_id: int = 0):
"""Convert text to speech waveform.
Args:
text (str) : Input text to be synthesized
speed (float)
speaker_id (int)
"""
def clean(sent):
@@ -63,10 +62,9 @@ class Synthesizer:
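        # Synthesize sentence by sentence: predict a mel spectrogram, run the vocoder, and prepend a short pause.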
        for i, sentence in enumerate(tqdm(sentences, unit="sentence")):
            sentence = clean(sentence)
            out = self.model.predict(sentence, speed_regulator=speed, speaker_id=speaker_id)
            waveform = self.vocoder([out['mel'].numpy().T])
            waveforms.append(self.silence)
            waveforms.append(waveform[0])
        waveform = np.concatenate(waveforms)
@@ -82,21 +80,21 @@ if __name__ == '__main__':
help="Input text file to synthesize.")
parser.add_argument('output', type=FileType('w'),
help="Output .wav file path."),
parser.add_argument('--speaker', type=str, required=True,
help="The name of the speaker to use for synthesis.")
parser.add_argument('--model', required=True,
help="The directory of the TTS model weights (must contain a .hdf5 and config.yaml file)")
parser.add_argument('--vocoder', required=True,
help="The directory that contains the vocoder model.")
parser.add_argument('--speed', type=int, default=1,
help="Output speed multiplier.")
parser.add_argument('--config', type=FileType('r'), default='config.yaml',
help="The config file to load.")
parser.add_argument('--speaker-id', type=int, default=0,
help="Speaker ID for multispeaker models.")
args = parser.parse_known_args()[0]
with open(args.config.name, 'r', encoding='utf-8') as f:
config = yaml.load(f, Loader=SafeLoader)['speakers'][args.speaker]
synthesizer = Synthesizer(**config)
synthesizer = Synthesizer(tts_model_path=args.model,
vocoder_path=args.vocoder)
with open(args.input.name, 'r', encoding='utf-8') as f:
text = f.read()
waveform = synthesizer.synthesize(text, speed=args.speed)
waveform = synthesizer.synthesize(text, speed=args.speed, speaker_id=args.speaker_id)
wavfile.write(args.output.name, 22050, waveform.astype(np.int16))