Commit 08f63b41 authored by Liisa Rätsep

transformer-tts file synthesis

parent 81e8485e
/.idea/
config.json
server.wsgi
*.pth
__pycache__/
tts_preprocess_et/
TransformerTTS
*.iml
*.iml
deepvoice3_pytorch/
[submodule "deepvoice3_pytorch"]
path = deepvoice3_pytorch
url = https://github.com/TartuNLP/deepvoice3_pytorch
[submodule "TransformerTTS"]
path = TransformerTTS
url = https://github.com/TartuNLP/TransformerTTS.git
[submodule "tts_preprocess_et"]
path = tts_preprocess_et
url = https://github.com/TartuNLP/tts_preprocess_et.git
# Estonian Text-to-Speech
Scripts for Estonian multispeaker speech synthesis from text file input. The code contains submodules that refer to the following speech synthesis components:
- [Deep Voice 3 adaptation for Estonian](https://github.com/TartuNLP/deepvoice3_pytorch)
- [Estonian text-to-speech preprocessing](https://github.com/TartuNLP/tts_preprocess_et)
Scripts for Estonian multispeaker speech synthesis from text file input.
Speech synthesis was developed in collaboration with the [Estonian Language Institute](http://portaal.eki.ee/).
@@ -25,13 +23,11 @@ git clone --recurse-submodules https://koodivaramu.eesti.ee/tartunlp/text-to-spe
- Create and activate the Conda environment:
```
cd text-to-speech
conda env create -f environment.yml
conda activate deepvoice
pip install --no-deps -e "deepvoice3_pytorch/[bin]"
python -c 'import nltk; nltk.download("punkt"); nltk.download("cmudict")'
conda env create -f environments/environment.yml
conda activate transformer-tts
python -c 'import nltk; nltk.download("punkt")'
```
- Download our [Deep Voice 3 model](https://github.com/TartuNLP/deepvoice3_pytorch/releases/kratt-v1.2) and place
it inside the `models/` directory. The model referenced here supports six speakers.
- Download our [TransformerTTS models](https://github.com/TartuNLP/text-to-speech-worker/releases/tag/v2.0.0) and place them inside the `models/` directory.
## Usage
@@ -39,13 +35,13 @@ A text file can be synthesized with the following command. Currently, the script can only read
the output in `.wav` format.
```
python synthesizer.py test.txt test.wav
python synthesizer.py --speaker albert test.txt test.wav
```
More information about script usage is available via the `--help` flag:
```
synthesizer.py [-h] [--checkpoint CHECKPOINT] [--preset PRESET] [--speaker-id SPEAKER_ID] input output
synthesizer.py [-h] [--speaker SPEAKER] [--speed SPEED] [--config CONFIG] input output
positional arguments:
input Input text file to synthesize.
@@ -53,7 +49,7 @@ positional arguments:
optional arguments:
-h, --help show this help message and exit
--checkpoint CHECKPOINT The checkpoint (model file) to load.
--preset PRESET Model preset file.
--speaker-id SPEAKER_ID The ID of the speaker to use for synthesis.
--speaker SPEAKER The name of the speaker to use for synthesis.
--speed SPEED Output speed multiplier.
--config CONFIG The config file to load.
```
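For example, assuming the speaker names from the example `config.yaml` in this repository, another voice can be selected like this:
```
python synthesizer.py --speaker mari test.txt test_mari.wav
```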
# Estonian Text-to-Speech
Scripts for Estonian multispeaker speech synthesis from text file input. This repository contains the following
submodules:
- [Deep Voice 3 adaptation for Estonian](https://github.com/TartuNLP/deepvoice3_pytorch)
- [Estonian text-to-speech preprocessing scripts](https://github.com/TartuNLP/tts_preprocess_et)
Scripts for Estonian multispeaker speech synthesis from text file input.
Speech synthesis was developed in collaboration with the [Estonian Language Institute](http://portaal.eki.ee/).
@@ -31,14 +27,13 @@ git clone --recurse-submodules https://koodivaramu.eesti.ee/tartunlp/text-to-spe
```
cd text-to-speech
conda env create -f environment.yml
conda activate deepvoice
pip install --no-deps -e "deepvoice3_pytorch/[bin]"
python -c 'import nltk; nltk.download("punkt"); nltk.download("cmudict")'
conda env create -f environments/environment.yml
conda activate transformer-tts
python -c 'import nltk; nltk.download("punkt")'
```
- Download our [Deep Voice 3 model](https://github.com/TartuNLP/deepvoice3_pytorch/releases/kratt-v1.2) and place it
inside the `models/` directory. The model referenced in this version supports six different speakers.
- Download our [TransformerTTS models](https://github.com/TartuNLP/text-to-speech-worker/releases/tag/v2.0.0) and
place them inside the `models/` directory (a possible layout is sketched below).
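A possible layout of the `models/` directory after downloading, matching the paths used in this repository's example `config.yaml` (the exact folder names are an assumption and depend on the release archives):
```
models/
├── tts/
│   ├── albert/
│   ├── kalev/
│   ├── kylli/
│   ├── mari/
│   ├── meelis/
│   └── vesta/
└── hifigan/
    ├── vctk/
    └── ljspeech/
```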
## Usage
@@ -46,13 +41,13 @@ A file can be synthesized with the following command. Currently, only plain text
audio is saved in `.wav` format.
```
python synthesizer.py test.txt test.wav
python synthesizer.py --speaker albert test.txt test.wav
```
More info about script usage can be found with the `--help` flag:
```
synthesizer.py [-h] [--checkpoint CHECKPOINT] [--preset PRESET] [--speaker-id SPEAKER_ID] input output
synthesizer.py [-h] [--speaker SPEAKER] [--speed SPEED] [--config CONFIG] input output
positional arguments:
input Input text file to synthesize.
@@ -60,7 +55,7 @@ positional arguments:
optional arguments:
-h, --help show this help message and exit
--checkpoint CHECKPOINT The checkpoint (model file) to load.
--preset PRESET Model preset file.
--speaker-id SPEAKER_ID The ID of the speaker to use for synthesis.
--speaker SPEAKER The name of the speaker to use for synthesis.
--speed SPEED Output speed multiplier.
--config CONFIG The config file to load.
```
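The synthesizer can also be called from Python. The sketch below mirrors what `synthesizer.py` itself does: load one speaker entry from `config.yaml`, build a `Synthesizer`, and write 22050 Hz 16-bit audio. The config path and speaker name are assumptions taken from the example configuration.
```
import numpy as np
import yaml
from scipy.io import wavfile

from synthesizer import Synthesizer  # Synthesizer class defined in this repository

# Load one speaker entry from the example config.yaml (speaker name assumed).
with open('config.yaml', 'r', encoding='utf-8') as f:
    speaker_config = yaml.safe_load(f)['speakers']['albert']

synthesizer = Synthesizer(**speaker_config)  # loads the TTS checkpoint and HiFi-GAN vocoder
waveform = synthesizer.synthesize('Tere, see on näitelause.', speed=1.0)
wavfile.write('test.wav', 22050, waveform.astype(np.int16))
```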
Subproject commit 4a624d1054c4da34e3544b87480872e3243845d6
### Transformer-TTS required configuration
wav_directory: ''
metadata_path: ''
log_directory: ''
train_data_directory: ''
data_config: 'TransformerTTS/config/data_config_est.yaml'
aligner_config: 'TransformerTTS/config/aligner_config.yaml'
tts_config: 'TransformerTTS/config/tts_config_est.yaml'
data_name: ''
speakers:
albert:
config_path: config.yaml
checkpoint_path: models/tts/albert
vocoder_path: models/hifigan/vctk
kalev:
config_path: config.yaml
checkpoint_path: models/tts/kalev
vocoder_path: models/hifigan/vctk
kylli:
config_path: config.yaml
checkpoint_path: models/tts/kylli
vocoder_path: models/hifigan/ljspeech
mari:
config_path: config.yaml
checkpoint_path: models/tts/mari
vocoder_path: models/hifigan/ljspeech
meelis:
config_path: config.yaml
checkpoint_path: models/tts/meelis
vocoder_path: models/hifigan/vctk
vesta:
config_path: config.yaml
checkpoint_path: models/tts/vesta
vocoder_path: models/hifigan/vctk
Subproject commit a04f0b31af667b06328dc9573cc1f458f09a5d73
name: deepvoice
channels:
- anaconda
- pytorch
- estnltk
- conda-forge
dependencies:
- python==3.7.10
- numpy=1.17.4
- scipy==1.3.2
- pytorch=1.8.1
- Unidecode==1.1.1
- inflect==4.0.0
- librosa==0.7.1
- numba==0.47.0
- nltk==3.6.2
- docopt==0.6.2
- tensorboardx=1.2
- estnltk=1.6.9b
- pysoundfile==0.10.3.post1
- pip
- pip:
- lws
- nnmnkwii>=0.0.19
name: transformer-tts
channels:
- conda-forge
- pypi
- estnltk
- pytorch
dependencies:
- python==3.7.10
- matplotlib==3.2.2
- librosa==0.7.1
- numpy==1.17.4
- ruamel.yaml==0.16.6
- tensorflow-gpu=2.2.0
- pysoundfile
- scipy
- nltk
- estnltk=1.6.9b
- pytorch==1.4.0
- torchvision==0.5.0
- pyyaml>=5.4.1
- cudatoolkit=10.1.243
- cudnn==7.6.5
- pip
- pip:
- webrtcvad
- pyworld
- phonemizer==2.2.1
name: transformer-tts
channels:
- conda-forge
- estnltk
- pytorch
dependencies:
- python==3.7.10
- matplotlib==3.2.2
- librosa==0.7.1
- numpy==1.17.4
- ruamel.yaml==0.16.6
- tensorflow=2.2.0
- pysoundfile
- scipy
- nltk
- estnltk==1.6.9b
- pytorch==1.4.0
- torchvision==0.5.0
- pyyaml>=5.4.1
- pip
- pip:
- webrtcvad
- pyworld
- phonemizer==2.2.1
# coding: utf-8
import os
import sys
import re
import numpy as np
import torch
from scipy.io import wavfile
from tqdm import tqdm
from hparams import hparams
from deepvoice3_pytorch import frontend
from train import build_model
import audio
import train
import yaml
from yaml.loader import SafeLoader
from nltk import sent_tokenize
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
sys.path.append(f'{os.path.dirname(os.path.realpath(__file__))}/TransformerTTS')
from TransformerTTS.utils.config_manager import Config
from vocoding.predictors import HiFiGANPredictor
from tts_preprocess_et.convert import convert_sentence
class Synthesizer:
def __init__(self, preset, checkpoint_path, fast=True):
self._frontend = None
self.use_cuda = torch.cuda.is_available()
self.device = torch.device("cuda" if self.use_cuda else "cpu")
self.model = None
self.preset = preset
self.silence = np.zeros(10000)
# Presets
with open(self.preset) as f:
hparams.parse_json(f.read())
self._frontend = getattr(frontend, hparams.frontend)
train._frontend = self._frontend
self.model = build_model()
# Load checkpoints separately
if self.use_cuda:
checkpoint = torch.load(checkpoint_path)
else:
checkpoint = torch.load(checkpoint_path, map_location=torch.device('cpu'))
self.model.load_state_dict(checkpoint["state_dict"])
self.model.seq2seq.decoder.max_decoder_steps = hparams.max_positions - 2
self.model = self.model.to(self.device)
self.model.eval()
if fast:
self.model.make_generation_fast_()
class Synthesizer:
def __init__(self, config_path: str, checkpoint_path: str, vocoder_path: str):
self.silence = np.zeros(10000, dtype=np.int16)
self.config = Config(config_path=config_path)
self.model = self.config.load_model(checkpoint_path=checkpoint_path)
self.vocoder = HiFiGANPredictor.from_folder(vocoder_path)
def synthesize(self, text, speaker_id=0, threshold=5):
"""Convert text to speech waveform given a deepvoice3 model.
print("Transformer-TTS initialized.")
def synthesize(self, text: str, speed: float = 1):
"""Convert text to speech waveform.
Args:
text (str) : Input text to be synthesized
speaker_id (int)
threshold (int) : Threshold for trimming stuttering at the end. Smaller threshold means more aggressive
trimming.
speed (float)
"""
def clean(sent):
sent = re.sub(r'[`´’\']', r'', sent)
sent = re.sub(r'[()]', r', ', sent)
try:
sent = convert_sentence(sent)
except Exception as ex:
print(f'ERROR: {str(ex)}, sentence: {sent}')
sent = re.sub(r'[()[\]:;−­–…—]', r', ', sent)
sent = re.sub(r'[«»“„”]', r'"', sent)
sent = re.sub(r'[*\'\\/-]', r' ', sent)
sent = re.sub(r'[`´’\']', r'', sent)
sent = re.sub(r' +([.,!?])', r'\g<1>', sent)
sent = re.sub(r', ?([.,?!])', r'\g<1>', sent)
sent = re.sub(r'\.+', r'.', sent)
sent = re.sub(r' +', r' ', sent)
sent = re.sub(r'^ | $', r'', sent)
sent = re.sub(r'^, ?', r'', sent)
sent = sent.lower()
sent = re.sub(re.compile(r'\s+'), ' ', sent)
return sent
waveforms = []
# The quotation marks need to be unified, otherwise sentence tokenization won't work
@@ -63,35 +62,12 @@ class Synthesizer:
sentences = sent_tokenize(text, 'estonian')
for i, sentence in enumerate(tqdm(sentences, unit="sentence")):
sequence = np.array(self._frontend.text_to_sequence(sentence))
sequence = torch.from_numpy(sequence).unsqueeze(0).long().to(self.device)
text_positions = torch.arange(1, sequence.size(-1) + 1).unsqueeze(0).long().to(self.device)
speaker_ids = None if speaker_id is None else torch.LongTensor([speaker_id]).to(self.device)
if text_positions.size()[1] >= hparams.max_positions:
raise ValueError("Input contains sentences that are too long.")
# Greedy decoding
with torch.no_grad():
mel_outputs, linear_outputs, alignments, done = self.model(
sequence, text_positions=text_positions, speaker_ids=speaker_ids)
linear_output = linear_outputs[0].cpu().data.numpy()
alignment = alignments[0].cpu().data.numpy()
# Predicted audio signal
waveform = audio.inv_spectrogram(linear_output.T)
# Cutting predicted signal to remove stuttering from the end of synthesized audio
last_row = np.transpose(alignment)[-1]
repetitions = np.where(last_row > 0)[0]
if repetitions.size > threshold:
end = repetitions[threshold]
end = int(end * len(waveform) / last_row.size)
waveform = waveform[:end]
sentence = clean(sentence)
out = self.model.predict(sentence, speed_regulator=speed)
waveform = self.vocoder([out['mel'].numpy().T])
if i != 0:
waveforms.append(self.silence)
waveforms.append(waveform)
waveforms.append(waveform[0])
waveform = np.concatenate(waveforms)
@@ -105,19 +81,22 @@ if __name__ == '__main__':
parser.add_argument('input', type=FileType('r'),
help="Input text file to synthesize.")
parser.add_argument('output', type=FileType('w'),
help="Output .wav file path.")
parser.add_argument('--checkpoint', type=FileType('r'), default='models/checkpoint.pth',
help="The checkpoint (model file) to load.")
parser.add_argument('--preset', type=FileType('r'), default='deepvoice3_pytorch/presets/eesti_konekorpus.json',
help="Model preset file.")
parser.add_argument('--speaker-id', type=int, default=0,
help="The ID of the speaker to use for synthesis.")
help="Output .wav file path."),
parser.add_argument('--speaker', type=str, required=True,
help="The name of the speaker to use for synthesis.")
parser.add_argument('--speed', type=float, default=1.0,
help="Output speed multiplier.")
parser.add_argument('--config', type=FileType('r'), default='config.yaml',
help="The config file to load.")
args = parser.parse_known_args()[0]
synthesizer = Synthesizer(args.preset.name, args.checkpoint.name)
with open(args.config.name, 'r', encoding='utf-8') as f:
config = yaml.load(f, Loader=SafeLoader)['speakers'][args.speaker]
synthesizer = Synthesizer(**config)
with open(args.input.name, 'r', encoding='utf-8') as f:
text = f.read()
waveform = synthesizer.synthesize(text, args.speaker_id)
audio.save_wav(waveform, args.output.name)
waveform = synthesizer.synthesize(text, speed=args.speed)
wavfile.write(args.output.name, 22050, waveform.astype(np.int16))
Subproject commit 90f99e8bdde26f3102a2df97c6266d3e9a93dee7
## Code from https://github.com/as-ideas/TransformerTTS/tree/vocoding
"""
CODE TAKEN FROM
https://github.com/jik876/hifi-gan
"""
import os
import shutil
class AttrDict(dict):
def __init__(self, *args, **kwargs):
super(AttrDict, self).__init__(*args, **kwargs)
self.__dict__ = self
def build_env(config, config_name, path):
t_path = os.path.join(path, config_name)
if config != t_path:
os.makedirs(path, exist_ok=True)
shutil.copyfile(config, os.path.join(path, config_name))
"""
CODE TAKEN FROM
https://github.com/jik876/hifi-gan
"""
import torch
import torch.nn.functional as F
import torch.nn as nn
from torch.nn import Conv1d, ConvTranspose1d, AvgPool1d, Conv2d
from torch.nn.utils import weight_norm, remove_weight_norm, spectral_norm
from vocoding.hifigan.utils import init_weights, get_padding
LRELU_SLOPE = 0.1
class ResBlock1(torch.nn.Module):
def __init__(self, h, channels, kernel_size=3, dilation=(1, 3, 5)):
super(ResBlock1, self).__init__()
self.h = h
self.convs1 = nn.ModuleList([
weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
padding=get_padding(kernel_size, dilation[0]))),
weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
padding=get_padding(kernel_size, dilation[1]))),
weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[2],
padding=get_padding(kernel_size, dilation[2])))
])
self.convs1.apply(init_weights)
self.convs2 = nn.ModuleList([
weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
padding=get_padding(kernel_size, 1))),
weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
padding=get_padding(kernel_size, 1))),
weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
padding=get_padding(kernel_size, 1)))
])
self.convs2.apply(init_weights)
def forward(self, x):
for c1, c2 in zip(self.convs1, self.convs2):
xt = F.leaky_relu(x, LRELU_SLOPE)
xt = c1(xt)
xt = F.leaky_relu(xt, LRELU_SLOPE)
xt = c2(xt)
x = xt + x
return x
def remove_weight_norm(self):
for l in self.convs1:
remove_weight_norm(l)
for l in self.convs2:
remove_weight_norm(l)
class ResBlock2(torch.nn.Module):
def __init__(self, h, channels, kernel_size=3, dilation=(1, 3)):
super(ResBlock2, self).__init__()
self.h = h
self.convs = nn.ModuleList([
weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
padding=get_padding(kernel_size, dilation[0]))),
weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
padding=get_padding(kernel_size, dilation[1])))
])
self.convs.apply(init_weights)
def forward(self, x):
for c in self.convs:
xt = F.leaky_relu(x, LRELU_SLOPE)
xt = c(xt)
x = xt + x
return x
def remove_weight_norm(self):
for l in self.convs:
remove_weight_norm(l)
class Generator(torch.nn.Module):
def __init__(self, h):
super(Generator, self).__init__()
self.h = h
self.num_kernels = len(h.resblock_kernel_sizes)
self.num_upsamples = len(h.upsample_rates)
self.conv_pre = weight_norm(Conv1d(80, h.upsample_initial_channel, 7, 1, padding=3))
resblock = ResBlock1 if h.resblock == '1' else ResBlock2
self.ups = nn.ModuleList()
for i, (u, k) in enumerate(zip(h.upsample_rates, h.upsample_kernel_sizes)):
self.ups.append(weight_norm(
ConvTranspose1d(h.upsample_initial_channel//(2**i), h.upsample_initial_channel//(2**(i+1)),
k, u, padding=(k-u)//2)))
self.resblocks = nn.ModuleList()
for i in range(len(self.ups)):
ch = h.upsample_initial_channel//(2**(i+1))
for j, (k, d) in enumerate(zip(h.resblock_kernel_sizes, h.resblock_dilation_sizes)):
self.resblocks.append(resblock(h, ch, k, d))
self.conv_post = weight_norm(Conv1d(ch, 1, 7, 1, padding=3))
self.ups.apply(init_weights)
self.conv_post.apply(init_weights)
def forward(self, x):
x = self.conv_pre(x)
for i in range(self.num_upsamples):
x = F.leaky_relu(x, LRELU_SLOPE)
x = self.ups[i](x)
xs = None
for j in range(self.num_kernels):
if xs is None:
xs = self.resblocks[i*self.num_kernels+j](x)
else:
xs += self.resblocks[i*self.num_kernels+j](x)
x = xs / self.num_kernels
x = F.leaky_relu(x)
x = self.conv_post(x)
x = torch.tanh(x)
return x
def remove_weight_norm(self):
print('Removing weight norm...')
for l in self.ups:
remove_weight_norm(l)
for l in self.resblocks:
l.remove_weight_norm()
remove_weight_norm(self.conv_pre)
remove_weight_norm(self.conv_post)
class DiscriminatorP(torch.nn.Module):
def __init__(self, period, kernel_size=5, stride=3, use_spectral_norm=False):
super(DiscriminatorP, self).__init__()
self.period = period
norm_f = weight_norm if use_spectral_norm == False else spectral_norm
self.convs = nn.ModuleList([
norm_f(Conv2d(1, 32, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
norm_f(Conv2d(32, 128, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
norm_f(Conv2d(128, 512, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
norm_f(Conv2d(512, 1024, (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
norm_f(Conv2d(1024, 1024, (kernel_size, 1), 1, padding=(2, 0))),
])
self.conv_post = norm_f(Conv2d(1024, 1, (3, 1), 1, padding=(1, 0)))
def forward(self, x):
fmap = []
# 1d to 2d
b, c, t = x.shape