Commit 611a80d9 authored by Liisa Rätsep

TTS API v1.0

parent b0296a8e
/.idea/
config.json
server.wsgi
*.pth
__pycache__/
[submodule "deepvoice3_pytorch"]
path = deepvoice3_pytorch
url = https://github.com/TartuNLP/deepvoice3_pytorch
# voice
# Estonian Text-to-Speech API
This repository contains a simple API for running an Estonian multispeaker speech synthesis server. The code includes submodules that point to the following speech synthesis components:
- [Deep Voice 3 adapted for Estonian](https://github.com/TartuNLP/deepvoice3_pytorch)
- [Estonian text-to-speech preprocessing](https://github.com/TartuNLP/tts_preprocess_et)
Speech synthesis was developed in collaboration with the [Estonian Language Institute](http://portaal.eki.ee/) and can also be used via our [web demo](https://www.neurokone.ee).
## Usage
To use the API, send the web server a POST request in the following format, where the `text` parameter contains the text to be synthesized and `speaker_id` the desired voice.
POST `/api/v1.0/synthesize`
BODY (JSON):
```
{
"text": "Tere.",
"speaker_id": 0
}
```
The server returns the synthesized audio as a binary .wav file. The `speaker_id` parameter is optional; by default, the first voice is used.
The [model](https://github.com/TartuNLP/deepvoice3_pytorch/releases/tag/kratt-v1.0) referenced in this version supports six different voices.
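For example, using Python's `requests` library (a minimal sketch, assuming the server runs locally on Flask's default port 5000):
```
import requests

# Minimal example request; host and port are assumptions about the local setup.
response = requests.post(
    "http://localhost:5000/api/v1.0/synthesize",
    json={"text": "Tere.", "speaker_id": 0},
)
with open("tere.wav", "wb") as f:
    f.write(response.content)  # raw .wav bytes
```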
## Requirements and installation
These instructions have been tested on Ubuntu 18.04. The code is compatible with both CPU and GPU.
- Make sure the following components are installed:
  - Conda (see https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html)
  - GNU Compiler Collection (run `sudo apt install build-essential`)
- Clone this repository with its submodules:
```
git clone --recurse-submodules https://koodivaramu.eesti.ee/tartunlp/text-to-speech
```
- Create and activate a Conda environment:
```
cd text-to-speech
conda env create -f environment.yml
conda activate deepvoice
pip install --no-deps -e "deepvoice3_pytorch/[bin]"
python -c 'import nltk; nltk.download("punkt"); nltk.download("cmudict")'
```
- Download our [Deep Voice 3 model](https://github.com/TartuNLP/deepvoice3_pytorch/releases/download/kratt-v1.0/checkpoint.pth)
- Create a configuration file. Make sure the `checkpoint` parameter points to the model file downloaded in the previous step.
```
cp config.sample.json config.json
```
Configure a web server to run `tts_server.py`, or test the API with:
```
export FLASK_APP=tts_server.py
flask run
```
# Estonian Text-to-Speech API
A simple Flask API for Estonian multispeaker speech synthesis. This repository contains the following submodules:
- [Deep Voice 3 adaptation for Estonian](https://github.com/TartuNLP/deepvoice3_pytorch)
- [Estonian text-to-speech preprocessing scripts](https://github.com/TartuNLP/tts_preprocess_et)
Speech synthesis was developed in collaboration with the [Estonian Language Institute](http://portaal.eki.ee/) and
can also be used via our [web demo](https://www.neurokone.ee).
## API usage
To use the API, send a POST request in the following format.
POST `/api/v1.0/synthesize`
BODY (JSON):
```
{
"text": "Tere."
"speaker_id": 0
}
```
Upon such a request, the server returns a binary stream of the synthesized audio in .wav format. The `speaker_id` parameter is optional; by default, the first speaker is selected.
The [model](https://github.com/TartuNLP/deepvoice3_pytorch/releases/tag/kratt-v1.0) referenced in this version supports six different speakers.
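For example, using Python's `requests` library (a minimal sketch, assuming the server runs locally on Flask's default port 5000):
```
import requests

# Minimal example request; host and port are assumptions about the local setup.
response = requests.post(
    "http://localhost:5000/api/v1.0/synthesize",
    json={"text": "Tere.", "speaker_id": 0},
)
with open("tere.wav", "wb") as f:
    f.write(response.content)  # raw .wav bytes
```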
## Requirements and installation
The following installation instructions have been tested on Ubuntu 18.04. The code is both CPU and GPU compatible.
- Make sure you have the following prerequisites installed:
  - Conda (see https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html)
  - GNU Compiler Collection (run `sudo apt install build-essential`)
- Clone this repository with its submodules:
```
git clone --recurse-submodules https://koodivaramu.eesti.ee/tartunlp/text-to-speech
```
- Create and activate a Conda environment with all dependencies.
```
cd text-to-speech
conda env create -f environment.yml
conda activate deepvoice
pip install --no-deps -e "deepvoice3_pytorch/[bin]"
python -c 'import nltk; nltk.download("punkt"); nltk.download("cmudict")'
```
- Download our [Deep Voice 3 model](https://github.com/TartuNLP/deepvoice3_pytorch/releases/download/kratt-v1.0/checkpoint.pth)
- Create a configuration file and change any defaults as needed. Make sure that the `checkpoint` parameter points to
the model file you just downloaded.
```
cp config.sample.json config.json
```
Configure a web server to run `tts_server.py` or test the API with:
```
export FLASK_APP=tts_server.py
flask run
```
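The code in `tts_server.py` also registers a second endpoint, `/api/v1.0/synthesize/json`, which returns the same audio wrapped in a JSON object, with the .wav bytes decoded as an ISO-8859-1 string. A minimal client sketch, again assuming a local server on Flask's default port 5000:
```
import requests

# The 'audio' field holds the .wav bytes decoded as ISO-8859-1 (see tts_server.py),
# so the client re-encodes the string to recover the original bytes.
response = requests.post(
    "http://localhost:5000/api/v1.0/synthesize/json",
    json={"text": "Tere.", "speaker_id": 0},
)
wav_bytes = response.json()["audio"].encode("ISO-8859-1")
with open("tere.wav", "wb") as f:
    f.write(wav_bytes)
```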
{
  "checkpoint": "checkpoint.pth",
  "preset": "deepvoice3_pytorch/presets/eesti_konekorpus.json",
  "trim_threshold": 5,
  "host": "0.0.0.0",
  "port": 80,
  "allowed_speakers": [0, 1, 2, 3, 4, 5]
}
Subproject commit a46a4dce9dbce1761b1a3bcf8820f59a4841ee2f
name: deepvoice
channels:
  - anaconda
  - estnltk
  - conda-forge
  - defaults
dependencies:
  - python=3.6.4
  - numpy=1.17.4=py36hc1035e2_0
  - scipy==1.3.2
  - pytorch=1.3.1=cuda92py36hb0ba70e_0
  - Unidecode==1.1.1
  - inflect==4.0.0
  - librosa==0.7.1
  - numba==0.47.0
  - nltk==3.2.5
  - docopt==0.6.2
  - tensorboardx=1.2
  - estnltk=1.6.5beta=3.6
  - scikit-learn==0.22.1
  - flask=0.12.2
  - flask-cors=3.0.7
  - flask-restful=0.3.7
  - keras=2.2.4
  - pandas=0.23.4
  - pip
  - pip:
      - lws
      - nnmnkwii>=0.0.19
# coding: utf-8
import io
import re
import numpy as np
import torch

from hparams import hparams
from deepvoice3_pytorch import frontend
from train import build_model
import audio
import train
from nltk import sent_tokenize


class Synthesizer:
    def __init__(self, preset, checkpoint_path, fast=True):
        self._frontend = None
        self.use_cuda = torch.cuda.is_available()
        self.device = torch.device("cuda" if self.use_cuda else "cpu")
        self.model = None
        self.preset = preset
        self.silence = np.zeros(10000)

        # Presets
        with open(self.preset) as f:
            hparams.parse_json(f.read())

        self._frontend = getattr(frontend, hparams.frontend)
        train._frontend = self._frontend
        self.model = build_model()

        # Load checkpoints separately
        if self.use_cuda:
            checkpoint = torch.load(checkpoint_path)
        else:
            checkpoint = torch.load(checkpoint_path, map_location=torch.device('cpu'))
        self.model.load_state_dict(checkpoint["state_dict"])

        # TODO handling longer inputs
        self.model.seq2seq.decoder.max_decoder_steps = hparams.max_positions - 2

        self.model = self.model.to(self.device)
        self.model.eval()
        if fast:
            self.model.make_generation_fast_()

    def synthesize(self, text, speaker_id=0, threshold=5):
        """Convert text to speech waveform given a deepvoice3 model.

        Args:
            text (str) : Input text to be synthesized
            speaker_id (int)
            threshold (int) : Threshold for trimming stuttering at the end. A smaller threshold means more
                aggressive trimming.
        """
        waveforms = []

        # The quotation marks need to be unified, otherwise sentence tokenization won't work
        text = re.sub(r'[«»“„]', r'"', text)

        for i, sentence in enumerate(sent_tokenize(text, 'estonian')):
            sequence = np.array(self._frontend.text_to_sequence(sentence))
            sequence = torch.from_numpy(sequence).unsqueeze(0).long().to(self.device)
            text_positions = torch.arange(1, sequence.size(-1) + 1).unsqueeze(0).long().to(self.device)
            speaker_ids = None if speaker_id is None else torch.LongTensor([speaker_id]).to(self.device)
            if text_positions.size()[1] >= hparams.max_positions:
                raise ValueError("Input contains sentences that are too long.")

            # Greedy decoding
            with torch.no_grad():
                mel_outputs, linear_outputs, alignments, done = self.model(
                    sequence, text_positions=text_positions, speaker_ids=speaker_ids)

            linear_output = linear_outputs[0].cpu().data.numpy()
            alignment = alignments[0].cpu().data.numpy()

            # Predicted audio signal
            waveform = audio.inv_spectrogram(linear_output.T)

            # Cutting predicted signal to remove stuttering from the end of synthesized audio
            last_row = np.transpose(alignment)[-1]
            repetitions = np.where(last_row > 0)[0]
            if repetitions.size > threshold:
                end = repetitions[threshold]
                end = int(end * len(waveform) / last_row.size)
                waveform = waveform[:end]

            # Insert a short pause between sentences
            if i != 0:
                waveforms.append(self.silence)
            waveforms.append(waveform)

        waveform = np.concatenate(waveforms)
        out = io.BytesIO()
        audio.save_wav(waveform, out)
        return out
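For reference, a minimal standalone usage sketch of the `Synthesizer` class above; the preset and checkpoint paths follow `config.sample.json` and are assumptions about the local setup:
```
from synthesizer import Synthesizer

# Paths follow config.sample.json and assume the repository root as the working directory.
synthesizer = Synthesizer("deepvoice3_pytorch/presets/eesti_konekorpus.json", "checkpoint.pth")
# synthesize() returns an io.BytesIO holding the finished .wav data.
wav = synthesizer.synthesize("Tere, maailm!", speaker_id=0, threshold=5)
with open("tere.wav", "wb") as f:
    f.write(wav.getvalue())
```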
# coding: utf-8
import os
import json

from flask_cors import CORS
from flask import Flask, send_file, jsonify
from flask_restful import Api, Resource, reqparse, abort

from hparams import hparams, hparams_debug_string
from synthesizer import Synthesizer

with open('config.json') as config_file:
    config = json.load(config_file)

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

app = Flask(__name__)
app.config['UPLOAD_FOLDER'] = 'uploads'
api = Api(app)
CORS(app)

with open(config['preset']) as f:
    hparams.parse_json(f.read())
print(hparams_debug_string())

synthesizer = Synthesizer(config['preset'], config['checkpoint'])
print("Deepvoice3 initialized.")


@app.route('/')
def index():
    return "Eestikeelse kõnesünteesi API"


def synthesize(args):
    text = args.get('text')
    speaker_id = args.get('speaker_id')
    # Fall back to the default speaker for empty input or unknown speaker IDs
    if text == '' or speaker_id not in config['allowed_speakers']:
        speaker_id = config['allowed_speakers'][0]
    try:
        return synthesizer.synthesize(text, speaker_id, config['trim_threshold'])
    except ValueError:
        abort(413)


class AudioAPI(Resource):
    def __init__(self):
        self.reqparse = reqparse.RequestParser()
        self.reqparse.add_argument('text', type=str, required=True, help='No text provided', location='json')
        self.reqparse.add_argument('speaker_id', type=int, default=config['allowed_speakers'][0],
                                   help='No speaker id provided', location='json')
        super(AudioAPI, self).__init__()

    def post(self):
        data = synthesize(self.reqparse.parse_args())
        return send_file(data, mimetype='audio/wav')


class AudioAPIJSON(Resource):
    def __init__(self):
        self.reqparse = reqparse.RequestParser()
        self.reqparse.add_argument('text', type=str, required=True, help='No text provided', location='json')
        self.reqparse.add_argument('speaker_id', type=int, default=config['allowed_speakers'][0],
                                   help='No speaker id provided', location='json')
        super(AudioAPIJSON, self).__init__()

    def post(self):
        data = synthesize(self.reqparse.parse_args())
        byte_str = data.getvalue()
        # Decode the raw .wav bytes so they can be embedded in a JSON response
        new_data = byte_str.decode('ISO-8859-1')
        return jsonify({'audio': new_data})


api.add_resource(AudioAPI, '/api/v1.0/synthesize', endpoint='audio')
api.add_resource(AudioAPIJSON, '/api/v1.0/synthesize/json', endpoint='audio-json')

if __name__ == '__main__':
    app.run(config['host'], config['port'], debug=True)