Commit 5a0f49bf authored by Ubuntu

added files to koodivaramu

parent d31f6de7
# Neurotõlge
# TartuNLP translate
Multilingual multi-domain neural machine translation. The current version supports seven languages as both input and output: Estonian, Latvian, Lithuanian, English, Russian, German and Finnish.
Development status is tracked on the [GitHub page](https://github.com/tartunlp/kama).
## Usage
This machine translation package includes scripts and trained models. It can be used directly as a Linux script or as a simple server that loads the models into memory and responds to translation requests.
A working version of the same models runs online at https://translate.ut.ee, which serves both as a web demo and supports integration with the SDL Studio, MemoQ and MemSource translation tools.
# TartuNLP translate
Multilingual multi-domain neural machine translation. The current version supports seven languages as both input and output: Estonian, Latvian, Lithuanian, English, Russian, German and Finnish.
Development status is shown on the [GitHub page](https://github.com/tartunlp/kama).
## Usage:
This machine translation package includes scripts and trained models. You can use it directly as a Linux script or as a simple server that loads the models and responds to translation requests.
A working version of the same models runs online at https://translate.ut.ee, which serves both as a web demo and supports integration with SDL Studio, MemoQ and MemSource.
Neural machine translation works best when it is fine-tuned and customized. If you are interested in such a possibility, contact the [TartuNLP](https://tartunlp.ai) research group!
# Requirements:
```
pip3 install mxnet sentencepiece sockeye
```
# Usage in command-line:
```
cat input_text | ./nmt.py translation_model truecaser_model segmenter_model [output_lang [output_domain]]
translation_model: path to a trained Sockeye model folder (here: models/translation)
truecaser_model: path to a trained TartuNLP truecaser model file (here: models/preprocessing/truecasing.model)
segmenter_model: path to a trained Google SentencePiece model file (here: models/preprocessing/sentencepiece.model)
output_lang: output language (one of the following: lt, fi, de, ru, lv, et, en)
output_domain: output domain (one of the following: nc, jr, ep, em, os)
```
Domains, with the corpora they are based on:
* nc: news (News crawl corpus)
* jr: legal (JRC-Acquis)
* ep: official speech (Europarl)
* em: medical (EMEA)
* os: subtitles (OPUS OpenSubs)
# Usage as a socket server:
```
./nmt.py translation_model truecaser_model segmenter_model
```
The server uses low-level socket communication; the communication protocol is equivalent to what [Sauron](https://github.com/TartuNLP/sauron) uses.
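The request and response bodies are JSON, as the `decodeRequest`/`encodeResponse` helpers later in this commit show. The sketch below illustrates only the two JSON shapes; the framing around them (length prefixes, connection handling) is whatever the `sock` module implements and is not shown here.

```python
import json

# Request: 'src' holds the raw text, 'conf' a comma-separated list of
# output language and style, e.g. "en,nc" (shape taken from parseInput).
request = json.dumps({"src": "Tere, maailm!", "conf": "en,nc"}).encode("utf-8")

# Response: the server answers with 'final_trans' holding the joined
# translations; 'raw_trans' and 'raw_input' are placeholder fields.
response = json.dumps({"raw_trans": ["-"],
                       "raw_input": ["-"],
                       "final_trans": "Hello, world!"}).encode("utf-8")

print(json.loads(request.decode("utf-8"))["conf"])
print(json.loads(response.decode("utf-8"))["final_trans"])
```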
from collections import defaultdict
_styleConstraints = defaultdict(lambda: None)
_styleConstraints['ep'] = "▁sa ▁Sa ▁SA ▁sina ▁su ▁Su ▁sinu ▁Sinu ▁sul ▁Sul ▁sulle ▁sinuga ▁arvad ▁tahad ▁oled ▁soovid ▁du ▁dir ▁dich ▁dein ▁deine ▁deinen ▁deiner ▁deines ▁du ▁dir ▁dich ▁dein ▁deine ▁deinen ▁deiner ▁deines".split(" ")
_styleConstraints['ep'] += "▁ты ▁Ты ▁тебя ▁тебе ▁Тебе ▁тобой ▁твой ▁твоё ▁твоему ▁твоим ▁твои".split(" ")
_styleConstraints['ep'] += "▁tu ▁Tu ▁tev ▁tevi".split(" ")
_styleConstraints['os'] = "▁te ▁Te ▁teie ▁teid ▁teile ▁Teile ▁teil ▁Teil ▁teilt ▁Teilt ▁Sie ▁Ihne ▁Ihnen ▁Ihner ▁Ihnes ▁Ihn".split(" ")
_styleConstraints['os'] += "▁sir ▁Sir ▁ser ▁Ser".split(" ")
_styleConstraints['os'] += "▁сэр".split(" ")
_styleConstraints['os'] += "▁söör".split(" ")
_styleConstraints['os'] += "▁вы ▁Вы ▁вас ▁Вас ▁вам ▁Вам ▁вами ▁ваш ▁Ваш ▁ваши ▁вашего".split(" ")
_styleConstraints['os'] += "▁jūs ▁Jūs ▁jūsu ▁jums ▁Jums".split(" ")
def getPolitenessConstraints():
    return _styleConstraints
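A quick sketch of the lookup behaviour of the table above (only a few tokens shown): styles without an entry fall back to `None`, which downstream code can treat as "no politeness constraints".

```python
from collections import defaultdict

# Minimal reconstruction of the constraint table; the full token lists
# live in the file above.
styleConstraints = defaultdict(lambda: None)
styleConstraints['ep'] = "▁sa ▁Sa ▁sina".split(" ")  # informal pronouns to avoid
styleConstraints['os'] = "▁te ▁Te ▁teie".split(" ")  # formal pronouns to avoid

print(styleConstraints['ep'])  # the 'ep' style suppresses informal forms
print(styleConstraints['nc'])  # None: no constraints registered for 'nc'
```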
import sys
from datetime import datetime
def log(msg):
    msg = "[DEBUG {0}] {1}\n".format(datetime.now(), msg)
    #for channel in (sys.stderr, sys.stdout):
    for channel in (sys.stderr,):
        channel.write(msg)
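A quick check of the helper's output format (a sketch that redefines the same `log` so it is self-contained, and redirects stderr into a buffer):

```python
import io
import sys
from datetime import datetime

def log(msg):
    # same format as the helper above
    sys.stderr.write("[DEBUG {0}] {1}\n".format(datetime.now(), msg))

buf = io.StringIO()
orig_stderr = sys.stderr
sys.stderr = buf
try:
    log("loading models")
finally:
    sys.stderr = orig_stderr

print(buf.getvalue().startswith("[DEBUG "))
```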
!ModelConfig
config_data: !DataConfig
  data_statistics: !DataStatistics
    average_len_target_per_bucket:
    - 6.536928242909892
    - 13.184204769653379
    - 22.78215330335633
    - 32.37527315664415
    - 41.68596697000548
    - 50.84947788028265
    - 59.944743205427585
    - 68.92119412335444
    - 77.96751959288957
    - 87.00002761572998
    buckets:
    - !!python/tuple
      - 10
      - 10
    - !!python/tuple
      - 20
      - 20
    - !!python/tuple
      - 30
      - 30
    - !!python/tuple
      - 40
      - 40
    - !!python/tuple
      - 50
      - 50
    - !!python/tuple
      - 60
      - 60
    - !!python/tuple
      - 70
      - 70
    - !!python/tuple
      - 80
      - 80
    - !!python/tuple
      - 90
      - 90
    - !!python/tuple
      - 100
      - 100
    length_ratio_mean: 1.0514363227034098
    length_ratio_stats_per_bucket:
    - !!python/tuple
      - 1.045709571865215
      - 0.32830897165548817
    - !!python/tuple
      - 1.0707990169687622
      - 0.4338813610457235
    - !!python/tuple
      - 1.0560343657673241
      - 0.39781163738960607
    - !!python/tuple
      - 1.0380085956349387
      - 0.324353860351208
    - !!python/tuple
      - 1.0314125044533415
      - 0.2892601348508189
    - !!python/tuple
      - 1.029726634263863
      - 0.27607792859900804
    - !!python/tuple
      - 1.0308724126318582
      - 0.2801743524682548
    - !!python/tuple
      - 1.033674946782453
      - 0.29573625023977324
    - !!python/tuple
      - 1.03455966298486
      - 0.302222607622533
    - !!python/tuple
      - 1.0381309333356483
      - 0.3468416947461553
    length_ratio_std: 0.36694990539198924
    max_observed_len_source: 100
    max_observed_len_target: 100
    num_discarded: 537866
    num_sents: 54269486
    num_sents_per_bucket:
    - 15771208
    - 15821485
    - 8167349
    - 5501422
    - 3688284
    - 2325233
    - 1394091
    - 830065
    - 480659
    - 289690
    num_tokens_source: 1151311751
    num_tokens_target: 1151311685
    num_unks_source: 0
    num_unks_target: 0
    size_vocab_source: 51615
    size_vocab_target: 51615
  max_seq_len_source: 100
  max_seq_len_target: 100
  num_source_factors: 5
  source_with_eos: true
config_decoder: !TransformerConfig
  act_type: relu
  attention_heads: 8
  conv_config: null
  dropout_act: 0.1
  dropout_attention: 0.1
  dropout_prepost: 0.1
  dtype: float32
  feed_forward_num_hidden: 2048
  lhuc: false
  max_seq_len_source: 100
  max_seq_len_target: 100
  model_size: 512
  num_layers: 6
  positional_embedding_type: fixed
  postprocess_sequence: dr
  preprocess_sequence: n
  use_lhuc: false
config_embed_source: !EmbeddingConfig
  dropout: 0.0
  dtype: float32
  factor_configs:
  - !FactorConfig
    _frozen: false
    num_embed: 4
    vocab_size: 15
  - !FactorConfig
    _frozen: false
    num_embed: 4
    vocab_size: 21
  - !FactorConfig
    _frozen: false
    num_embed: 4
    vocab_size: 69
  - !FactorConfig
    _frozen: false
    num_embed: 4
    vocab_size: 69
  num_embed: 512
  num_factors: 5
  source_factors_combine: concat
  vocab_size: 51615
config_embed_target: !EmbeddingConfig
  dropout: 0.0
  dtype: float32
  factor_configs: null
  num_embed: 512
  num_factors: 1
  source_factors_combine: concat
  vocab_size: 51615
config_encoder: !TransformerConfig
  act_type: relu
  attention_heads: 8
  conv_config: null
  dropout_act: 0.1
  dropout_attention: 0.1
  dropout_prepost: 0.1
  dtype: float32
  feed_forward_num_hidden: 2048
  lhuc: false
  max_seq_len_source: 100
  max_seq_len_target: 100
  model_size: 528
  num_layers: 6
  positional_embedding_type: fixed
  postprocess_sequence: dr
  preprocess_sequence: n
  use_lhuc: false
config_length_task: null
config_length_task_loss: null
config_loss: !LossConfig
  label_smoothing: 0.1
  length_task_link: null
  length_task_weight: 1.0
  name: cross-entropy
  normalization_type: valid
  vocab_size: 51615
lhuc: false
num_pointers: 0
vocab_source_size: 51615
vocab_target_size: 51615
weight_normalization: false
weight_tying: false
weight_tying_type: null
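Note the encoder's `model_size: 528` versus the decoder's 512: with `source_factors_combine: concat`, the four 4-dimensional factor embeddings appear to be concatenated onto the 512-dimensional word embedding (`num_source_factors: 5` counts the word itself plus four factors). A quick check of the arithmetic:

```python
# Word embedding plus four 4-dimensional factor embeddings, concatenated.
word_embed = 512              # num_embed in config_embed_source
factor_embeds = [4, 4, 4, 4]  # num_embed of each !FactorConfig
encoder_input_size = word_embed + sum(factor_embeds)
print(encoder_input_size)  # matches the encoder's model_size of 528
```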
corpus factors:
nc - NewsCommentary + ParaCrawl
os - OpenSubtitles
ep - Europarl + MultiUN
1.18.106
{
"<pad>": 0,
"<unk>": 1,
"<s>": 2,
"</s>": 3,
"en": 4,
"de": 5,
"et": 6,
"lt": 7,
"lv": 8,
"fi": 9,
"ru": 10,
"l4": 11,
"l3": 12,
"l2": 13,
"l1": 14
}
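Factor vocabularies like the one above map each factor token (here, output-language codes) to an integer id, with unknown tokens falling back to `<unk>`. A small sketch, with a few entries copied from the file:

```python
import json

# Subset of the language-factor vocabulary above.
vocab = json.loads('{"<pad>": 0, "<unk>": 1, "<s>": 2, "</s>": 3, '
                   '"en": 4, "de": 5, "et": 6, "lt": 7}')

def token_id(tok):
    # unknown tokens map to the <unk> id
    return vocab.get(tok, vocab["<unk>"])

print(token_id("et"))  # 6
print(token_id("xx"))  # 1 (falls back to <unk>)
```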
{
"<pad>": 0,
"<unk>": 1,
"<s>": 2,
"</s>": 3,
"dg": 4,
"ep": 5,
"jr": 6,
"em": 7,
"os": 8,
"pc": 9,
"nc": 10,
"d9": 11,
"d8": 12,
"d7": 13,
"d6": 14,
"d5": 15,
"d4": 16,
"d3": 17,
"d2": 18,
"d10": 19,
"d1": 20
}
{
"<pad>": 0,
"<unk>": 1,
"<s>": 2,
"</s>": 3,
"f0": 4,
"▁f9": 5,
"▁f8": 6,
"▁f7": 7,
"▁f6": 8,
"▁f5": 9,
"▁f4": 10,
"▁f3": 11,
"▁f2": 12,
"▁f15": 13,
"▁f14": 14,
"▁f13": 15,
"▁f12": 16,
"▁f11": 17,
"▁f10": 18,
"▁f1": 19,
"▁f0": 20,
"▁f63": 21,
"▁f62": 22,
"▁f61": 23,
"▁f60": 24,
"▁f59": 25,
"▁f58": 26,
"▁f57": 27,
"▁f56": 28,
"▁f55": 29,
"▁f54": 30,
"▁f53": 31,
"▁f52": 32,
"▁f51": 33,
"▁f50": 34,
"▁f49": 35,
"▁f48": 36,
"▁f47": 37,
"▁f46": 38,
"▁f45": 39,
"▁f44": 40,
"▁f43": 41,
"▁f42": 42,
"▁f41": 43,
"▁f40": 44,
"▁f39": 45,
"▁f38": 46,
"▁f37": 47,
"▁f36": 48,
"▁f35": 49,
"▁f34": 50,
"▁f33": 51,
"▁f32": 52,
"▁f31": 53,
"▁f30": 54,
"▁f29": 55,
"▁f28": 56,
"▁f27": 57,
"▁f26": 58,
"▁f25": 59,
"▁f24": 60,
"▁f23": 61,
"▁f22": 62,
"▁f21": 63,
"▁f20": 64,
"▁f19": 65,
"▁f18": 66,
"▁f17": 67,
"▁f16": 68
}
{
"<pad>": 0,
"<unk>": 1,
"<s>": 2,
"</s>": 3,
"g0": 4,
"▁g9": 5,
"▁g8": 6,
"▁g7": 7,
"▁g6": 8,
"▁g5": 9,
"▁g4": 10,
"▁g3": 11,
"▁g2": 12,
"▁g15": 13,
"▁g14": 14,
"▁g13": 15,
"▁g12": 16,
"▁g11": 17,
"▁g10": 18,
"▁g1": 19,
"▁g0": 20,
"▁g63": 21,
"▁g62": 22,
"▁g61": 23,
"▁g60": 24,
"▁g59": 25,
"▁g58": 26,
"▁g57": 27,
"▁g56": 28,
"▁g55": 29,
"▁g54": 30,
"▁g53": 31,
"▁g52": 32,
"▁g51": 33,
"▁g50": 34,
"▁g49": 35,
"▁g48": 36,
"▁g47": 37,
"▁g46": 38,
"▁g45": 39,
"▁g44": 40,
"▁g43": 41,
"▁g42": 42,
"▁g41": 43,
"▁g40": 44,
"▁g39": 45,
"▁g38": 46,
"▁g37": 47,
"▁g36": 48,
"▁g35": 49,
"▁g34": 50,
"▁g33": 51,
"▁g32": 52,
"▁g31": 53,
"▁g30": 54,
"▁g29": 55,
"▁g28": 56,
"▁g27": 57,
"▁g26": 58,
"▁g25": 59,
"▁g24": 60,
"▁g23": 61,
"▁g22": 62,
"▁g21": 63,
"▁g20": 64,
"▁g19": 65,
"▁g18": 66,
"▁g17": 67,
"▁g16": 68
}
#!/usr/bin/python3
import sock
import translator
import sys
import html
import json
from time import time
from nltk import sent_tokenize
from constraints import getPolitenessConstraints as getCnstrs
from log import log
# IP and port for the server
MY_IP = 'localhost'
MY_PORT = 12346
supportedStyles = { "os", "un", "dg", "jr", "ep", "pc", "em", "nc" }
supportedOutLangs = { 'et', 'lv', 'en', 'ru', 'fi', 'lt', 'de' }
extraSupportedOutLangs = { 'est': 'et', 'lav': 'lv', 'eng': 'en', 'rus': 'ru', 'fin': 'fi', 'lit': 'lt', 'ger': 'de' }
defaultStyle = 'nc'
defaultOutLang = 'en'
USAGE_MSG = """\nUsage: nmtnazgul.py translation_model truecaser_model segmenter_model [output_lang [output_style]]
translation_model: path to a trained Sockeye model folder
truecaser_model: path to a trained TartuNLP truecaser model file
segmenter_model: path to a trained Google SentencePiece model file
Without the output language and any further parameters an NMT server is started; otherwise the script translates STDIN
output_lang: output language (one of the following: {0})
output_style: output style (one of the following: {1}; default: {2})
Further info: http://github.com/tartunlp/nazgul\n\n""".format(", ".join(list(supportedOutLangs)), ", ".join(list(supportedStyles)), defaultStyle)
#############################################################################################
###################################### STDIN and Server #####################################
#############################################################################################
def getConf(rawConf):
    style = defaultStyle
    outlang = defaultOutLang

    for field in rawConf.split(','):
        if field in supportedStyles:
            style = field
        if field in supportedOutLangs:
            outlang = field
        if field in extraSupportedOutLangs:
            outlang = extraSupportedOutLangs[field]

    return style, outlang
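For illustration, the `conf` parsing can be exercised on its own (a self-contained sketch; the sets mirror the definitions at the top of the file):

```python
supportedStyles = {"os", "un", "dg", "jr", "ep", "pc", "em", "nc"}
supportedOutLangs = {'et', 'lv', 'en', 'ru', 'fi', 'lt', 'de'}
extraSupportedOutLangs = {'est': 'et', 'lav': 'lv', 'eng': 'en',
                          'rus': 'ru', 'fin': 'fi', 'lit': 'lt', 'ger': 'de'}
defaultStyle = 'nc'
defaultOutLang = 'en'

def getConf(rawConf):
    # unknown fields are silently ignored; later fields win
    style, outlang = defaultStyle, defaultOutLang
    for field in rawConf.split(','):
        if field in supportedStyles:
            style = field
        if field in supportedOutLangs:
            outlang = field
        if field in extraSupportedOutLangs:
            outlang = extraSupportedOutLangs[field]
    return style, outlang

print(getConf("os,rus"))   # 3-letter language codes are mapped to 2-letter ones
print(getConf("garbage"))  # unknown fields fall back to the defaults
```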
def parseInput(rawText):
    global supportedStyles, defaultStyle, supportedOutLangs, defaultOutLang

    try:
        fullText = rawText['src']
        rawStyle, rawOutLang = getConf(rawText['conf'])

        livesubs = "|" in fullText

        sentences = fullText.split("|") if livesubs else sent_tokenize(fullText)
        delim = "|" if livesubs else " "
    except KeyError:
        sentences = rawText['sentences']
        rawStyle = rawText['outStyle']
        rawOutLang = rawText['outLang']
        delim = False

    if rawStyle not in supportedStyles:
        #raise ValueError("style bad: " + rawStyle)
        rawStyle = defaultStyle

    if rawOutLang not in supportedOutLangs:
        #raise ValueError("out lang bad: " + rawOutLang)
        rawOutLang = defaultOutLang

    outputLang = rawOutLang
    outputStyle = rawStyle

    return sentences, outputLang, outputStyle, delim
def decodeRequest(rawMessage):
    struct = json.loads(rawMessage.decode('utf-8'))
    segments, outputLang, outputStyle, delim = parseInput(struct)
    return segments, outputLang, outputStyle, delim
def encodeResponse(translationList, delim):
    translationText = delim.join(translationList)
    result = json.dumps({'raw_trans': ['-'],
                         'raw_input': ['-'],
                         'final_trans': translationText})
    return bytes(result, 'utf-8')
def serverTranslationFunc(rawMessage, models):
    segments, outputLang, outputStyle, delim = decodeRequest(rawMessage)
    translations, _, _, _ = translator.translate(models, segments, outputLang, outputStyle, getCnstrs())
    return encodeResponse(translations, delim)
def startTranslationServer(models, ip, port):
    log("started server")

    # start listening as a socket server; apply serverTranslationFunc to incoming messages to generate the response
    sock.startServer(serverTranslationFunc, (models,), port = port, host = ip)
def translateStdinInBatches(models, outputLang, outputStyle):
    """Read lines from STDIN and treat each as a segment to translate;
    translate them and print out tab-separated scores (decoder log-prob)
    and the translation outputs"""

    #read STDIN as a list of segments
    lines = [line.strip() for line in sys.stdin]

    #translate segments and get translations and scores
    translations, scores, _, _ = translator.translate(models, lines, outputLang, outputStyle, getCnstrs())

    #print each score and translation, separated with a tab
    for translation, score in zip(translations, scores):
        print("{0}\t{1}".format(score, translation))
#############################################################################################
################################## Cmdline and main block ###################################
#############################################################################################
def readCmdlineModels():
    """Read translation, truecaser and segmenter model paths from cmdline;
    show usage info if failed"""

    #This is a quick hack for reading cmdline args, should use argparse instead
    try:
        translationModelPath = sys.argv[1]
        truecaserModelPath = sys.argv[2]
        segmenterModelPath = sys.argv[3]
    except IndexError:
        sys.stderr.write(USAGE_MSG)
        sys.exit(-1)

    return translationModelPath, truecaserModelPath, segmenterModelPath
def readLangAndStyle():
    """Read output language and style off cmdline.
    Language is optional -- if not given, a server is started.
    Style is optional -- if not given, default (auto) is used."""

    # EAFP
    try:
        outputLanguage = sys.argv[4]