larryvrh/mt5-translation-ja_zh

text2text-generation


larryvrh committed on Apr 11, 2023

Commit a6a4d59 · 1 Parent(s): 51bf590

Update README.md

Files changed (1)

README.md +45 -1


@@ -18,4 +18,48 @@ pipeline_tag: translation
 This is the finetuned version of [google/mt5-large](https://huggingface.co/google/mt5-large) for translating Japanese into Simplified Chinese.
-Trained for 1 epoch on 5680000 samples from [CCMatrix-v1-Ja_Zh-filtered](https://huggingface.co/datasets/larryvrh/CCMatrix-v1-Ja_Zh-filtered) and 690095 samples from [WikiMatrix-v1-Ja_Zh-filtered](https://huggingface.co/datasets/larryvrh/WikiMatrix-v1-Ja_Zh-filtered).

+Trained for 1 epoch on 5680000 samples from [CCMatrix-v1-Ja_Zh-filtered](https://huggingface.co/datasets/larryvrh/CCMatrix-v1-Ja_Zh-filtered) and 690095 samples from [WikiMatrix-v1-Ja_Zh-filtered](https://huggingface.co/datasets/larryvrh/WikiMatrix-v1-Ja_Zh-filtered).
+This model was trained on sentence pairs with a maximum sequence length of 128 (max seq_len=128), so split each document into individual sentences before inference to avoid degraded translation quality.
+---
+Demo Usage
+```python
+from transformers import pipeline
+import re
+pipe = pipeline(model="larryvrh/mt5-translation-ja_zh")
+def translate_sentence(sentence):
+    return pipe(f'<-ja2zh-> {sentence}')[0]['translation_text']
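+def translate_paragraph(paragraph):
+    # Split after each '。' so every model input stays a single short sentence.
+    sentences = []
+    cursor = 0
+    for i, c in enumerate(paragraph):
+        if c == '。':
+            sentences.append(paragraph[cursor:i + 1])
+            cursor = i + 1
+    # Keep trailing text that has no final '。'; unlike paragraph[-1], this is safe for an empty string.
+    if cursor < len(paragraph):
+        sentences.append(paragraph[cursor:])
+    return ''.join(translate_sentence(s) for s in sentences)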
+def translate_article(article):
+    paragraphs = re.split(r'([\r\n]+)', article)
+    for i, p in enumerate(paragraphs):
+        if len(p.strip()) == 0:
+            continue
+        paragraphs[i] = translate_paragraph(p)
+    return ''.join(paragraphs)
+article = '''文は、「主語・修飾語・述語」の語順で構成される。修飾語は被修飾語の前に位置する。また、名詞の格を示すためには、語順や語尾を変化させるのでなく、文法的な機能を示す機能語（助詞）を後ろに付け加える（膠着させる）。これらのことから、言語類型論上は、語順の点ではSOV型の言語に、形態の点では膠着語に分類される（「文法」の節参照）。
+語彙は、古来の大和言葉（和語）のほか、漢語（字音語）、外来語、および、それらの混ざった混種語に分けられる。字音語（漢字の音読みに由来する語の意、一般に「漢語」と称する）は現代の語彙の一部分を占めている。また、「絵/画（ゑ）」など、もともと音であるが和語と認識されているものもある。さらに近代以降には西洋由来の語を中心とする外来語が増大している（「語種」の節参照）。'''
+print(translate_article(article))
+```
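
The demo above splits paragraphs on '。' only, so a sentence ending in '！' or '？' stays attached to the text that follows it and can push a chunk toward the 128-token limit. The sketch below is one possible alternative splitter, not part of the committed README: the name `split_sentences` and the choice of terminators (。！？) are assumptions made for illustration.

```python
import re

# Assumption: '。', '！' and '？' all end a sentence.
# The committed demo treats only '。' as a terminator.
_SENTENCE = re.compile(r'[^。！？]+[。！？]*|[。！？]+')

def split_sentences(paragraph):
    """Split a paragraph into sentences, keeping each terminator attached.

    Consecutive terminators (e.g. '！？') stay with the sentence they close,
    and joining the result reproduces the input exactly.
    """
    return _SENTENCE.findall(paragraph)

if __name__ == '__main__':
    for s in split_sentences('今日は晴れ。明日は？雨！たぶん'):
        print(s)
```

Each returned piece could then be passed to `translate_sentence` in place of the chunks built inside `translate_paragraph`.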