Noisy#
This tutorial is available as an IPython notebook at Malaya/example/noisy-translation.
This module was trained on both standard and local (including social media) language structures, so it is safe to use for both.
[1]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''
[2]:
%%time
import malaya
import logging
logging.basicConfig(level=logging.INFO)
/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
CPU times: user 2.86 s, sys: 3.84 s, total: 6.7 s
Wall time: 1.93 s
/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397
self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927
self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
List available HuggingFace models#
[3]:
malaya.translation.available_huggingface
[3]:
{'mesolitica/translation-t5-tiny-standard-bahasa-cased': {'Size (MB)': 139,
'Suggested length': 1536,
'en-ms chrF2++': 65.91,
'ms-en chrF2++': 61.3,
'ind-ms chrF2++': 58.15,
'jav-ms chrF2++': 49.33,
'pasar ms-ms chrF2++': 58.46,
'pasar ms-en chrF2++': 55.76,
'manglish-ms chrF2++': 51.04,
'manglish-en chrF2++': 52.2,
'from lang': ['en', 'ms', 'ind', 'jav', 'bjn', 'manglish', 'pasar ms'],
'to lang': ['en', 'ms']},
'mesolitica/translation-t5-small-standard-bahasa-cased': {'Size (MB)': 242,
'Suggested length': 1536,
'en-ms chrF2++': 67.37,
'ms-en chrF2++': 63.79,
'ind-ms chrF2++': 58.09,
'jav-ms chrF2++': 52.11,
'pasar ms-ms chrF2++': 62.49,
'pasar ms-en chrF2++': 60.77,
'manglish-ms chrF2++': 52.84,
'manglish-en chrF2++': 53.65,
'from lang': ['en', 'ms', 'ind', 'jav', 'bjn', 'manglish', 'pasar ms'],
'to lang': ['en', 'ms']},
'mesolitica/translation-t5-base-standard-bahasa-cased': {'Size (MB)': 892,
'Suggested length': 1536,
'en-ms chrF2++': 67.62,
'ms-en chrF2++': 64.41,
'ind-ms chrF2++': 59.25,
'jav-ms chrF2++': 52.86,
'pasar ms-ms chrF2++': 62.99,
'pasar ms-en chrF2++': 62.06,
'manglish-ms chrF2++': 54.4,
'manglish-en chrF2++': 54.14,
'from lang': ['en', 'ms', 'ind', 'jav', 'bjn', 'manglish', 'pasar ms'],
'to lang': ['en', 'ms']},
'mesolitica/translation-nanot5-tiny-malaysian-cased': {'Size (MB)': 205,
'Suggested length': 2048,
'en-ms chrF2++': 63.61,
'ms-en chrF2++': 59.55,
'ind-ms chrF2++': 56.38,
'jav-ms chrF2++': 47.68,
'mandarin-ms chrF2++': 36.61,
'mandarin-en chrF2++': 39.78,
'pasar ms-ms chrF2++': 58.74,
'pasar ms-en chrF2++': 54.87,
'manglish-ms chrF2++': 50.76,
'manglish-en chrF2++': 53.16,
'from lang': ['en',
'ms',
'ind',
'jav',
'bjn',
'manglish',
'pasar ms',
'mandarin',
'pasar mandarin'],
'to lang': ['en', 'ms']},
'mesolitica/translation-nanot5-small-malaysian-cased': {'Size (MB)': 358,
'Suggested length': 2048,
'en-ms chrF2++': 66.98,
'ms-en chrF2++': 63.52,
'ind-ms chrF2++': 58.1,
'jav-ms chrF2++': 51.55,
'mandarin-ms chrF2++': 46.09,
'mandarin-en chrF2++': 44.13,
'pasar ms-ms chrF2++': 63.2,
'pasar ms-en chrF2++': 59.78,
'manglish-ms chrF2++': 54.09,
'manglish-en chrF2++': 55.27,
'from lang': ['en',
'ms',
'ind',
'jav',
'bjn',
'manglish',
'pasar ms',
'mandarin',
'pasar mandarin'],
'to lang': ['en', 'ms']},
'mesolitica/translation-nanot5-base-malaysian-cased': {'Size (MB)': 990,
'Suggested length': 2048,
'en-ms chrF2++': 67.87,
'ms-en chrF2++': 64.79,
'ind-ms chrF2++': 56.98,
'jav-ms chrF2++': 51.21,
'mandarin-ms chrF2++': 47.39,
'mandarin-en chrF2++': 48.78,
'pasar ms-ms chrF2++': 65.06,
'pasar ms-en chrF2++': 64.03,
'manglish-ms chrF2++': 57.91,
'manglish-en chrF2++': 55.66,
'from lang': ['en',
'ms',
'ind',
'jav',
'bjn',
'manglish',
'pasar ms',
'mandarin',
'pasar mandarin'],
'to lang': ['en', 'ms']}}
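The metadata above makes it easy to pick a model programmatically, for example trading off size against a benchmark score. A minimal sketch, assuming the dict shape returned by `malaya.translation.available_huggingface` (a subset is inlined so the snippet is self-contained; the helper `best_under` is hypothetical, not part of Malaya):

```python
# Subset of the metadata returned by `malaya.translation.available_huggingface`.
models = {
    'mesolitica/translation-t5-tiny-standard-bahasa-cased': {
        'Size (MB)': 139, 'ms-en chrF2++': 61.3},
    'mesolitica/translation-nanot5-small-malaysian-cased': {
        'Size (MB)': 358, 'ms-en chrF2++': 63.52},
    'mesolitica/translation-nanot5-base-malaysian-cased': {
        'Size (MB)': 990, 'ms-en chrF2++': 64.79},
}

def best_under(models, metric='ms-en chrF2++', max_mb=500):
    # Keep only models under the size budget, then take the best score.
    candidates = {k: v for k, v in models.items() if v['Size (MB)'] <= max_mb}
    return max(candidates, key=lambda k: candidates[k][metric])

print(best_under(models))
```

With the default 500 MB budget this selects the nanot5-small model, which has the highest ms-en chrF2++ among the models that fit.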
[4]:
print(malaya.translation.info)
1. tested on FLORES200 pair `dev` set, https://github.com/huseinzol05/malay-dataset/tree/master/translation/flores200-eval
2. tested on noisy test set, https://github.com/huseinzol05/malay-dataset/tree/master/translation/noisy-eval
3. check out NLLB 200 metrics from `malaya.translation.nllb_metrics`.
4. check out Google Translate metrics from `malaya.translation.google_translate_metrics`.
Improvements of new model#
1. Able to translate [en, ms, ind, jav, bjn, manglish, pasar ms, mandarin, pasar mandarin], while the old model is only able to translate [en, ms, pasar ms].
2. No longer requires the from_lang part of the prefix.
3. Able to retain text structure as it is.
Load Transformer models#
def huggingface(
model: str = 'mesolitica/translation-t5-small-standard-bahasa-cased',
force_check: bool = True,
from_lang: List[str] = None,
to_lang: List[str] = None,
old_model: bool = False,
**kwargs,
):
"""
Load HuggingFace model to translate.
Parameters
----------
model: str, optional (default='mesolitica/translation-t5-small-standard-bahasa-cased')
Check available models at `malaya.translation.available_huggingface()`.
force_check: bool, optional (default=True)
    Force check that the model is one of the Malaya models.
    Set to False if you have your own huggingface model.
Returns
-------
result: malaya.torch_model.huggingface.Translation
"""
[5]:
model = malaya.translation.huggingface(model = 'mesolitica/translation-nanot5-small-malaysian-cased')
Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`, it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again. You will see the new `added_tokens_decoder` attribute that will store the relevant information.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Translate#
def generate(self, strings: List[str], to_lang: str = 'ms', **kwargs):
"""
Generate texts from the input.
Parameters
----------
strings : List[str]
to_lang: str, optional (default='ms')
target language to translate.
**kwargs: additional keyword arguments passed to the huggingface `generate` method.
    Read more at https://huggingface.co/docs/transformers/main_classes/text_generation
    If you are using `use_ctranslate2`, keyword arguments are passed to the ctranslate2 `translate_batch` method instead.
    Read more at https://opennmt.net/CTranslate2/python/ctranslate2.Translator.html?highlight=translate_batch#ctranslate2.Translator.translate_batch
Returns
-------
result: List[str]
"""
[6]:
from pprint import pprint
Noisy malay#
[7]:
strings = [
'ak tak paham la',
'Hi guys! I noticed semalam & harini dah ramai yang dapat cookies ni kan. So harini i nak share some post mortem of our first batch:',
"Memanglah. Ini tak payah expert, aku pun tau. It's a gesture, bodoh.",
'jam 8 di pasar KK memang org ramai 😂, pandai dia pilih tmpt.',
'Jadi haram jadah😀😃🤭',
'nak gi mana tuu',
'Macam nak ambil half day',
"Bayangkan PH dan menang pru-14. Pastu macam-macam pintu belakang ada. Last-last Ismail Sabri naik. That's why I don't give a fk about politics anymore. Sumpah dah fk up dah.",
'mesolitica boleh buat asr tak',
]
[8]:
%%time
pprint(model.generate(strings, to_lang = 'ms', max_length = 1000))
spaces_between_special_tokens is deprecated and will be removed in transformers v5. It was adding spaces between `added_tokens`, not special tokens, and does not exist in our fast implementation. Future tokenizers will handle the decoding process on a per-model rule.
['Saya tidak faham',
'Hi guys! Saya perasan semalam dan hari ini ramai yang dapat cookies ni kan. '
'Jadi hari ini saya ingin berkongsi beberapa post mortem dari batch pertama '
'kami:',
'Memanglah. Ini tidak perlu pakar, saya juga tahu. Ini adalah isyarat, bodoh.',
'Jam 8 di pasar KK memang ramai orang 😂, pandai dia pilih tempat.',
'Jadi haram jadah 😀😃🤭',
'Ke mana kamu pergi?',
'Saya ingin mengambil separuh hari',
'Bayangkan PH dan menang dalam PRU-14. Kemudian terdapat pelbagai pintu '
'belakang. Akhirnya, Ismail Sabri naik. Itulah sebabnya saya tidak lagi '
'peduli tentang politik. Saya bersumpah saya sudah pergi.',
'Bolehkah mesolitica digunakan untuk membuat asr?']
CPU times: user 34.7 s, sys: 47 ms, total: 34.8 s
Wall time: 2.94 s
[9]:
%%time
pprint(model.generate(strings, to_lang = 'en', max_length = 1000))
["I don't understand",
'Hi guys! I noticed that many people have received cookies yesterday and '
'today. So today I want to share some post mortem of our first batch:',
"Indeed. No need for an expert, I know. It's a gesture, stupid.",
"At 8 o'clock in the KK market, it's crowded 😂, he's clever in choosing a "
'place.',
"So it's illegal😀😃🤭",
'Where are you going?',
'How to take half a day',
'Imagine PH and winning the 14th general election. Then there are all sorts '
"of backgazes. In the end, Ismail Sabri got in. That's why I don't care about "
"politics anymore. I swear I'm already fucked up.",
'Can the mesolitica make Asr?']
CPU times: user 51.6 s, sys: 89.9 ms, total: 51.7 s
Wall time: 4.41 s
Manglish#
[10]:
strings = [
'i know plenty of people who snack on sambal ikan bilis.',
'I often visualize my own programs algorithm before implemment it.',
'Am I the only one who used their given name ever since I was a kid?',
'Gotta be wary of pimples. Oh they bleed bad when cut',
'Smh the dude literally has a rubbish bin infront of his house',
"I think I won't be able to catch it within 1 min lol"
]
[11]:
%%time
pprint(model.generate(strings, to_lang = 'ms', max_length = 1000))
['Saya kenal ramai orang yang makan sambal ikan bilis.',
'Saya sering memvisualisasikan algoritma program saya sendiri sebelum '
'mengimplemmennya.',
'Adakah saya seorang sahaja yang menggunakan nama mereka sejak saya masih '
'kecil?',
'Kena berhati-hati dengan jerawat. Oh, mereka berdarah teruk apabila dipotong',
'Sial, lelaki itu benar-benar mempunyai tong sampah di depan rumahnya.',
'Saya rasa saya tidak akan dapat menangkapnya dalam masa 1 minit lol']
CPU times: user 6.07 s, sys: 9.2 ms, total: 6.08 s
Wall time: 519 ms
[12]:
%%time
pprint(model.generate(strings, to_lang = 'en', max_length = 1000))
['I know a lot of people who take snacks on sambal ikan bilis.',
'I often visualize my own program algorithm before impersonating it.',
'Am I the only one who has used their given name ever since I was a child?',
'You need to be cautious of pimples. Oh, they bleed badly when cut.',
'Oh my, the man is literally in a rubbish bin in front of his house.',
"I don't think I can catch it within 1 minute, haha"]
CPU times: user 8.09 s, sys: 4.09 ms, total: 8.1 s
Wall time: 685 ms
Local Mandarin#
[17]:
strings = [
'某个角度漂亮,但我觉得不是很耐看。',
'就是暂时好看的意思咯?',
'i think, 有狐狸般的妖媚,确实是第一人选。'
]
[18]:
%%time
pprint(model.generate(strings, to_lang = 'ms', max_length = 1000))
['Sudut yang cantik, tetapi saya rasa tidak begitu menarik.',
'Adakah ini bermaksud untuk sementara kelihatan cantik?',
'Saya rasa, mempunyai gadis-gadis yang sangat cantik dan cantik memang '
'menjadi pilihan pertama.']
CPU times: user 4.25 s, sys: 2.94 ms, total: 4.26 s
Wall time: 364 ms
[19]:
%%time
pprint(model.generate(strings, to_lang = 'en', max_length = 1000))
["A certain angle is beautiful, but I don't think it's very durable.",
'Is it just for now good-looking?',
'I believe that having a fox-like and cute demeanor is indeed the first '
'choice.']
CPU times: user 4.5 s, sys: 0 ns, total: 4.5 s
Wall time: 378 ms