Noisy#

This tutorial is available as an IPython notebook at Malaya/example/noisy-translation.

This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.

[1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''
[2]:
%%time

import malaya
import logging

logging.basicConfig(level=logging.INFO)
/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
CPU times: user 2.86 s, sys: 3.84 s, total: 6.7 s
Wall time: 1.93 s
/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))

List available HuggingFace models#

[3]:
malaya.translation.available_huggingface
[3]:
{'mesolitica/translation-t5-tiny-standard-bahasa-cased': {'Size (MB)': 139,
  'Suggested length': 1536,
  'en-ms chrF2++': 65.91,
  'ms-en chrF2++': 61.3,
  'ind-ms chrF2++': 58.15,
  'jav-ms chrF2++': 49.33,
  'pasar ms-ms chrF2++': 58.46,
  'pasar ms-en chrF2++': 55.76,
  'manglish-ms chrF2++': 51.04,
  'manglish-en chrF2++': 52.2,
  'from lang': ['en', 'ms', 'ind', 'jav', 'bjn', 'manglish', 'pasar ms'],
  'to lang': ['en', 'ms']},
 'mesolitica/translation-t5-small-standard-bahasa-cased': {'Size (MB)': 242,
  'Suggested length': 1536,
  'en-ms chrF2++': 67.37,
  'ms-en chrF2++': 63.79,
  'ind-ms chrF2++': 58.09,
  'jav-ms chrF2++': 52.11,
  'pasar ms-ms chrF2++': 62.49,
  'pasar ms-en chrF2++': 60.77,
  'manglish-ms chrF2++': 52.84,
  'manglish-en chrF2++': 53.65,
  'from lang': ['en', 'ms', 'ind', 'jav', 'bjn', 'manglish', 'pasar ms'],
  'to lang': ['en', 'ms']},
 'mesolitica/translation-t5-base-standard-bahasa-cased': {'Size (MB)': 892,
  'Suggested length': 1536,
  'en-ms chrF2++': 67.62,
  'ms-en chrF2++': 64.41,
  'ind-ms chrF2++': 59.25,
  'jav-ms chrF2++': 52.86,
  'pasar ms-ms chrF2++': 62.99,
  'pasar ms-en chrF2++': 62.06,
  'manglish-ms chrF2++': 54.4,
  'manglish-en chrF2++': 54.14,
  'from lang': ['en', 'ms', 'ind', 'jav', 'bjn', 'manglish', 'pasar ms'],
  'to lang': ['en', 'ms']},
 'mesolitica/translation-nanot5-tiny-malaysian-cased': {'Size (MB)': 205,
  'Suggested length': 2048,
  'en-ms chrF2++': 63.61,
  'ms-en chrF2++': 59.55,
  'ind-ms chrF2++': 56.38,
  'jav-ms chrF2++': 47.68,
  'mandarin-ms chrF2++': 36.61,
  'mandarin-en chrF2++': 39.78,
  'pasar ms-ms chrF2++': 58.74,
  'pasar ms-en chrF2++': 54.87,
  'manglish-ms chrF2++': 50.76,
  'manglish-en chrF2++': 53.16,
  'from lang': ['en',
   'ms',
   'ind',
   'jav',
   'bjn',
   'manglish',
   'pasar ms',
   'mandarin',
   'pasar mandarin'],
  'to lang': ['en', 'ms']},
 'mesolitica/translation-nanot5-small-malaysian-cased': {'Size (MB)': 358,
  'Suggested length': 2048,
  'en-ms chrF2++': 66.98,
  'ms-en chrF2++': 63.52,
  'ind-ms chrF2++': 58.1,
  'jav-ms chrF2++': 51.55,
  'mandarin-ms chrF2++': 46.09,
  'mandarin-en chrF2++': 44.13,
  'pasar ms-ms chrF2++': 63.2,
  'pasar ms-en chrF2++': 59.78,
  'manglish-ms chrF2++': 54.09,
  'manglish-en chrF2++': 55.27,
  'from lang': ['en',
   'ms',
   'ind',
   'jav',
   'bjn',
   'manglish',
   'pasar ms',
   'mandarin',
   'pasar mandarin'],
  'to lang': ['en', 'ms']},
 'mesolitica/translation-nanot5-base-malaysian-cased': {'Size (MB)': 990,
  'Suggested length': 2048,
  'en-ms chrF2++': 67.87,
  'ms-en chrF2++': 64.79,
  'ind-ms chrF2++': 56.98,
  'jav-ms chrF2++': 51.21,
  'mandarin-ms chrF2++': 47.39,
  'mandarin-en chrF2++': 48.78,
  'pasar ms-ms chrF2++': 65.06,
  'pasar ms-en chrF2++': 64.03,
  'manglish-ms chrF2++': 57.91,
  'manglish-en chrF2++': 55.66,
  'from lang': ['en',
   'ms',
   'ind',
   'jav',
   'bjn',
   'manglish',
   'pasar ms',
   'mandarin',
   'pasar mandarin'],
  'to lang': ['en', 'ms']}}
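
Since `malaya.translation.available_huggingface` is a plain dict keyed by model name, you can also pick a checkpoint programmatically. A minimal sketch that selects the best noisy `pasar ms-en` scorer under an arbitrary 400 MB size budget:

models = malaya.translation.available_huggingface
# keep models under an (arbitrary) 400 MB budget that report a noisy `pasar ms-en` score
candidates = {
    name: meta for name, meta in models.items()
    if meta['Size (MB)'] < 400 and 'pasar ms-en chrF2++' in meta
}
# pick the candidate with the highest pasar ms-en chrF2++ score
best = max(candidates, key=lambda name: candidates[name]['pasar ms-en chrF2++'])
print(best)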
[4]:
print(malaya.translation.info)
1. tested on FLORES200 pair `dev` set, https://github.com/huseinzol05/malay-dataset/tree/master/translation/flores200-eval
2. tested on noisy test set, https://github.com/huseinzol05/malay-dataset/tree/master/translation/noisy-eval
3. check out NLLB 200 metrics from `malaya.translation.nllb_metrics`.
4. check out Google Translate metrics from `malaya.translation.google_translate_metrics`.

Improvements of the new model#

  1. Able to translate [en, ms, ind, jav, bjn, manglish, pasar ms, mandarin, pasar mandarin], while the old model could only translate [en, ms, pasar ms].

  2. No longer requires `from_lang` as part of the prefix.

  3. Able to retain the text structure as it is.

Load Transformer models#

def huggingface(
    model: str = 'mesolitica/translation-t5-small-standard-bahasa-cased',
    force_check: bool = True,
    from_lang: List[str] = None,
    to_lang: List[str] = None,
    old_model: bool = False,
    **kwargs,
):
    """
    Load HuggingFace model to translate.

    Parameters
    ----------
    model: str, optional (default='mesolitica/translation-t5-small-standard-bahasa-cased')
        Check available models at `malaya.translation.available_huggingface`.
    force_check: bool, optional (default=True)
        Force check that the model is one of the malaya models.
        Set to False if you have your own huggingface model.

    Returns
    -------
    result: malaya.torch_model.huggingface.Translation
    """
[5]:
model = malaya.translation.huggingface(model = 'mesolitica/translation-nanot5-small-malaysian-cased')
Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`,  it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again. You will see the new `added_tokens_decoder` attribute that will store the relevant information.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Translate#

def generate(self, strings: List[str], to_lang: str = 'ms', **kwargs):
    """
    Generate texts from the input.

    Parameters
    ----------
    strings : List[str]
    to_lang: str, optional (default='ms')
        target language to translate to.
    **kwargs: additional keyword arguments passed to the huggingface `generate` method.
        Read more at https://huggingface.co/docs/transformers/main_classes/text_generation

        If you are using `use_ctranslate2`, keyword arguments are passed to the ctranslate2 `translate_batch` method instead.
        Read more at https://opennmt.net/CTranslate2/python/ctranslate2.Translator.html?highlight=translate_batch#ctranslate2.Translator.translate_batch

    Returns
    -------
    result: List[str]
    """
[6]:
from pprint import pprint

Noisy Malay#

[7]:
strings = [
    'ak tak paham la',
    'Hi guys! I noticed semalam & harini dah ramai yang dapat cookies ni kan. So harini i nak share some post mortem of our first batch:',
    "Memanglah. Ini tak payah expert, aku pun tau. It's a gesture, bodoh.",
    'jam 8 di pasar KK memang org ramai 😂, pandai dia pilih tmpt.',
    'Jadi haram jadah😀😃🤭',
    'nak gi mana tuu',
    'Macam nak ambil half day',
    "Bayangkan PH dan menang pru-14. Pastu macam-macam pintu belakang ada. Last-last Ismail Sabri naik. That's why I don't give a fk about politics anymore. Sumpah dah fk up dah.",
    'mesolitica boleh buat asr tak',
]
[8]:
%%time

pprint(model.generate(strings, to_lang = 'ms', max_length = 1000))
spaces_between_special_tokens is deprecated and will be removed in transformers v5. It was adding spaces between `added_tokens`, not special tokens, and does not exist in our fast implementation. Future tokenizers will handle the decoding process on a per-model rule.
['Saya tidak faham',
 'Hi guys! Saya perasan semalam dan hari ini ramai yang dapat cookies ni kan. '
 'Jadi hari ini saya ingin berkongsi beberapa post mortem dari batch pertama '
 'kami:',
 'Memanglah. Ini tidak perlu pakar, saya juga tahu. Ini adalah isyarat, bodoh.',
 'Jam 8 di pasar KK memang ramai orang 😂, pandai dia pilih tempat.',
 'Jadi haram jadah 😀😃🤭',
 'Ke mana kamu pergi?',
 'Saya ingin mengambil separuh hari',
 'Bayangkan PH dan menang dalam PRU-14. Kemudian terdapat pelbagai pintu '
 'belakang. Akhirnya, Ismail Sabri naik. Itulah sebabnya saya tidak lagi '
 'peduli tentang politik. Saya bersumpah saya sudah pergi.',
 'Bolehkah mesolitica digunakan untuk membuat asr?']
CPU times: user 34.7 s, sys: 47 ms, total: 34.8 s
Wall time: 2.94 s
[9]:
%%time

pprint(model.generate(strings, to_lang = 'en', max_length = 1000))
["I don't understand",
 'Hi guys! I noticed that many people have received cookies yesterday and '
 'today. So today I want to share some post mortem of our first batch:',
 "Indeed. No need for an expert, I know. It's a gesture, stupid.",
 "At 8 o'clock in the KK market, it's crowded 😂, he's clever in choosing a "
 'place.',
 "So it's illegal😀😃🤭",
 'Where are you going?',
 'How to take half a day',
 'Imagine PH and winning the 14th general election. Then there are all sorts '
 "of backgazes. In the end, Ismail Sabri got in. That's why I don't care about "
 "politics anymore. I swear I'm already fucked up.",
 'Can the mesolitica make Asr?']
CPU times: user 51.6 s, sys: 89.9 ms, total: 51.7 s
Wall time: 4.41 s
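
Each model also reports a `Suggested length` (2048 for the nanot5 checkpoints), so very long inputs are best split into smaller pieces before translation. A rough sketch using a naive paragraph split (`article.txt` is a hypothetical input; each paragraph is assumed to fit within the suggested length):

with open('article.txt') as fopen:  # hypothetical long document
    long_text = fopen.read()

# naive split on blank lines, dropping empty chunks
paragraphs = [p for p in long_text.split('\n\n') if p.strip()]
translated = model.generate(paragraphs, to_lang = 'ms', max_length = 2048)
print('\n\n'.join(translated))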

Manglish#

[10]:
strings = [
    'i know plenty of people who snack on sambal ikan bilis.',
    'I often visualize my own programs algorithm before implemment it.',
    'Am I the only one who used their given name ever since I was a kid?',
    'Gotta be wary of pimples. Oh they bleed bad when cut',
    'Smh the dude literally has a rubbish bin infront of his house',
    "I think I won't be able to catch it within 1 min lol"
]
[11]:
%%time

pprint(model.generate(strings, to_lang = 'ms', max_length = 1000))
['Saya kenal ramai orang yang makan sambal ikan bilis.',
 'Saya sering memvisualisasikan algoritma program saya sendiri sebelum '
 'mengimplemmennya.',
 'Adakah saya seorang sahaja yang menggunakan nama mereka sejak saya masih '
 'kecil?',
 'Kena berhati-hati dengan jerawat. Oh, mereka berdarah teruk apabila dipotong',
 'Sial, lelaki itu benar-benar mempunyai tong sampah di depan rumahnya.',
 'Saya rasa saya tidak akan dapat menangkapnya dalam masa 1 minit lol']
CPU times: user 6.07 s, sys: 9.2 ms, total: 6.08 s
Wall time: 519 ms
[12]:
%%time

pprint(model.generate(strings, to_lang = 'en', max_length = 1000))
['I know a lot of people who take snacks on sambal ikan bilis.',
 'I often visualize my own program algorithm before impersonating it.',
 'Am I the only one who has used their given name ever since I was a child?',
 'You need to be cautious of pimples. Oh, they bleed badly when cut.',
 'Oh my, the man is literally in a rubbish bin in front of his house.',
 "I don't think I can catch it within 1 minute, haha"]
CPU times: user 8.09 s, sys: 4.09 ms, total: 8.1 s
Wall time: 685 ms

Local Mandarin#

[17]:
strings = [
    '某个角度漂亮,但我觉得不是很耐看。',
    '就是暂时好看的意思咯?',
    'i think, 有狐狸般的妖媚,确实是第一人选。'
]
[18]:
%%time

pprint(model.generate(strings, to_lang = 'ms', max_length = 1000))
['Sudut yang cantik, tetapi saya rasa tidak begitu menarik.',
 'Adakah ini bermaksud untuk sementara kelihatan cantik?',
 'Saya rasa, mempunyai gadis-gadis yang sangat cantik dan cantik memang '
 'menjadi pilihan pertama.']
CPU times: user 4.25 s, sys: 2.94 ms, total: 4.26 s
Wall time: 364 ms
[19]:
%%time

pprint(model.generate(strings, to_lang = 'en', max_length = 1000))
["A certain angle is beautiful, but I don't think it's very durable.",
 'Is it just for now good-looking?',
 'I believe that having a fox-like and cute demeanor is indeed the first '
 'choice.']
CPU times: user 4.5 s, sys: 0 ns, total: 4.5 s
Wall time: 378 ms
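
Note that `mandarin` only appears in `from lang` for the nanot5 `malaysian` checkpoints, so Mandarin input needs one of those models, like the one loaded above. A quick way to list them:

# list checkpoints that accept mandarin as a source language
for name, meta in malaya.translation.available_huggingface.items():
    if 'mandarin' in meta['from lang']:
        print(name)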