MS to EN Noisy HuggingFace#

This tutorial is available as an IPython notebook at Malaya/example/noisy-ms-en-translation-huggingface.

This module was trained on standard language and augmented local language structures; proceed with caution.

[1]:
%%time

import malaya
import logging

logging.basicConfig(level=logging.INFO)
CPU times: user 3.25 s, sys: 3.34 s, total: 6.59 s
Wall time: 2.44 s

List available HuggingFace models#

[2]:
malaya.translation.ms_en.available_huggingface()
INFO:malaya.translation.ms_en:tested on FLORES200 EN-MS (eng_Latn-zsm_Latn) pair `dev` set, https://github.com/facebookresearch/flores/tree/main/flores200
INFO:malaya.translation.ms_en:for noisy, tested on noisy twitter google translation, https://huggingface.co/datasets/mesolitica/augmentation-test-set
[2]:
Model                                                                      Size (MB)  BLEU       SacreBLEU Verbose                                  SacreBLEU-chrF++-FLORES200  Suggested length
mesolitica/finetune-translation-t5-super-super-tiny-standard-bahasa-cased  23.3       30.216144  64.9/38.1/24.1/15.3 (BP = 0.978 ratio = 0.978 ...  56.46                       256
mesolitica/finetune-translation-t5-super-tiny-standard-bahasa-cased        50.7       34.105615  67.3/41.6/27.8/18.7 (BP = 0.982 ratio = 0.982 ...  59.18                       256
mesolitica/finetune-translation-t5-tiny-standard-bahasa-cased              139        37.260485  68.3/44.1/30.5/21.4 (BP = 0.995 ratio = 0.995 ...  61.29                       256
mesolitica/finetune-translation-t5-small-standard-bahasa-cased             242        42.010218  71.7/49.0/35.6/26.1 (BP = 0.989 ratio = 0.989 ...  64.67                       256
mesolitica/finetune-translation-t5-base-standard-bahasa-cased              892        43.408853  72.3/50.5/37.1/27.7 (BP = 0.987 ratio = 0.987 ...  65.44                       256
mesolitica/finetune-noisy-translation-t5-tiny-bahasa-cased                 139        39.725134  69.8/46.2/32.8/23.6 (BP = 0.999 ratio = 0.999 ...  None                        256
mesolitica/finetune-noisy-translation-t5-small-bahasa-cased                242        41.834071  71.7/48.7/35.4/26.0 (BP = 0.989 ratio = 0.989 ...  None                        256
mesolitica/finetune-noisy-translation-t5-base-bahasa-cased                 242        43.432723  71.8/49.8/36.6/27.2 (BP = 1.000 ratio = 1.000 ...  None                        256
mesolitica/finetune-noisy-translation-t5-tiny-bahasa-cased-v2              139        41.625536  73.4/50.1/35.7/25.7 (BP = 0.971 ratio = 0.972 ...  None                        256
mesolitica/finetune-noisy-translation-t5-small-bahasa-cased-v4             242        41.625536  73.4/50.1/35.7/25.7 (BP = 0.971 ratio = 0.972 ...  None                        256
mesolitica/finetune-noisy-translation-t5-base-bahasa-cased-v2              892        41.625536  73.4/50.1/35.7/25.7 (BP = 0.971 ratio = 0.972 ...  None                        256
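
The size and BLEU columns can guide model choice. The sketch below is only an illustration: `STANDARD_MODELS` and `pick_model` are hypothetical names, not part of Malaya, and the rows are copied by hand from the listing above. According to the log lines, the noisy models are scored on a different test set, so this example compares models within the standard family only.

```python
# Rows copied by hand from the listing above: (model name, size in MB, BLEU).
# Only the standard family is included because the noisy models are
# evaluated on a different test set, so their BLEU is not comparable.
STANDARD_MODELS = [
    ("mesolitica/finetune-translation-t5-super-super-tiny-standard-bahasa-cased", 23.3, 30.216144),
    ("mesolitica/finetune-translation-t5-super-tiny-standard-bahasa-cased", 50.7, 34.105615),
    ("mesolitica/finetune-translation-t5-tiny-standard-bahasa-cased", 139, 37.260485),
    ("mesolitica/finetune-translation-t5-small-standard-bahasa-cased", 242, 42.010218),
    ("mesolitica/finetune-translation-t5-base-standard-bahasa-cased", 892, 43.408853),
]


def pick_model(models, max_size_mb):
    """Return the highest-BLEU model name that fits the size budget, or None."""
    candidates = [m for m in models if m[1] <= max_size_mb]
    if not candidates:
        return None
    # Each row is (name, size, bleu); rank by BLEU and return the name.
    return max(candidates, key=lambda m: m[2])[0]


# Example: the best standard model that fits in 250 MB.
print(pick_model(STANDARD_MODELS, 250))
```

The returned name can then be passed as the `model` argument when loading a model.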

Load Transformer models#

def huggingface(
    model: str = 'mesolitica/finetune-translation-t5-small-standard-bahasa-cased',
    force_check: bool = True,
    **kwargs,
):
    """
    Load HuggingFace model to translate MS-to-EN.

    Parameters
    ----------
    model: str, optional (default='mesolitica/finetune-translation-t5-small-standard-bahasa-cased')
        Check available models at `malaya.translation.ms_en.available_huggingface()`.
    force_check: bool, optional (default=True)
        Check that the model is one of the Malaya models.
        Set to False if you have your own huggingface model.

    Returns
    -------
    result: malaya.torch_model.huggingface.Generator
    """
[4]:
transformer = malaya.translation.ms_en.huggingface(model = 'mesolitica/finetune-translation-t5-small-standard-bahasa-cased')
[5]:
transformer_noisy = malaya.translation.ms_en.huggingface(model = 'mesolitica/finetune-noisy-translation-t5-small-bahasa-cased-v4')

Translate#

def generate(self, strings: List[str], **kwargs):
    """
    Generate texts from the input.

    Parameters
    ----------
    strings : List[str]
    **kwargs: keyword arguments passed to the HuggingFace `generate` method.

    Returns
    -------
    result: List[str]
    """

For better results, always split the input at sentence boundaries and translate one sentence per string.
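
One way to follow this advice is sketched below. `split_sentences` and `translate_by_sentence` are hypothetical helpers, not part of Malaya, and the regex splitter is deliberately naive: it breaks after `.`, `!` or `?` followed by whitespace, so abbreviations that end in a period would be split incorrectly. The `translate` argument is assumed to be any callable that maps a list of strings to a list of strings.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]


def translate_by_sentence(
    text: str,
    translate: Callable[[List[str]], List[str]],
) -> str:
    """Translate a paragraph sentence by sentence and join the results."""
    sentences = split_sentences(text)
    if not sentences:
        return ''
    # All sentences go to the translator as one batch.
    return ' '.join(translate(sentences))
```

With a model loaded as in the previous section, a call might look like `translate_by_sentence(string_news1, lambda s: transformer_noisy.generate(s, max_length = 256))`, where 256 follows the suggested length in the model listing.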

[6]:
from pprint import pprint
[7]:
# https://www.sinarharian.com.my/article/89678/BERITA/Politik/Saya-tidak-mahu-sentuh-isu-politik-Muhyiddin

string_news1 = 'TANGKAK - Tan Sri Muhyiddin Yassin berkata, beliau tidak mahu menyentuh mengenai isu politik buat masa ini, sebaliknya mahu menumpukan kepada soal kebajikan rakyat serta usaha merancakkan semula ekonomi negara yang terjejas berikutan pandemik Covid-19. Perdana Menteri menjelaskan perkara itu ketika berucap pada Majlis Bertemu Pemimpin bersama pemimpin masyarakat Dewan Undangan Negeri (DUN) Gambir di Dewan Serbaguna Bukit Gambir hari ini.'
pprint(string_news1)
('TANGKAK - Tan Sri Muhyiddin Yassin berkata, beliau tidak mahu menyentuh '
 'mengenai isu politik buat masa ini, sebaliknya mahu menumpukan kepada soal '
 'kebajikan rakyat serta usaha merancakkan semula ekonomi negara yang terjejas '
 'berikutan pandemik Covid-19. Perdana Menteri menjelaskan perkara itu ketika '
 'berucap pada Majlis Bertemu Pemimpin bersama pemimpin masyarakat Dewan '
 'Undangan Negeri (DUN) Gambir di Dewan Serbaguna Bukit Gambir hari ini.')
[8]:
# https://www.sinarharian.com.my/article/90021/BERITA/Politik/Tun-Mahathir-Anwar-disaran-bersara-untuk-selesai-kemelut-politik

string_news2 = 'ALOR SETAR - Kemelut politik Pakatan Harapan (PH) belum berkesudahan apabila masih gagal memuktamadkan calon Perdana Menteri yang dipersetujui bersama. Ahli Parlimen Sik, Ahmad Tarmizi Sulaiman berkata, sehubungan itu pihaknya mencadangkan mantan Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu), Tun Dr Mahathir Mohamad dan Presiden Parti Keadilan Rakyat (PKR), Datuk Seri Anwar Ibrahim mengundurkan diri daripada politik sebagai jalan penyelesaian.'
pprint(string_news2)
('ALOR SETAR - Kemelut politik Pakatan Harapan (PH) belum berkesudahan apabila '
 'masih gagal memuktamadkan calon Perdana Menteri yang dipersetujui bersama. '
 'Ahli Parlimen Sik, Ahmad Tarmizi Sulaiman berkata, sehubungan itu pihaknya '
 'mencadangkan mantan Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu), Tun '
 'Dr Mahathir Mohamad dan Presiden Parti Keadilan Rakyat (PKR), Datuk Seri '
 'Anwar Ibrahim mengundurkan diri daripada politik sebagai jalan penyelesaian.')
[9]:
string_news3 = 'Menteri Kanan (Kluster Keselamatan) Datuk Seri Ismail Sabri Yaakob berkata, kelonggaran itu diberi berikutan kerajaan menyedari masalah yang dihadapi mereka untuk memperbaharui dokumen itu. Katanya, selain itu, bagi rakyat asing yang pas lawatan sosial tamat semasa Perintah Kawalan Pergerakan (PKP) pula boleh ke pejabat Jabatan Imigresen yang terdekat untuk mendapatkan lanjutan tempoh.'
pprint(string_news3)
('Menteri Kanan (Kluster Keselamatan) Datuk Seri Ismail Sabri Yaakob berkata, '
 'kelonggaran itu diberi berikutan kerajaan menyedari masalah yang dihadapi '
 'mereka untuk memperbaharui dokumen itu. Katanya, selain itu, bagi rakyat '
 'asing yang pas lawatan sosial tamat semasa Perintah Kawalan Pergerakan (PKP) '
 'pula boleh ke pejabat Jabatan Imigresen yang terdekat untuk mendapatkan '
 'lanjutan tempoh.')
[10]:
# https://qcikgubm.blogspot.com/2018/02/contoh-soalan-dan-jawapan-karangan.html

string_karangan = 'Selain itu, pameran kerjaya membantu para pelajar menentukan kerjaya yang akan diceburi oleh mereka. Seperti yang kita ketahui, pasaran kerjaya di Malaysia sangat luas dan masih banyak sektor pekerjaan di negara ini yang masih kosong kerana sukar untuk mencari tenaga kerja yang benar-benar berkelayakan. Sebagai contohnya, sektor perubatan di Malaysia menghadapi masalah kekurangan tenaga kerja yang kritikal, khususnya tenaga pakar disebabkan peletakan jawatan oleh doktor dan pakar perubatan untuk memasuki sektor swasta serta berkembangnya perkhidmatan kesihatan dan perubatan. Setelah menyedari  hakikat ini, para pelajar akan lebih berminat untuk menceburi bidang perubatan kerana pameran kerjaya yang dilaksanakan amat membantu memberikan pengetahuan am tentang kerjaya ini'
pprint(string_karangan)
('Selain itu, pameran kerjaya membantu para pelajar menentukan kerjaya yang '
 'akan diceburi oleh mereka. Seperti yang kita ketahui, pasaran kerjaya di '
 'Malaysia sangat luas dan masih banyak sektor pekerjaan di negara ini yang '
 'masih kosong kerana sukar untuk mencari tenaga kerja yang benar-benar '
 'berkelayakan. Sebagai contohnya, sektor perubatan di Malaysia menghadapi '
 'masalah kekurangan tenaga kerja yang kritikal, khususnya tenaga pakar '
 'disebabkan peletakan jawatan oleh doktor dan pakar perubatan untuk memasuki '
 'sektor swasta serta berkembangnya perkhidmatan kesihatan dan perubatan. '
 'Setelah menyedari  hakikat ini, para pelajar akan lebih berminat untuk '
 'menceburi bidang perubatan kerana pameran kerjaya yang dilaksanakan amat '
 'membantu memberikan pengetahuan am tentang kerjaya ini')
[11]:
%%time

pprint(transformer_noisy.generate([string_news1, string_news2, string_news3, string_karangan],
                                 max_length = 1000))
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
['TANGKAK - Tan Sri Muhyiddin Yassin said, he does not want to touch on '
 'political issues for now, but rather wants to focus on the welfare of the '
 "people and efforts to revive the country's affected economy due to the "
 'Covid-19 pandemic. The Prime Minister explained the matter when speaking at '
 'the meeting ceremony of leaders with the community leader of the Gambir '
 'State Legislative Assembly (DUN) at the Bukit Gambir Multipurpose Hall '
 'today.',
 'ALOR SETAR - The political turmoil of Pakatan Harapan (PAKATAN HARAPAN) has '
 "not ended when it still fails to finalize the Prime Minister's candidate "
 'agreed together. Sik Member of Parliament, Ahmad Tarmizi Sulaiman said, in '
 'relation to that he suggested the former Chairman of the United Malaysian '
 'Indigenous Party (Bersatu), Tun Doktor Mahathir Mohamad and the President of '
 "the People's Justice Party (PKR), Datuk Seri Anwar Ibrahim resigning from "
 'politics as a solution.',
 'Senior Minister (security cluster) Datuk Seri Ismail Sabri Yaakob said that '
 'the relaxation was given as the government realized the problems they were '
 'facing to renew the document. He said, besides that, for foreigners who pass '
 'the social visit, they can go to the nearest Immigration Department office '
 'to get an extension of the period.',
 'In addition, career exhibitions help students determine the career they will '
 'be entertained. As we know, the career market in Malaysia is very wide and '
 'there are still many job sectors in this country that are still empty '
 'because it is difficult to find a really qualified workforce. For example, '
 'the medical sector in Malaysia faces the problem of lack of critical '
 'workforce, especially specialist energy due to the resignation of doctors '
 'and medical experts to enter the private sector and the development of '
 'health and medical services. After realizing this fact, students will be '
 'more interested in entering the medical field because the career exhibition '
 'carried out is very helpful to provide general knowledge about this career']
CPU times: user 21.6 s, sys: 8.49 ms, total: 21.6 s
Wall time: 1.83 s

Compare results using local language structure#

[12]:
strings = [
    'ak tak paham la',
    'Hi guys! I noticed semalam & harini dah ramai yang dapat cookies ni kan. So harini i nak share some post mortem of our first batch:',
    "Memanglah. Ini tak payah expert, aku pun tau. It's a gesture, bodoh.",
    'jam 8 di pasar KK memang org ramai 😂, pandai dia pilih tmpt.',
    'Jadi haram jadah😀😃🤭',
    'nak gi mana tuu',
    'Macam nak ambil half day',
    "Bayangkan PH dan menang pru-14. Pastu macam-macam pintu belakang ada. Last-last Ismail Sabri naik. That's why I don't give a fk about politics anymore. Sumpah dah fk up dah.",
]
[15]:
%%time

transformer.generate(strings, max_length = 1000)
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
CPU times: user 8.59 s, sys: 0 ns, total: 8.59 s
Wall time: 720 ms
[15]:
["I don't understand",
 'Hello guys! I noticed yesterday & today there are many who got these cookies, right? So today I want to share some post mortem of our first batch:',
 "It's not a expert, I know. It's a gesture, stupid.",
 "At 8am, the KK market is very good, and it's a good idea to choose a tmpt.",
 "So it's a fucking shit.",
 'Where do you want to go?',
 "It's like taking half a day.",
 "Imagine PH and win pru-14. Pastu all sorts of back door there. Last-last Ismail Sabri naik. That's why I don't give a fk about politics anymore. I swear it's fk up."]
[16]:
%%time

transformer_noisy.generate(strings, max_length = 1000)
CPU times: user 9.91 s, sys: 1.92 ms, total: 9.91 s
Wall time: 832 ms
[16]:
["I don't understand",
 'Hi guys! I noticed yesterday & today many people got these cookies, right? So today I want to share some post mortem of our first batch:',
 "Indeed. This does not need to be an expert, I know too. It's a gesture, stupid.",
 "at 8 o'clock at the OKAY market it's really crowded, he's good at choosing a place.",
 "So it's illegal ",
 'where do you want to go?',
 'Like taking half a day',
 "Imagine PAKATAN HARAPAN and winning pru-14. After that, all kinds of backdoors are there. Last-last Ismail Sabri went up. That's why I don't give it about politics anymore. I swear it's up."]

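The two result lists above are easier to compare when each source string is printed next to both translations. The helper below is a minimal sketch for that; `side_by_side` is a hypothetical name, not part of Malaya, and the sample values are two input/output pairs copied from the cells above.

```python
from typing import Dict, List


def side_by_side(sources: List[str], outputs: Dict[str, List[str]]) -> str:
    """Format each source string followed by every system's output for it."""
    for name, rows in outputs.items():
        if len(rows) != len(sources):
            raise ValueError(
                f'{name} returned {len(rows)} outputs for {len(sources)} inputs'
            )
    blocks = []
    for i, src in enumerate(sources):
        lines = [f'source: {src}']
        lines += [f'{name}: {rows[i]}' for name, rows in outputs.items()]
        blocks.append('\n'.join(lines))
    # Separate the per-source blocks with a blank line.
    return '\n\n'.join(blocks)


# Two input/output pairs copied from the cells above.
sample_sources = ['ak tak paham la', 'nak gi mana tuu']
sample_outputs = {
    'standard': ["I don't understand", 'Where do you want to go?'],
    'noisy': ["I don't understand", 'where do you want to go?'],
}
print(side_by_side(sample_sources, sample_outputs))
```

In the notebook, the same helper could be called with `strings` and the lists returned by `transformer.generate` and `transformer_noisy.generate`.
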
Compare with Google Translate using googletrans#

Install it with:

pip3 install googletrans==4.0.0rc1
[17]:
from googletrans import Translator

translator = Translator()
[18]:
for t in strings:
    r = translator.translate(t, src='ms', dest = 'en')
    print(r.text)
I don't understand
Hi guys!I noticed yesterday & today many have got these cookies.So today I want to share some post mortem of our first batch:
That's it.This is not an expert, I know.It's a gesture, stupid.
At 8 o'clock in the KK market is a lot of people 😂, he's good at choosing TMPT.
So it's illegal to make it
Where are you going
It's like taking half day
Imagine PH and won the GE-14.There must be all kinds of back doors.Last-last Ismail Sabri went up.That's why I don't give a fk about politics anymore.I swear it's up.