MS to EN Noisy HuggingFace
This tutorial is available as an IPython notebook at Malaya/example/noisy-ms-en-translation-huggingface.
This module is trained on standard language and augmented local language structures, so proceed with caution.
[1]:
%%time
import malaya
import logging
logging.basicConfig(level=logging.INFO)
CPU times: user 4.04 s, sys: 3.13 s, total: 7.17 s
Wall time: 3.56 s
List available HuggingFace models#
[2]:
malaya.translation.ms_en.available_huggingface()
INFO:malaya.translation.ms_en:tested on FLORES200 EN-MS (eng_Latn-zsm_Latn) pair `dev` set, https://github.com/facebookresearch/flores/tree/main/flores200
INFO:malaya.translation.ms_en:for noisy, tested on noisy twitter google translation, https://huggingface.co/datasets/mesolitica/augmentation-test-set
[2]:
Model | Size (MB) | BLEU | SacreBLEU Verbose | SacreBLEU-chrF++-FLORES200 | Suggested length |
---|---|---|---|---|---|
mesolitica/finetune-translation-t5-tiny-standard-bahasa-cased-v2 | 139 | 37.260485 | 68.3/44.1/30.5/21.4 (BP = 0.995 ratio = 0.995 ... | 61.29 | 256 |
mesolitica/finetune-translation-t5-small-standard-bahasa-cased-v2 | 242 | 42.010218 | 71.7/49.0/35.6/26.1 (BP = 0.989 ratio = 0.989 ... | 64.67 | 256 |
mesolitica/finetune-translation-t5-base-standard-bahasa-cased-v2 | 892 | 43.408853 | 72.3/50.5/37.1/27.7 (BP = 0.987 ratio = 0.987 ... | 65.44 | 256 |
mesolitica/finetune-noisy-translation-t5-tiny-bahasa-cased-v3 | 139 | 60.000967 | 77.9/63.9/54.6/47.7 (BP = 1.000 ratio = 1.036 ... | None | 256 |
mesolitica/finetune-noisy-translation-t5-small-bahasa-cased-v3 | 242 | 64.062582 | 80.1/67.7/59.1/52.5 (BP = 1.000 ratio = 1.042 ... | None | 256 |
mesolitica/finetune-noisy-translation-t5-base-bahasa-cased-v2 | 892 | 64.583819 | 80.2/68.1/59.8/53.2 (BP = 1.000 ratio = 1.048 ... | None | 256 |
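The table can also be used to choose a model under a size budget. The following is a minimal sketch, assuming the sizes and BLEU scores are copied by hand from the table above; `pick_model` is a hypothetical helper and not part of Malaya. According to the log messages, the standard and noisy models are scored on different test sets, so the sketch only compares models within the noisy family.

```python
# Sizes (MB) and BLEU scores copied by hand from the table above.
# `pick_model` is a hypothetical helper, not part of the Malaya API.
NOISY_MODELS = {
    'mesolitica/finetune-noisy-translation-t5-tiny-bahasa-cased-v3': (139, 60.000967),
    'mesolitica/finetune-noisy-translation-t5-small-bahasa-cased-v3': (242, 64.062582),
    'mesolitica/finetune-noisy-translation-t5-base-bahasa-cased-v2': (892, 64.583819),
}


def pick_model(models, max_size_mb):
    """Return the highest-BLEU model name whose size fits the budget, or None."""
    candidates = [
        (bleu, name)
        for name, (size_mb, bleu) in models.items()
        if size_mb <= max_size_mb
    ]
    if not candidates:
        return None
    # max() on (bleu, name) tuples compares BLEU first.
    return max(candidates)[1]


print(pick_model(NOISY_MODELS, max_size_mb=300))
```

The returned name can then be passed as the `model` argument when loading a model.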
Load Transformer models#
def huggingface(
    model: str = 'mesolitica/finetune-translation-t5-small-standard-bahasa-cased-v2',
    force_check: bool = True,
    **kwargs,
):
    """
    Load HuggingFace model to translate MS-to-EN.

    Parameters
    ----------
    model: str, optional (default='mesolitica/finetune-translation-t5-small-standard-bahasa-cased-v2')
        Check available models at `malaya.translation.ms_en.available_huggingface()`.
    force_check: bool, optional (default=True)
        Force check that the model is one of the Malaya models.
        Set to False if you have your own HuggingFace model.

    Returns
    -------
    result: malaya.torch_model.huggingface.Generator
    """
[4]:
transformer = malaya.translation.ms_en.huggingface()
[5]:
transformer_noisy = malaya.translation.ms_en.huggingface(model = 'mesolitica/finetune-noisy-translation-t5-small-bahasa-cased-v3')
Translate#
def generate(self, strings: List[str], **kwargs):
    """
    Generate texts from the input.

    Parameters
    ----------
    strings : List[str]
    **kwargs: keyword arguments passed to the HuggingFace `generate` method.

    Returns
    -------
    result: List[str]
    """
For better results, always split the input at sentence boundaries before translating.
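One way to split text into sentences is a simple punctuation rule. The sketch below is an assumption for illustration, not Malaya's own splitter: `split_sentences` is a hypothetical helper, and the regular expression is a naive heuristic that will also split after abbreviations followed by a capitalised word.

```python
import re

# Split after '.', '!' or '?' when followed by whitespace and an uppercase
# letter. This is a naive heuristic: it also splits after abbreviations
# such as "Dr. Mahathir", so treat it as a starting point only.
_SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z])')


def split_sentences(text):
    """Split text into sentences with a simple punctuation-based rule."""
    return [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s.strip()]


sample = (
    'Perdana Menteri menjelaskan perkara itu hari ini. '
    'Katanya, kelonggaran itu diberi oleh kerajaan.'
)
sentences = split_sentences(sample)
print(sentences)
```

The resulting list can be passed directly to `generate`, and the translated sentences joined back together with a space.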
[6]:
from pprint import pprint
[7]:
# https://www.sinarharian.com.my/article/89678/BERITA/Politik/Saya-tidak-mahu-sentuh-isu-politik-Muhyiddin
string_news1 = 'TANGKAK - Tan Sri Muhyiddin Yassin berkata, beliau tidak mahu menyentuh mengenai isu politik buat masa ini, sebaliknya mahu menumpukan kepada soal kebajikan rakyat serta usaha merancakkan semula ekonomi negara yang terjejas berikutan pandemik Covid-19. Perdana Menteri menjelaskan perkara itu ketika berucap pada Majlis Bertemu Pemimpin bersama pemimpin masyarakat Dewan Undangan Negeri (DUN) Gambir di Dewan Serbaguna Bukit Gambir hari ini.'
pprint(string_news1)
('TANGKAK - Tan Sri Muhyiddin Yassin berkata, beliau tidak mahu menyentuh '
'mengenai isu politik buat masa ini, sebaliknya mahu menumpukan kepada soal '
'kebajikan rakyat serta usaha merancakkan semula ekonomi negara yang terjejas '
'berikutan pandemik Covid-19. Perdana Menteri menjelaskan perkara itu ketika '
'berucap pada Majlis Bertemu Pemimpin bersama pemimpin masyarakat Dewan '
'Undangan Negeri (DUN) Gambir di Dewan Serbaguna Bukit Gambir hari ini.')
[8]:
# https://www.sinarharian.com.my/article/90021/BERITA/Politik/Tun-Mahathir-Anwar-disaran-bersara-untuk-selesai-kemelut-politik
string_news2 = 'ALOR SETAR - Kemelut politik Pakatan Harapan (PH) belum berkesudahan apabila masih gagal memuktamadkan calon Perdana Menteri yang dipersetujui bersama. Ahli Parlimen Sik, Ahmad Tarmizi Sulaiman berkata, sehubungan itu pihaknya mencadangkan mantan Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu), Tun Dr Mahathir Mohamad dan Presiden Parti Keadilan Rakyat (PKR), Datuk Seri Anwar Ibrahim mengundurkan diri daripada politik sebagai jalan penyelesaian.'
pprint(string_news2)
('ALOR SETAR - Kemelut politik Pakatan Harapan (PH) belum berkesudahan apabila '
'masih gagal memuktamadkan calon Perdana Menteri yang dipersetujui bersama. '
'Ahli Parlimen Sik, Ahmad Tarmizi Sulaiman berkata, sehubungan itu pihaknya '
'mencadangkan mantan Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu), Tun '
'Dr Mahathir Mohamad dan Presiden Parti Keadilan Rakyat (PKR), Datuk Seri '
'Anwar Ibrahim mengundurkan diri daripada politik sebagai jalan penyelesaian.')
[9]:
string_news3 = 'Menteri Kanan (Kluster Keselamatan) Datuk Seri Ismail Sabri Yaakob berkata, kelonggaran itu diberi berikutan kerajaan menyedari masalah yang dihadapi mereka untuk memperbaharui dokumen itu. Katanya, selain itu, bagi rakyat asing yang pas lawatan sosial tamat semasa Perintah Kawalan Pergerakan (PKP) pula boleh ke pejabat Jabatan Imigresen yang terdekat untuk mendapatkan lanjutan tempoh.'
pprint(string_news3)
('Menteri Kanan (Kluster Keselamatan) Datuk Seri Ismail Sabri Yaakob berkata, '
'kelonggaran itu diberi berikutan kerajaan menyedari masalah yang dihadapi '
'mereka untuk memperbaharui dokumen itu. Katanya, selain itu, bagi rakyat '
'asing yang pas lawatan sosial tamat semasa Perintah Kawalan Pergerakan (PKP) '
'pula boleh ke pejabat Jabatan Imigresen yang terdekat untuk mendapatkan '
'lanjutan tempoh.')
[10]:
# https://qcikgubm.blogspot.com/2018/02/contoh-soalan-dan-jawapan-karangan.html
string_karangan = 'Selain itu, pameran kerjaya membantu para pelajar menentukan kerjaya yang akan diceburi oleh mereka. Seperti yang kita ketahui, pasaran kerjaya di Malaysia sangat luas dan masih banyak sektor pekerjaan di negara ini yang masih kosong kerana sukar untuk mencari tenaga kerja yang benar-benar berkelayakan. Sebagai contohnya, sektor perubatan di Malaysia menghadapi masalah kekurangan tenaga kerja yang kritikal, khususnya tenaga pakar disebabkan peletakan jawatan oleh doktor dan pakar perubatan untuk memasuki sektor swasta serta berkembangnya perkhidmatan kesihatan dan perubatan. Setelah menyedari hakikat ini, para pelajar akan lebih berminat untuk menceburi bidang perubatan kerana pameran kerjaya yang dilaksanakan amat membantu memberikan pengetahuan am tentang kerjaya ini'
pprint(string_karangan)
('Selain itu, pameran kerjaya membantu para pelajar menentukan kerjaya yang '
'akan diceburi oleh mereka. Seperti yang kita ketahui, pasaran kerjaya di '
'Malaysia sangat luas dan masih banyak sektor pekerjaan di negara ini yang '
'masih kosong kerana sukar untuk mencari tenaga kerja yang benar-benar '
'berkelayakan. Sebagai contohnya, sektor perubatan di Malaysia menghadapi '
'masalah kekurangan tenaga kerja yang kritikal, khususnya tenaga pakar '
'disebabkan peletakan jawatan oleh doktor dan pakar perubatan untuk memasuki '
'sektor swasta serta berkembangnya perkhidmatan kesihatan dan perubatan. '
'Setelah menyedari hakikat ini, para pelajar akan lebih berminat untuk '
'menceburi bidang perubatan kerana pameran kerjaya yang dilaksanakan amat '
'membantu memberikan pengetahuan am tentang kerjaya ini')
[11]:
%%time
pprint(transformer_noisy.generate([string_news1, string_news2, string_news3, string_karangan],
max_length = 1000))
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
['TANGKAK - Tan Sri Muhyiddin Yassin said he did not want to touch on '
'political issues for now, but instead wanted to focus on the welfare of the '
"people as well as efforts to revive the country's economy which was affected "
'by the Covid-19 pandemic. The Prime Minister explained the matter when '
"speaking at the Leaders' Meeting with community leaders of the Gambir State "
'Legislative Assembly (DUN) at the Bukit Gambir Multipurpose Hall today.',
'ALOR SETAR - The Pakatan Harapan (PH) political crisis has not ended when it '
'still fails to finalize the Prime Minister candidate who was mutually agreed '
'upon. Sik Member of Parliament, Ahmad Tarmizi Sulaiman said, in this regard, '
'he suggested that the former Chairman of Parti Pribumi Bersatu Malaysia '
'(Bersatu), Tun Dr Mahathir Mohamad and the President of Parti Keadilan '
'Rakyat (PKR), Datuk Seri Anwar Ibrahim resign from politics as a solution.',
'Senior Minister (Security Cluster) Datuk Seri Ismail Sabri Yaakob said the '
"relaxation was given following the government's awareness of the problems "
'they faced to renew the document. He said, apart from that, for foreigners '
'who have expired social visit during the Movement Control Order (MCO), they '
'can go to the nearest Immigration Department office to obtain an extension '
'of the period.',
'In addition, career exhibitions help students determine the career they will '
'venture into. As we know, the career market in Malaysia is very broad and '
'there are still many job sectors in this country that are still empty '
'because it is difficult to find a truly qualified workforce. For example, '
'the medical sector in Malaysia is facing a critical labor shortage problem, '
'especially expert energy due to the resignation of doctors and medical '
'experts to enter the private sector as well as the development of health and '
'medical services. After realizing this fact, the students will be more '
'interested in entering the medical field because the career exhibition '
'implemented is very helpful in providing general knowledge about this career']
CPU times: user 20.1 s, sys: 24.8 ms, total: 20.1 s
Wall time: 1.73 s
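The call above sends all four strings to the model in a single batch. For longer lists it may help to translate in fixed-size chunks to bound memory use. The sketch below assumes only that the model exposes a `generate(strings, **kwargs)` method as documented above; `generate_in_batches` is a hypothetical helper, and `fake_generate` is a stand-in so the example runs without downloading a model.

```python
from typing import Callable, List


def generate_in_batches(
    generate: Callable[..., List[str]],
    strings: List[str],
    batch_size: int = 2,
    **kwargs,
) -> List[str]:
    """Call `generate` on fixed-size slices of `strings` and concatenate the results."""
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    results: List[str] = []
    for start in range(0, len(strings), batch_size):
        batch = strings[start:start + batch_size]
        # Extra keyword arguments (for example max_length) are forwarded unchanged.
        results.extend(generate(batch, **kwargs))
    return results


def fake_generate(strings, **kwargs):
    # Stand-in for a model's `generate` method so the sketch runs offline.
    return [s.upper() for s in strings]


print(generate_in_batches(fake_generate, ['a', 'b', 'c'], batch_size=2, max_length=256))
```

With a loaded model, `transformer_noisy.generate` would take the place of `fake_generate`.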
Compare results using local language structure#
[12]:
strings = [
'ak tak paham la',
'Hi guys! I noticed semalam & harini dah ramai yang dapat cookies ni kan. So harini i nak share some post mortem of our first batch:',
"Memanglah. Ini tak payah expert, aku pun tau. It's a gesture, bodoh.",
'jam 8 di pasar KK memang org ramai 😂, pandai dia pilih tmpt.',
'Jadi haram jadah😀😃ðŸ¤',
'nak gi mana tuu',
'Macam nak ambil half day',
"Bayangkan PH dan menang pru-14. Pastu macam-macam pintu belakang ada. Last-last Ismail Sabri naik. That's why I don't give a fk about politics anymore. Sumpah dah fk up dah.",
]
[13]:
%%time
transformer.generate(strings, max_length = 1000)
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
CPU times: user 11.2 s, sys: 17.8 ms, total: 11.2 s
Wall time: 997 ms
[13]:
["I don't understand",
'Hi guys! I noticed yesterday & there are many people who got this cookies, right. So dayni I want to share some post mortem of our first batch:',
"Indeed. This doesn't bother expert, I also know. It's a gesture, stupid.",
'at 8 at the KK market, there are many people, he is good at choosing tmpt.',
"So it's illegal jadah",
'want to go where is it',
'Like taking half day',
"Imagine PH and won pru-14. There are all kinds of back doors. Last-last Ismail Sabri went up. That's why I don't give a fk about politics anymore. The oath is fk up."]
[14]:
%%time
transformer_noisy.generate(strings, max_length = 1000)
CPU times: user 9.36 s, sys: 5.6 ms, total: 9.36 s
Wall time: 809 ms
[14]:
["I don't understand",
'Hi guys! I noticed yesterday & today many people got these cookies, right? So today I want to share some post mortem of our first batch:',
"Indeed. This doesn't need an expert, I know too. It's a gesture, stupid.",
"at 8 o'clock at the KK market there are many people, he is good at choosing a place.",
"So it's illegal ",
'where do you want to go?',
"It's like taking half a day",
"Imagine PAKATAN HARAPAN and winning pru-14. After that there are various back doors. Last-last Ismail Sabri went up. That's why I don't give a fk about politics anymore. I swear it's already up."]
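To see where the two models disagree, the outputs can be lined up against their inputs. The sketch below uses three short input/output pairs copied from the cells above; `differing_outputs` is a hypothetical helper, not part of Malaya.

```python
def differing_outputs(sources, standard, noisy):
    """Return (source, standard, noisy) triples where the two translations differ."""
    if not (len(sources) == len(standard) == len(noisy)):
        raise ValueError('all three lists must have the same length')
    return [
        (src, std, nsy)
        for src, std, nsy in zip(sources, standard, noisy)
        if std != nsy
    ]


# Inputs and outputs copied from the cells above.
sources = ['ak tak paham la', 'nak gi mana tuu', 'Macam nak ambil half day']
standard = ["I don't understand", 'want to go where is it', 'Like taking half day']
noisy = ["I don't understand", 'where do you want to go?', "It's like taking half a day"]

for src, std, nsy in differing_outputs(sources, standard, noisy):
    print(f'source:   {src}')
    print(f'standard: {std}')
    print(f'noisy:    {nsy}')
    print()
```

Only the inputs on which the models produce different text are printed, which makes the effect of training on noisy data easier to inspect.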
Compare with Google Translate using googletrans#
Install it with:
pip3 install googletrans==4.0.0rc1
[15]:
from googletrans import Translator
translator = Translator()
[16]:
for t in strings:
    r = translator.translate(t, src='ms', dest='en')
    print(r.text)
I don't understand
Hi guys!I noticed yesterday & today many have got these cookies.So today I want to share some post mortem of our first batch:
That's it.This is not an expert, I know.It's a gesture, stupid.
At 8 o'clock in the KK market is a lot of people 😂, he's good at choosing TMPT.
So it's illegal to make it
Where are you going
It's like taking half day
Imagine PH and won the GE-14.There must be all kinds of back door.Last-last Ismail Sabri went up.That's why I don't give a fk about politics anymore.I swear it's up.