MS to EN HuggingFace#

This tutorial is available as an IPython notebook at Malaya/example/ms-en-translation-huggingface.

This module was trained on standard language and augmented local-language structures; proceed with caution.

[1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''
[2]:
%%time

import malaya
import logging

logging.basicConfig(level=logging.INFO)
CPU times: user 3.81 s, sys: 3.46 s, total: 7.27 s
Wall time: 3.17 s

List available HuggingFace models#

[3]:
malaya.translation.ms_en.available_huggingface()
INFO:malaya.translation.ms_en:tested on FLORES200 EN-MS (eng_Latn-zsm_Latn) pair `dev` set, https://github.com/facebookresearch/flores/tree/main/flores200
INFO:malaya.translation.ms_en:for noisy, tested on noisy twitter google translation, https://huggingface.co/datasets/mesolitica/augmentation-test-set
[3]:
Model | Size (MB) | BLEU | SacreBLEU Verbose | SacreBLEU-chrF++-FLORES200 | Suggested length
mesolitica/finetune-translation-t5-tiny-standard-bahasa-cased-v2 | 139 | 37.260485 | 68.3/44.1/30.5/21.4 (BP = 0.995 ratio = 0.995 ... | 61.29 | 256
mesolitica/finetune-translation-t5-small-standard-bahasa-cased-v2 | 242 | 42.010218 | 71.7/49.0/35.6/26.1 (BP = 0.989 ratio = 0.989 ... | 64.67 | 256
mesolitica/finetune-translation-t5-base-standard-bahasa-cased-v2 | 892 | 43.408853 | 72.3/50.5/37.1/27.7 (BP = 0.987 ratio = 0.987 ... | 65.44 | 256
mesolitica/finetune-noisy-translation-t5-tiny-bahasa-cased-v3 | 139 | 60.000967 | 77.9/63.9/54.6/47.7 (BP = 1.000 ratio = 1.036 ... | None | 256
mesolitica/finetune-noisy-translation-t5-small-bahasa-cased-v3 | 242 | 64.062582 | 80.1/67.7/59.1/52.5 (BP = 1.000 ratio = 1.042 ... | None | 256
mesolitica/finetune-noisy-translation-t5-base-bahasa-cased-v2 | 892 | 64.583819 | 80.2/68.1/59.8/53.2 (BP = 1.000 ratio = 1.048 ... | None | 256
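
The table can guide model choice. The sketch below is not part of Malaya: it copies the size and BLEU columns into a plain list and picks the highest-BLEU model that fits a size budget. The helper name `best_model` is hypothetical. Because the log lines say the standard and noisy models were scored on different test sets, the sketch never compares BLEU across the two families.

```python
from typing import List, Tuple

# (model name, size in MB, BLEU), copied by hand from the table above.
MODELS: List[Tuple[str, int, float]] = [
    ('mesolitica/finetune-translation-t5-tiny-standard-bahasa-cased-v2', 139, 37.260485),
    ('mesolitica/finetune-translation-t5-small-standard-bahasa-cased-v2', 242, 42.010218),
    ('mesolitica/finetune-translation-t5-base-standard-bahasa-cased-v2', 892, 43.408853),
    ('mesolitica/finetune-noisy-translation-t5-tiny-bahasa-cased-v3', 139, 60.000967),
    ('mesolitica/finetune-noisy-translation-t5-small-bahasa-cased-v3', 242, 64.062582),
    ('mesolitica/finetune-noisy-translation-t5-base-bahasa-cased-v2', 892, 64.583819),
]


def best_model(max_size_mb: int, noisy: bool = False) -> str:
    """Return the highest-BLEU model of one family that fits the size budget."""
    candidates = [
        (name, size, bleu)
        for name, size, bleu in MODELS
        # Keep only the requested family (standard or noisy) within budget.
        if size <= max_size_mb and ('noisy' in name) == noisy
    ]
    if not candidates:
        raise ValueError(f'no model fits within {max_size_mb} MB')
    return max(candidates, key=lambda row: row[2])[0]


print(best_model(250))
print(best_model(1000, noisy=True))
```

The returned name can be passed as the `model` argument when loading a HuggingFace model.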

Load Transformer models#

def huggingface(
    model: str = 'mesolitica/finetune-translation-t5-small-standard-bahasa-cased-v2',
    force_check: bool = True,
    **kwargs,
):
    """
    Load HuggingFace model to translate MS-to-EN.

    Parameters
    ----------
    model: str, optional (default='mesolitica/finetune-translation-t5-small-standard-bahasa-cased-v2')
        Check available models at `malaya.translation.ms_en.available_huggingface()`.
    force_check: bool, optional (default=True)
        Check that the model is one of the Malaya models.
        Set to False if you are loading your own HuggingFace model.

    Returns
    -------
    result: malaya.torch_model.huggingface.Generator
    """
[4]:
transformer = malaya.translation.ms_en.transformer()
2023-02-23 01:53:46.349840: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-23 01:53:46.357808: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2023-02-23 01:53:46.357857: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: husein-MS-7D31
2023-02-23 01:53:46.357867: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: husein-MS-7D31
2023-02-23 01:53:46.357980: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 470.161.3
2023-02-23 01:53:46.358017: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 470.161.3
2023-02-23 01:53:46.358026: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 470.161.3
[5]:
transformer_huggingface = malaya.translation.ms_en.huggingface()

Translate#

def generate(self, strings: List[str], **kwargs):
    """
    Generate texts from the input.

    Parameters
    ----------
    strings : List[str]
    **kwargs: keyword arguments passed to the HuggingFace `generate` method.
        Read more at https://huggingface.co/docs/transformers/main_classes/text_generation

    Returns
    -------
    result: List[str]
    """

For better results, always split the input at sentence boundaries before translating.
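
A minimal sketch of that advice follows. It assumes a naive regex splitter is good enough for your text; it will mis-split abbreviations that end in a period. The helpers `split_sentences` and `translate_by_sentence` are hypothetical names, not part of Malaya. The only thing assumed about `model` is that it has the `generate(strings, **kwargs)` method documented above.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naively split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]


def translate_by_sentence(model, text: str, **kwargs) -> str:
    """Translate each sentence separately, then join the results with spaces."""
    sentences = split_sentences(text)
    if not sentences:
        return ''
    # All sentences go through `generate` as one batch.
    translated = model.generate(sentences, **kwargs)
    return ' '.join(translated)
```

With a model loaded through `malaya.translation.ms_en.huggingface()`, a call would look like `translate_by_sentence(model, text, max_length=256)`, where 256 is the suggested length from the model table.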

[6]:
from pprint import pprint
[7]:
# https://www.sinarharian.com.my/article/89678/BERITA/Politik/Saya-tidak-mahu-sentuh-isu-politik-Muhyiddin

string_news1 = 'TANGKAK - Tan Sri Muhyiddin Yassin berkata, beliau tidak mahu menyentuh mengenai isu politik buat masa ini, sebaliknya mahu menumpukan kepada soal kebajikan rakyat serta usaha merancakkan semula ekonomi negara yang terjejas berikutan pandemik Covid-19. Perdana Menteri menjelaskan perkara itu ketika berucap pada Majlis Bertemu Pemimpin bersama pemimpin masyarakat Dewan Undangan Negeri (DUN) Gambir di Dewan Serbaguna Bukit Gambir hari ini.'
pprint(string_news1)
('TANGKAK - Tan Sri Muhyiddin Yassin berkata, beliau tidak mahu menyentuh '
 'mengenai isu politik buat masa ini, sebaliknya mahu menumpukan kepada soal '
 'kebajikan rakyat serta usaha merancakkan semula ekonomi negara yang terjejas '
 'berikutan pandemik Covid-19. Perdana Menteri menjelaskan perkara itu ketika '
 'berucap pada Majlis Bertemu Pemimpin bersama pemimpin masyarakat Dewan '
 'Undangan Negeri (DUN) Gambir di Dewan Serbaguna Bukit Gambir hari ini.')
[8]:
# https://www.sinarharian.com.my/article/90021/BERITA/Politik/Tun-Mahathir-Anwar-disaran-bersara-untuk-selesai-kemelut-politik

string_news2 = 'ALOR SETAR - Kemelut politik Pakatan Harapan (PH) belum berkesudahan apabila masih gagal memuktamadkan calon Perdana Menteri yang dipersetujui bersama. Ahli Parlimen Sik, Ahmad Tarmizi Sulaiman berkata, sehubungan itu pihaknya mencadangkan mantan Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu), Tun Dr Mahathir Mohamad dan Presiden Parti Keadilan Rakyat (PKR), Datuk Seri Anwar Ibrahim mengundurkan diri daripada politik sebagai jalan penyelesaian.'
pprint(string_news2)
('ALOR SETAR - Kemelut politik Pakatan Harapan (PH) belum berkesudahan apabila '
 'masih gagal memuktamadkan calon Perdana Menteri yang dipersetujui bersama. '
 'Ahli Parlimen Sik, Ahmad Tarmizi Sulaiman berkata, sehubungan itu pihaknya '
 'mencadangkan mantan Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu), Tun '
 'Dr Mahathir Mohamad dan Presiden Parti Keadilan Rakyat (PKR), Datuk Seri '
 'Anwar Ibrahim mengundurkan diri daripada politik sebagai jalan penyelesaian.')
[9]:
string_news3 = 'Menteri Kanan (Kluster Keselamatan) Datuk Seri Ismail Sabri Yaakob berkata, kelonggaran itu diberi berikutan kerajaan menyedari masalah yang dihadapi mereka untuk memperbaharui dokumen itu. Katanya, selain itu, bagi rakyat asing yang pas lawatan sosial tamat semasa Perintah Kawalan Pergerakan (PKP) pula boleh ke pejabat Jabatan Imigresen yang terdekat untuk mendapatkan lanjutan tempoh.'
pprint(string_news3)
('Menteri Kanan (Kluster Keselamatan) Datuk Seri Ismail Sabri Yaakob berkata, '
 'kelonggaran itu diberi berikutan kerajaan menyedari masalah yang dihadapi '
 'mereka untuk memperbaharui dokumen itu. Katanya, selain itu, bagi rakyat '
 'asing yang pas lawatan sosial tamat semasa Perintah Kawalan Pergerakan (PKP) '
 'pula boleh ke pejabat Jabatan Imigresen yang terdekat untuk mendapatkan '
 'lanjutan tempoh.')
[10]:
# https://qcikgubm.blogspot.com/2018/02/contoh-soalan-dan-jawapan-karangan.html

string_karangan = 'Selain itu, pameran kerjaya membantu para pelajar menentukan kerjaya yang akan diceburi oleh mereka. Seperti yang kita ketahui, pasaran kerjaya di Malaysia sangat luas dan masih banyak sektor pekerjaan di negara ini yang masih kosong kerana sukar untuk mencari tenaga kerja yang benar-benar berkelayakan. Sebagai contohnya, sektor perubatan di Malaysia menghadapi masalah kekurangan tenaga kerja yang kritikal, khususnya tenaga pakar disebabkan peletakan jawatan oleh doktor dan pakar perubatan untuk memasuki sektor swasta serta berkembangnya perkhidmatan kesihatan dan perubatan. Setelah menyedari  hakikat ini, para pelajar akan lebih berminat untuk menceburi bidang perubatan kerana pameran kerjaya yang dilaksanakan amat membantu memberikan pengetahuan am tentang kerjaya ini'
pprint(string_karangan)
('Selain itu, pameran kerjaya membantu para pelajar menentukan kerjaya yang '
 'akan diceburi oleh mereka. Seperti yang kita ketahui, pasaran kerjaya di '
 'Malaysia sangat luas dan masih banyak sektor pekerjaan di negara ini yang '
 'masih kosong kerana sukar untuk mencari tenaga kerja yang benar-benar '
 'berkelayakan. Sebagai contohnya, sektor perubatan di Malaysia menghadapi '
 'masalah kekurangan tenaga kerja yang kritikal, khususnya tenaga pakar '
 'disebabkan peletakan jawatan oleh doktor dan pakar perubatan untuk memasuki '
 'sektor swasta serta berkembangnya perkhidmatan kesihatan dan perubatan. '
 'Setelah menyedari  hakikat ini, para pelajar akan lebih berminat untuk '
 'menceburi bidang perubatan kerana pameran kerjaya yang dilaksanakan amat '
 'membantu memberikan pengetahuan am tentang kerjaya ini')
[11]:
%%time

pprint(transformer.greedy_decoder([string_news1, string_news2, string_news3, string_karangan]))
['TANGKAK - Tan Sri Muhyiddin Yassin said he did not want to touch on '
 'political issues at the moment, instead focusing on the welfare of the '
 "people and efforts to revitalize the affected country's economy following "
 'the Covid-19 pandemic. The prime minister explained the matter when speaking '
 'at a Leadership Meeting with Gambir State Assembly (DUN) leaders at the '
 'Bukit Gambir Multipurpose Hall today.',
 'ALOR SETAR - Pakatan Harapan (PH) political turmoil has not ended when it '
 "has failed to finalize the Prime Minister's candidate agreed upon. Sik MP "
 'Ahmad Tarmizi Sulaiman said he had suggested former United Nations (UN) '
 "Indigenous Party chairman Tun Dr Mahathir Mohamad and People's Justice Party "
 '(PKR) president Datuk Seri Anwar Ibrahim resign from politics as a solution.',
 'Senior Minister (Security Cluster) Datuk Seri Ismail Sabri Yaakob said the '
 'relaxation was given as the government was aware of the problems they had to '
 'renew the document. He added that for foreigners who had passed the social '
 'visit during the Movement Control Order (CPP) they could go to the nearest '
 'Immigration Department office for further extension.',
 'In addition, career exhibitions help students determine their careers. As we '
 'know, the career market in Malaysia is very broad and there are still many '
 'job sectors in the country that are still vacant because it is difficult to '
 'find a truly qualified workforce. For example, the medical sector in '
 'Malaysia is facing a critical shortage of labor, especially specialists due '
 'to the resignation of doctors and physicians to enter the private sector and '
 'develop health and medical services. Upon realizing this fact, students will '
 'be more interested in medicine because the exhibition careers are very '
 'helpful in providing general knowledge of this career.']
CPU times: user 15 s, sys: 2.37 s, total: 17.4 s
Wall time: 6.5 s
[12]:
%%time

pprint(transformer_huggingface.generate(
    [string_news1, string_news2, string_news3, string_karangan],
    max_length=1000,
))
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
['TANGKAK - Tan Sri Muhyiddin Yassin said he did not want to touch on '
 'political issues for the time being, instead he wanted to focus on the '
 "welfare of the people and efforts to revive the country's economy affected "
 'following the Covid-19 pandemic. The Prime Minister explained the matter '
 'when speaking at the Meeting of Leaders with Gambir State Assembly (DUN) '
 'community leaders at the Bukit Gambir Multipurpose Hall today.',
 'ALOR SETAR - Pakatan Harapan (PH) political turmoil is endless when it still '
 'fails to finalize the Prime Minister candidate agreed together. Member of '
 'Parliament for Sik, Ahmad Tarmizi Sulaiman said, in connection with that, he '
 'suggested that the former Chairman of Parti Pribumi Bersatu Malaysia '
 '(Bersatu), Tun Dr Mahathir Mohamad and the President of Parti Keadilan '
 'Rakyat (PKR), Datuk Seri Anwar Ibrahim resign from politics as a solution.',
 'Senior Minister (Security Cluster) Datuk Seri Ismail Sabri Yaakob said the '
 'flexibility was given following the government recognizing the problems they '
 'faced to renew the document. He said, apart from that, for foreigners who '
 'pass on social visits ended during the Movement Control Order (MCO), they '
 'could go to the nearest Immigration Department office to get a period '
 'extension.',
 'In addition, career exhibitions help students determine the careers they '
 'will venture into. As we know, the career market in Malaysia is very wide '
 'and there are still many job sectors in this country that are still empty '
 'because it is difficult to find a truly qualified workforce. For example, '
 'the medical sector in Malaysia is facing the problem of critical labor '
 'shortages, especially expert energy due to resignation by doctors and '
 'medical experts to enter the private sector as well as the development of '
 'health and medical services. Once realizing this fact, students will be more '
 'interested in entering the field of medicine because the career exhibition '
 'implemented greatly helps provide general knowledge about this career']
CPU times: user 19.4 s, sys: 12.8 ms, total: 19.4 s
Wall time: 1.68 s

Compare with Google Translate using googletrans#

Install it with:

pip3 install googletrans==4.0.0rc1
[13]:
from googletrans import Translator

translator = Translator()
[14]:
strings = [string_news1, string_news2, string_news3, string_karangan]
[15]:
for t in strings:
    r = translator.translate(t, src='ms', dest='en')
    print(r.text)
TANGKAK - Tan Sri Muhyiddin Yassin said he did not want to touch on political issues at this time, instead of focusing on the welfare of the people and efforts to regenerate the country's economy following the Covid -19 pandemic.The prime minister explained the matter when speaking at a ceremony with a leader of the Gambir State Assembly (DUN) community leader at the Bukit Gambir Multipurpose Hall today.
ALOR SETAR - The Pakatan Harapan (PH) political turmoil has not ended when it fails to finalize the agreed prime ministerial candidate.Sik Member of Parliament Ahmad Tarmizi Sulaiman said he had suggested former United Indigenous Party (UN) chairman Tun Dr Mahathir Mohamad and the People's Justice Party (PKR) president Datuk Seri Anwar Ibrahim resigned from politics as a solution.
Senior Minister (Security Cluster) Datuk Seri Ismail Sabri Yaakob said the relaxation was given as the government was aware of the problem they had to renew the document.He said, for foreigners, the social visit ended during the Movement Control Order (CPP) could go to the nearest Immigration Department's office for extension.
In addition, career exhibitions help students determine the careers they will be involved in.As we know, the career market in Malaysia is very broad and there are still many employment sectors in the country that are still vacant because it is difficult to find a truly qualified workforce.For example, the medical sector in Malaysia is facing critical workforce problems, especially experts due to the resignation of doctors and physicians to enter the private sector as well as the growth of health and medical services.Upon realizing this fact, students will be more interested in getting into medicine because their career exhibitions are very helpful in providing general knowledge of this career
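
To put a rough number on how far two systems' outputs diverge, one option is a word-level overlap ratio from Python's standard `difflib`. This is only a sketch for side-by-side inspection, not a translation-quality metric like the BLEU and chrF++ scores reported earlier. The helper name `word_similarity` is hypothetical.

```python
from difflib import SequenceMatcher


def word_similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1] of matching word runs between two texts."""
    # Compare lowercase word lists so casing differences are ignored.
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()
```

Applied to a pair of translations of the same source string, a value near 1 means the two outputs are almost word-for-word identical, while a lower value flags passages worth reading side by side.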