MS to EN HuggingFace#

This tutorial is available as an IPython notebook at Malaya/example/ms-en-translation-huggingface.

This module was trained on standard language and augmented local language structures; proceed with caution.

[1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''
[2]:
%%time

import malaya
import logging

logging.basicConfig(level=logging.INFO)
CPU times: user 3.51 s, sys: 3.29 s, total: 6.8 s
Wall time: 2.64 s
/home/husein/dev/malaya/malaya/tokenizer.py:208: FutureWarning: Possible nested set at position 3372
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/husein/dev/malaya/malaya/tokenizer.py:208: FutureWarning: Possible nested set at position 3890
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))

List available HuggingFace models#

[3]:
malaya.translation.ms_en.available_huggingface()
INFO:malaya.translation.ms_en:tested on FLORES200 MS-EN (zsm_Latn-eng_Latn) pair, https://github.com/facebookresearch/flores/tree/main/flores200
[3]:
| Model | Size (MB) | BLEU | SacreBLEU Verbose | SacreBLEU-chrF++-FLORES200 | Suggested length |
|---|---|---|---|---|---|
| mesolitica/finetune-translation-t5-super-tiny-standard-bahasa-cased | 50.7 | 34.105615 | 67.3/41.6/27.8/18.7 (BP = 0.982 ratio = 0.982 ... | 59.18 | 256 |
| mesolitica/finetune-translation-t5-tiny-standard-bahasa-cased | 139 | 37.260485 | 68.3/44.1/30.5/21.4 (BP = 0.995 ratio = 0.995 ... | 61.29 | 256 |
| mesolitica/finetune-translation-t5-small-standard-bahasa-cased | 242 | 42.010218 | 71.7/49.0/35.6/26.1 (BP = 0.989 ratio = 0.989 ... | 64.67 | 256 |
| mesolitica/finetune-translation-t5-base-standard-bahasa-cased | 892 | 43.408853 | 72.3/50.5/37.1/27.7 (BP = 0.987 ratio = 0.987 ... | 65.44 | 256 |
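
The table above trades model size against translation quality. As a rough sketch of how one might choose between the models, the helper below returns the best-scoring model that fits a size budget. The `MODELS` dictionary copies the Size (MB) and SacreBLEU-chrF++ columns from the table; `pick_model` is a hypothetical helper for illustration and is not part of Malaya.

```python
# Size (MB) and SacreBLEU-chrF++ values copied from the table above.
MODELS = {
    'mesolitica/finetune-translation-t5-super-tiny-standard-bahasa-cased': (50.7, 59.18),
    'mesolitica/finetune-translation-t5-tiny-standard-bahasa-cased': (139, 61.29),
    'mesolitica/finetune-translation-t5-small-standard-bahasa-cased': (242, 64.67),
    'mesolitica/finetune-translation-t5-base-standard-bahasa-cased': (892, 65.44),
}


def pick_model(max_size_mb):
    """Return the model with the highest chrF++ whose size fits the budget.

    Hypothetical helper, not part of Malaya.
    """
    fitting = {
        name: chrf
        for name, (size, chrf) in MODELS.items()
        if size <= max_size_mb
    }
    if not fitting:
        raise ValueError(f'no model is smaller than {max_size_mb} MB')
    return max(fitting, key=fitting.get)


print(pick_model(300))
```

The returned name can then be passed as the `model` argument of `malaya.translation.ms_en.huggingface`.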

These HuggingFace models were trained on:

  1. EN-MS dataset, https://huggingface.co/datasets/mesolitica/en-ms

  2. MS-EN dataset, https://huggingface.co/datasets/mesolitica/ms-en

  3. NLLB eng_Latn-zsm_Latn, https://github.com/huseinzol05/malay-dataset/tree/master/translation/laser

Load Transformer models#

def huggingface(model: str = 'mesolitica/finetune-translation-t5-small-standard-bahasa-cased', **kwargs):
    """
    Load HuggingFace model to translate MS-to-EN.

    Parameters
    ----------
    model: str, optional (default='mesolitica/finetune-translation-t5-small-standard-bahasa-cased')
        Check available models at `malaya.translation.ms_en.available_huggingface()`.

    Returns
    -------
    result: malaya.torch_model.huggingface.Generator
    """
[4]:
transformer = malaya.translation.ms_en.transformer()
2022-10-04 21:25:52.819612: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO:malaya_boilerplate.frozen_graph:running home/husein/.cache/huggingface/hub/models--huseinzol05--translation-ms-en-base/snapshots/c163027ea2df8ba8364b601396fa89fcf263ece5 using device /device:CPU:0
2022-10-04 21:25:52.825155: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-10-04 21:25:52.825193: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: husein-MS-7D31
2022-10-04 21:25:52.825200: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: husein-MS-7D31
2022-10-04 21:25:52.825279: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
2022-10-04 21:25:52.825311: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 470.141.3
[6]:
transformer_huggingface = malaya.translation.ms_en.huggingface()

Translate#

def generate(self, strings: List[str], **kwargs):
    """
    Generate texts from the input.

    Parameters
    ----------
    strings : List[str]
    **kwargs: keyword arguments passed to the HuggingFace `generate` method.
        Read more at https://huggingface.co/docs/transformers/main_classes/text_generation

    Returns
    -------
    result: List[str]
    """
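
Because `**kwargs` are forwarded to the HuggingFace `generate` method, decoding options can be set per call. The following is a minimal sketch, assuming the loaded model forwards them unchanged. `DEFAULTS` and `translate` are hypothetical names introduced here; `max_length=256` mirrors the suggested length listed for the models, and `num_beams` and `early_stopping` are standard HuggingFace generation parameters.

```python
from typing import Any, Dict, List

# Hypothetical defaults; 256 mirrors the "Suggested length" column.
DEFAULTS: Dict[str, Any] = {
    'max_length': 256,
    'num_beams': 4,
    'early_stopping': True,
}


def translate(model, strings: List[str], **overrides) -> List[str]:
    """Call `model.generate` with DEFAULTS, letting callers override any key."""
    kwargs = {**DEFAULTS, **overrides}
    return model.generate(strings, **kwargs)
```

With the model loaded above, a call might look like `translate(transformer_huggingface, ['Saya suka makan nasi.'], num_beams=1)`.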

For better results, always split the input at sentence boundaries.
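
One way to follow this advice is sketched below. It uses a deliberately naive regular expression that splits after `.`, `!` or `?` followed by whitespace, so it will also split on abbreviations that end with a period. `split_sentences` and `translate_by_sentence` are hypothetical helpers, not part of Malaya; `generate` stands for any callable that maps a list of strings to a list of translations.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]


def translate_by_sentence(
    text: str,
    generate: Callable[[List[str]], List[str]],
) -> str:
    """Translate each sentence separately, then join the results with spaces."""
    sentences = split_sentences(text)
    if not sentences:
        return ''
    return ' '.join(generate(sentences))
```

Assuming the model loaded earlier, usage would look like `translate_by_sentence(text, transformer_huggingface.generate)`.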

[7]:
from pprint import pprint
[8]:
# https://www.sinarharian.com.my/article/89678/BERITA/Politik/Saya-tidak-mahu-sentuh-isu-politik-Muhyiddin

string_news1 = 'TANGKAK - Tan Sri Muhyiddin Yassin berkata, beliau tidak mahu menyentuh mengenai isu politik buat masa ini, sebaliknya mahu menumpukan kepada soal kebajikan rakyat serta usaha merancakkan semula ekonomi negara yang terjejas berikutan pandemik Covid-19. Perdana Menteri menjelaskan perkara itu ketika berucap pada Majlis Bertemu Pemimpin bersama pemimpin masyarakat Dewan Undangan Negeri (DUN) Gambir di Dewan Serbaguna Bukit Gambir hari ini.'
pprint(string_news1)
('TANGKAK - Tan Sri Muhyiddin Yassin berkata, beliau tidak mahu menyentuh '
 'mengenai isu politik buat masa ini, sebaliknya mahu menumpukan kepada soal '
 'kebajikan rakyat serta usaha merancakkan semula ekonomi negara yang terjejas '
 'berikutan pandemik Covid-19. Perdana Menteri menjelaskan perkara itu ketika '
 'berucap pada Majlis Bertemu Pemimpin bersama pemimpin masyarakat Dewan '
 'Undangan Negeri (DUN) Gambir di Dewan Serbaguna Bukit Gambir hari ini.')
[9]:
# https://www.sinarharian.com.my/article/90021/BERITA/Politik/Tun-Mahathir-Anwar-disaran-bersara-untuk-selesai-kemelut-politik

string_news2 = 'ALOR SETAR - Kemelut politik Pakatan Harapan (PH) belum berkesudahan apabila masih gagal memuktamadkan calon Perdana Menteri yang dipersetujui bersama. Ahli Parlimen Sik, Ahmad Tarmizi Sulaiman berkata, sehubungan itu pihaknya mencadangkan mantan Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu), Tun Dr Mahathir Mohamad dan Presiden Parti Keadilan Rakyat (PKR), Datuk Seri Anwar Ibrahim mengundurkan diri daripada politik sebagai jalan penyelesaian.'
pprint(string_news2)
('ALOR SETAR - Kemelut politik Pakatan Harapan (PH) belum berkesudahan apabila '
 'masih gagal memuktamadkan calon Perdana Menteri yang dipersetujui bersama. '
 'Ahli Parlimen Sik, Ahmad Tarmizi Sulaiman berkata, sehubungan itu pihaknya '
 'mencadangkan mantan Pengerusi Parti Pribumi Bersatu Malaysia (Bersatu), Tun '
 'Dr Mahathir Mohamad dan Presiden Parti Keadilan Rakyat (PKR), Datuk Seri '
 'Anwar Ibrahim mengundurkan diri daripada politik sebagai jalan penyelesaian.')
[10]:
string_news3 = 'Menteri Kanan (Kluster Keselamatan) Datuk Seri Ismail Sabri Yaakob berkata, kelonggaran itu diberi berikutan kerajaan menyedari masalah yang dihadapi mereka untuk memperbaharui dokumen itu. Katanya, selain itu, bagi rakyat asing yang pas lawatan sosial tamat semasa Perintah Kawalan Pergerakan (PKP) pula boleh ke pejabat Jabatan Imigresen yang terdekat untuk mendapatkan lanjutan tempoh.'
pprint(string_news3)
('Menteri Kanan (Kluster Keselamatan) Datuk Seri Ismail Sabri Yaakob berkata, '
 'kelonggaran itu diberi berikutan kerajaan menyedari masalah yang dihadapi '
 'mereka untuk memperbaharui dokumen itu. Katanya, selain itu, bagi rakyat '
 'asing yang pas lawatan sosial tamat semasa Perintah Kawalan Pergerakan (PKP) '
 'pula boleh ke pejabat Jabatan Imigresen yang terdekat untuk mendapatkan '
 'lanjutan tempoh.')
[11]:
# https://qcikgubm.blogspot.com/2018/02/contoh-soalan-dan-jawapan-karangan.html

string_karangan = 'Selain itu, pameran kerjaya membantu para pelajar menentukan kerjaya yang akan diceburi oleh mereka. Seperti yang kita ketahui, pasaran kerjaya di Malaysia sangat luas dan masih banyak sektor pekerjaan di negara ini yang masih kosong kerana sukar untuk mencari tenaga kerja yang benar-benar berkelayakan. Sebagai contohnya, sektor perubatan di Malaysia menghadapi masalah kekurangan tenaga kerja yang kritikal, khususnya tenaga pakar disebabkan peletakan jawatan oleh doktor dan pakar perubatan untuk memasuki sektor swasta serta berkembangnya perkhidmatan kesihatan dan perubatan. Setelah menyedari  hakikat ini, para pelajar akan lebih berminat untuk menceburi bidang perubatan kerana pameran kerjaya yang dilaksanakan amat membantu memberikan pengetahuan am tentang kerjaya ini'
pprint(string_karangan)
('Selain itu, pameran kerjaya membantu para pelajar menentukan kerjaya yang '
 'akan diceburi oleh mereka. Seperti yang kita ketahui, pasaran kerjaya di '
 'Malaysia sangat luas dan masih banyak sektor pekerjaan di negara ini yang '
 'masih kosong kerana sukar untuk mencari tenaga kerja yang benar-benar '
 'berkelayakan. Sebagai contohnya, sektor perubatan di Malaysia menghadapi '
 'masalah kekurangan tenaga kerja yang kritikal, khususnya tenaga pakar '
 'disebabkan peletakan jawatan oleh doktor dan pakar perubatan untuk memasuki '
 'sektor swasta serta berkembangnya perkhidmatan kesihatan dan perubatan. '
 'Setelah menyedari  hakikat ini, para pelajar akan lebih berminat untuk '
 'menceburi bidang perubatan kerana pameran kerjaya yang dilaksanakan amat '
 'membantu memberikan pengetahuan am tentang kerjaya ini')
[13]:
%%time

pprint(transformer.greedy_decoder([string_news1, string_news2, string_news3, string_karangan]))
['TANGKAK - Tan Sri Muhyiddin Yassin said he did not want to touch on '
 'political issues at the moment, instead focusing on the welfare of the '
 "people and efforts to revitalize the affected country's economy following "
 'the Covid-19 pandemic. The prime minister explained the matter when speaking '
 'at a Leadership Meeting with Gambir State Assembly (DUN) leaders at the '
 'Bukit Gambir Multipurpose Hall today.',
 'ALOR SETAR - Pakatan Harapan (PH) political turmoil has not ended when it '
 "has failed to finalize the Prime Minister's candidate agreed upon. Sik MP "
 'Ahmad Tarmizi Sulaiman said he had suggested former United Nations (UN) '
 "Indigenous Party chairman Tun Dr Mahathir Mohamad and People's Justice Party "
 '(PKR) president Datuk Seri Anwar Ibrahim resign from politics as a solution.',
 'Senior Minister (Security Cluster) Datuk Seri Ismail Sabri Yaakob said the '
 'relaxation was given as the government was aware of the problems they had to '
 'renew the document. He added that for foreigners who had passed the social '
 'visit during the Movement Control Order (CPP) they could go to the nearest '
 'Immigration Department office for further extension.',
 'In addition, career exhibitions help students determine their careers. As we '
 'know, the career market in Malaysia is very broad and there are still many '
 'job sectors in the country that are still vacant because it is difficult to '
 'find a truly qualified workforce. For example, the medical sector in '
 'Malaysia is facing a critical shortage of labor, especially specialists due '
 'to the resignation of doctors and physicians to enter the private sector and '
 'develop health and medical services. Upon realizing this fact, students will '
 'be more interested in medicine because the exhibition careers are very '
 'helpful in providing general knowledge of this career.']
CPU times: user 15.7 s, sys: 2.18 s, total: 17.9 s
Wall time: 6.89 s
[15]:
%%time

pprint(transformer_huggingface.generate([string_news1, string_news2, string_news3, string_karangan],
                                 max_length = 1000))
['TANGKAK - Tan Sri Muhyiddin Yassin said he did not want to touch on '
 'political issues at this time, instead focusing on the welfare of the people '
 "and efforts to revive the country's economy affected by the Covid-19 "
 'pandemic. The prime minister explained this when speaking at a meeting of '
 'leaders with Gambir State Assembly (DUN) leaders at the Bukit Gambir '
 'Multipurpose Hall today.',
 'ALOR SETAR - Pakatan Harapan (PH) political turmoil has not ended when it '
 "has failed to finalise the Prime Minister's candidate. Sik Member of "
 'Parliament Ahmad Tarmizi Sulaiman said he had suggested that former United '
 "People's Party (UN) chairman Tun Dr Mahathir Mohamad and People's Justice "
 'Party (PKR) president Datuk Seri Anwar Ibrahim resign from politics as a '
 'solution.',
 'Senior Minister (Safety cluster) Datuk Seri Ismail Sabri Yaakob said the '
 'relaxation was given as the government was aware of the problems they faced '
 'to renew the document. He said that foreigners who had expired social visit '
 'during the Movement Control Order (MCO) could go to the nearest Immigration '
 'Department office for an extension.',
 'In addition, career exhibitions help students determine the careers they '
 'will pursue. As we know, the career market in Malaysia is vast and many job '
 'sectors in the country are still vacant because it is difficult to find a '
 'truly qualified workforce. For example, the medical sector in Malaysia is '
 'facing critical labor shortages, especially specialists due to the '
 'resignation of doctors and physicians to enter the private sector as well as '
 'the development of health and medical services. Having realized this fact, '
 'students will be more interested in pursuing medicine because the career '
 'exhibitions are being implemented greatly to provide general knowledge of '
 'this career.']
CPU times: user 16.7 s, sys: 96.4 ms, total: 16.8 s
Wall time: 1.48 s

Compare with Google Translate using googletrans#

Install it with:

pip3 install googletrans==4.0.0rc1
[16]:
from googletrans import Translator

translator = Translator()
[17]:
strings = [string_news1, string_news2, string_news3, string_karangan]
[18]:
for t in strings:
    r = translator.translate(t, src='ms', dest = 'en')
    print(r.text)
TANGKAK - Tan Sri Muhyiddin Yassin said he did not want to touch on political issues at this time, instead of focusing on the welfare of the people and efforts to regenerate the country's economy following the Covid -19 pandemic.The prime minister explained the matter when speaking at a ceremony with a leader of the Gambir State Assembly (DUN) community leader at the Bukit Gambir Multipurpose Hall today.
ALOR SETAR - The Pakatan Harapan (PH) political turmoil has not ended when it fails to finalize the agreed prime ministerial candidate.Sik Member of Parliament Ahmad Tarmizi Sulaiman said he had suggested former United Indigenous Party (UN) chairman Tun Dr Mahathir Mohamad and the People's Justice Party (PKR) president Datuk Seri Anwar Ibrahim resigned from politics as a solution.
Senior Minister (Security Cluster) Datuk Seri Ismail Sabri Yaakob said the relaxation was given as the government was aware of the problem they had to renew the document.He said, for foreigners, the social visit ended during the Movement Control Order (CPP) could go to the nearest Immigration Department's office for extension.
In addition, career exhibitions help students determine the careers they will be involved in.As we know, the career market in Malaysia is very broad and there are still many employment sectors in the country that are still vacant because it is difficult to find a truly qualified workforce.For example, the medical sector in Malaysia is facing critical workforce problems, especially experts due to the resignation of doctors and physicians to enter the private sector as well as the growth of health and medical services.Upon realizing this fact, students will be more interested in getting into medicine because their career exhibitions are very helpful in providing general knowledge of this career