EN to MS Noisy HuggingFace#

This tutorial is available as an IPython notebook at Malaya/example/noisy-en-ms-translation-huggingface.

This module was trained on standard language and augmented local language structures; proceed with caution.

The HuggingFace interface requires TensorFlow >= 2.0.

[1]:
%%time

import malaya
import logging

logging.basicConfig(level=logging.INFO)
CPU times: user 5.77 s, sys: 1.13 s, total: 6.89 s
Wall time: 20.2 s

List available HuggingFace models#

[2]:
malaya.translation.en_ms.available_huggingface()
INFO:malaya.translation.en_ms:tested on 77k EN-MS test set generated from teacher semisupervised model, https://huggingface.co/datasets/mesolitica/en-ms
INFO:malaya.translation.en_ms:tested on FLORES200 EN-MS (eng_Latn-zsm_Latn) pair `dev` set, https://github.com/facebookresearch/flores/tree/main/flores200
WARNING:malaya.translation.en_ms:77k EN-MS test set generated from teacher semisupervised model, the models might generate better results compared to to the teacher semisupervised model, thus lower BLEU score.
[2]:
Model                                           Size (MB)  BLEU       SacreBLEU Verbose                                  SacreBLEU-chrF++-FLORES200  Suggested length
mesolitica/t5-super-tiny-finetuned-noisy-en-ms  50.8       58.72114   80.4/64.0/53.0/44.7 (BP = 0.994 ratio = 0.994 ...  64.22                       256
mesolitica/t5-tiny-finetuned-noisy-en-ms        139        62.343084  82.6/67.5/57.2/49.3 (BP = 0.990 ratio = 0.990 ...  64.26                       256
mesolitica/t5-small-finetuned-noisy-en-ms       242        65.000708  84.8/70.5/60.6/52.9 (BP = 0.983 ratio = 0.983 ...  66.31                       256
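
The table trades model size against BLEU, so one way to choose a checkpoint is to take the highest-scoring model that fits a size budget. The sketch below is only an illustration: the sizes and BLEU scores are copied from the table above, while `MODELS` and `pick_model` are hypothetical names that are not part of Malaya.

```python
# Sizes (MB) and BLEU scores copied from the table above.
# `MODELS` and `pick_model` are hypothetical helpers, not Malaya APIs.
MODELS = {
    'mesolitica/t5-super-tiny-finetuned-noisy-en-ms': (50.8, 58.72114),
    'mesolitica/t5-tiny-finetuned-noisy-en-ms': (139.0, 62.343084),
    'mesolitica/t5-small-finetuned-noisy-en-ms': (242.0, 65.000708),
}


def pick_model(max_size_mb):
    """Return the highest-BLEU model whose size fits the budget, or None."""
    candidates = [
        (bleu, name)
        for name, (size_mb, bleu) in MODELS.items()
        if size_mb <= max_size_mb
    ]
    if not candidates:
        return None
    # max() compares BLEU first because it is the first tuple element.
    return max(candidates)[1]


print(pick_model(150))
```

The returned name can then be passed as the `model` argument when loading a HuggingFace model.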

Load Transformer models#

def huggingface(model: str = 'mesolitica/t5-tiny-finetuned-noisy-en-ms', **kwargs):
    """
    Load HuggingFace model to translate EN-to-MS.

    Parameters
    ----------
    model : str, optional (default='mesolitica/t5-tiny-finetuned-noisy-en-ms')
        Model architecture supported. Allowed values:

        * ``'mesolitica/t5-super-tiny-finetuned-noisy-en-ms'`` - https://huggingface.co/mesolitica/t5-super-tiny-finetuned-noisy-en-ms
        * ``'mesolitica/t5-tiny-finetuned-noisy-en-ms'`` - https://huggingface.co/mesolitica/t5-tiny-finetuned-noisy-en-ms
        * ``'mesolitica/t5-small-finetuned-noisy-en-ms'`` - https://huggingface.co/mesolitica/t5-small-finetuned-noisy-en-ms

    Returns
    -------
    result: malaya.model.huggingface.Generator
    """
[3]:
transformer = malaya.translation.en_ms.transformer()
INFO:malaya_boilerplate.frozen_graph:running Users/huseinzolkepli/.cache/huggingface/hub using device /device:CPU:0
[12]:
transformer_noisy = malaya.translation.en_ms.huggingface(model = 'mesolitica/t5-small-finetuned-noisy-en-ms')

Translate#

def generate(self, strings: List[str], **kwargs):
    """
    Generate texts from the input.

    Parameters
    ----------
    strings : List[str]
    **kwargs: keyword arguments passed to the HuggingFace `generate` method.

    Returns
    -------
    result: List[str]
    """

For better results, always split the input at sentence boundaries before translating.
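
A minimal sketch of sentence-level translation, assuming a naive regex splitter is good enough for the input. `split_sentences`, `translate_by_sentence` and `fake_generate` are hypothetical helpers, not part of Malaya; `fake_generate` only stands in for a real `generate` method so the example runs without downloading a model. The regex will split wrongly after abbreviations that end in a period, so a proper sentence tokenizer may be preferable for real text.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]


def translate_by_sentence(
    generate: Callable[[List[str]], List[str]], text: str
) -> str:
    """Translate each sentence separately, then join the results with spaces."""
    sentences = split_sentences(text)
    if not sentences:
        return ''
    return ' '.join(generate(sentences))


def fake_generate(strings: List[str]) -> List[str]:
    """Stand-in for a model's `generate`; upper-cases instead of translating."""
    return [s.upper() for s in strings]


print(translate_by_sentence(fake_generate, 'Hello there. How are you?'))
```

With a loaded model, the stand-in could be replaced by something like `lambda s: transformer_noisy.generate(s, max_length = 256)`, where 256 is the suggested length from the model table.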

[6]:
from pprint import pprint
[7]:
# https://www.malaymail.com/news/malaysia/2020/07/01/dr-mahathir-again-claims-anwar-lacks-popularity-with-malays-to-be-pakatans/1880420

string_news1 = 'KUALA LUMPUR, July 1 - Datuk Seri Anwar Ibrahim is not suitable to as the prime minister candidate as he is allegedly not "popular" among the Malays, Tun Dr Mahathir Mohamad claimed. The former prime minister reportedly said the PKR president needs someone like himself in order to acquire support from the Malays and win the election.'
pprint(string_news1)
('KUALA LUMPUR, July 1 - Datuk Seri Anwar Ibrahim is not suitable to as the '
 'prime minister candidate as he is allegedly not "popular" among the Malays, '
 'Tun Dr Mahathir Mohamad claimed. The former prime minister reportedly said '
 'the PKR president needs someone like himself in order to acquire support '
 'from the Malays and win the election.')
[8]:
# https://edition.cnn.com/2020/07/06/politics/new-york-attorney-general-blm/index.html

string_news2 = '(CNN)New York Attorney General Letitia James on Monday ordered the Black Lives Matter Foundation -- which she said is not affiliated with the larger Black Lives Matter movement -- to stop collecting donations in New York. "I ordered the Black Lives Matter Foundation to stop illegally accepting donations that were intended for the #BlackLivesMatter movement. This foundation is not affiliated with the movement, yet it accepted countless donations and deceived goodwill," James tweeted.'
pprint(string_news2)
('(CNN)New York Attorney General Letitia James on Monday ordered the Black '
 'Lives Matter Foundation -- which she said is not affiliated with the larger '
 'Black Lives Matter movement -- to stop collecting donations in New York. "I '
 'ordered the Black Lives Matter Foundation to stop illegally accepting '
 'donations that were intended for the #BlackLivesMatter movement. This '
 'foundation is not affiliated with the movement, yet it accepted countless '
 'donations and deceived goodwill," James tweeted.')
[9]:
# https://www.thestar.com.my/business/business-news/2020/07/04/malaysia-worries-new-eu-food-rules-could-hurt-palm-oil-exports

string_news3 = 'Amongst the wide-ranging initiatives proposed are a sustainable food labelling framework, a reformulation of processed foods, and a sustainability chapter in all EU bilateral trade agreements. The EU also plans to publish a proposal for a legislative framework for sustainable food systems by 2023 to ensure all foods on the EU market become increasingly sustainable.'
pprint(string_news3)
('Amongst the wide-ranging initiatives proposed are a sustainable food '
 'labelling framework, a reformulation of processed foods, and a '
 'sustainability chapter in all EU bilateral trade agreements. The EU also '
 'plans to publish a proposal for a legislative framework for sustainable food '
 'systems by 2023 to ensure all foods on the EU market become increasingly '
 'sustainable.')
[10]:
# https://jamesclear.com/articles

string_article1 = 'This page shares my best articles to read on topics like health, happiness, creativity, productivity and more. The central question that drives my work is, “How can we live better?” To answer that question, I like to write about science-based ways to solve practical problems.'
pprint(string_article1)
('This page shares my best articles to read on topics like health, happiness, '
 'creativity, productivity and more. The central question that drives my work '
 'is, “How can we live better?” To answer that question, I like to write about '
 'science-based ways to solve practical problems.')
[11]:
%%time

pprint(transformer_noisy.generate([string_news1, string_news2, string_news3],
                                 max_length = 1000))
['KUALA LUMPUR 1 Julai - Datuk Seri Anwar Ibrahim tidak sesuai menjadi calon '
 'perdana menteri kerana beliau didakwa tidak "popular" dalam kalangan orang '
 'Melayu, kata Tun Dr Mahathir Mohamad. Bekas perdana menteri itu dilaporkan '
 'berkata Presiden PKR memerlukan seseorang seperti dirinya untuk memperoleh '
 'sokongan daripada orang Melayu dan memenangi pilihan raya.',
 '(CNN) Peguam Negara New York Letitia James pada hari Isnin memerintahkan '
 'Yayasan Black Lives Matter - yang menurutnya tidak berafiliasi dengan '
 'gerakan Black Lives Matter yang lebih besar - untuk berhenti mengumpulkan '
 'sumbangan di New York. "Saya memerintahkan Yayasan Black Lives Matter untuk '
 'berhenti menerima sumbangan secara haram yang ditujukan untuk gerakan '
 '#BlackLivesMatter. Yayasan ini tidak berafiliasi dengan gerakan itu, namun '
 'ia menerima banyak sumbangan dan menipu muhibah," tweet James.',
 'Antara inisiatif luas yang dicadangkan adalah kerangka pelabelan makanan '
 'yang lestari, pembaharuan makanan yang diproses, dan bab kelestarian dalam '
 'semua perjanjian perdagangan dua hala EU. EU juga merancang untuk '
 'menerbitkan proposal untuk kerangka perundangan untuk sistem makanan lestari '
 'menjelang tahun 2023 untuk memastikan semua makanan di pasaran EU menjadi '
 'semakin lestari.']
CPU times: user 32.9 s, sys: 3.38 s, total: 36.3 s
Wall time: 16.8 s

Compare results using local language structures#

[13]:
strings = [
    'u ni, talk properly lah',
    "just attended my cousin's wedding. pelik jugak dia buat majlis biasa2 je sebab her lifestyle looks lavish. then i found out they're going on a 3 weeks honeymoon. smart decision 👍",
    'Me after seeing this video: mm dapnya burger benjo extra mayo',
    'Hi guys! I noticed semalam & harini dah ramai yang dapat cookies ni kan. So harini i nak share some post mortem of our first batch:',
]
[14]:
%%time

transformer_noisy.generate(strings, max_length = 1000)
CPU times: user 13.4 s, sys: 1.71 s, total: 15.1 s
Wall time: 7.63 s
[14]:
['u ni, cakap betul lah',
 'baru sahaja menghadiri majlis perkahwinan sepupu saya. pelik juga dia buat majlis biasa2 kerana gaya hidupnya kelihatan mewah. kemudian saya mendapat tahu bahawa mereka akan berbulan madu selama 3 minggu. keputusan pintar ',
 'Saya setelah melihat video ini: mm dapnya burger benjo tambahan mayo',
 'Hai kawan! Saya perhatikan semalam & harini ramai yang dapat kuki ni. Jadi harini saya nak berkongsi beberapa post mortem kumpulan pertama kami:']
[15]:
%%time

transformer.greedy_decoder(strings)
CPU times: user 17.2 s, sys: 5.97 s, total: 23.2 s
Wall time: 16.5 s
[15]:
['u ni, bercakap dengan betul lah',
 'baru sahaja menghadiri majlis perkahwinan sepupu saya. jugak buat dia majlis biasa2 je sebab gaya hidupnya kelihatan mewah. kemudian saya mendapat tahu bahawa mereka akan berbulan madu selama 3 minggu. keputusan pintar',
 'Saya setelah melihat video ini: mm dapnya burger benjo extra mayo',
 'Hai kawan-kawan! Saya perhatikan semalam & harini dah ramai yang dapat cookies ni kan. Jadi harini saya nak berkongsi beberapa post mortem kumpulan pertama kami:']
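
The two outputs above are easier to compare when each input is printed next to its translations. The helper below is a small sketch for that; `side_by_side` is a hypothetical name, and the sample strings are the first input and its two outputs copied from the cells above.

```python
from typing import Dict, List


def side_by_side(source: List[str], outputs: Dict[str, List[str]]) -> str:
    """Format each source string followed by one line per system output."""
    lines = []
    for i, src in enumerate(source):
        lines.append(f'[{i}] source: {src}')
        for name, translations in outputs.items():
            lines.append(f'    {name}: {translations[i]}')
    return '\n'.join(lines)


# First input and its outputs, copied from the cells above.
report = side_by_side(
    ['u ni, talk properly lah'],
    {
        'noisy': ['u ni, cakap betul lah'],
        'standard': ['u ni, bercakap dengan betul lah'],
    },
)
print(report)
```

In the notebook, the lists returned by `transformer_noisy.generate(strings, max_length = 1000)` and `transformer.greedy_decoder(strings)` could be passed in directly.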

Compare with Google Translate using googletrans#

Install it with:

pip3 install googletrans==4.0.0rc1
[18]:
from googletrans import Translator

translator = Translator()
[19]:
for t in strings:
    r = translator.translate(t, src='en', dest = 'ms')
    print(r.text)
u ni, bercakap dengan betul lah
Baru sahaja menghadiri majlis perkahwinan sepupu saya.Pelik Jugak Dia Buat Majlis Biasa2 Je Sebab Gaya Hidupnya kelihatan mewah.Kemudian saya dapati mereka akan berbulan madu selama 3 minggu.Keputusan Pintar 👍
Saya setelah melihat video ini: mm dapnya burger benjo tambahan mayo
Hai semua!Saya perhatikan Semalam & Harini Dah Ramai Yang Dapate Cookies Ni Kan.Jadi harini i nak berkongsi beberapa bedah siasat kumpulan pertama kami: