EN to MS#

This tutorial is available as an IPython notebook at Malaya/example/en-ms-translation.

This module was trained only on standard language structure, so it is not safe to use on local language structure.

This interface is deprecated; use the HuggingFace interface instead.

[1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''
[2]:
import logging

logging.basicConfig(level=logging.INFO)
[3]:
%%time
import malaya
CPU times: user 3.14 s, sys: 3.42 s, total: 6.56 s
Wall time: 2.28 s
/home/husein/dev/malaya/malaya/tokenizer.py:208: FutureWarning: Possible nested set at position 3372
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/husein/dev/malaya/malaya/tokenizer.py:208: FutureWarning: Possible nested set at position 3890
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))

Load dictionary#

def dictionary(**kwargs):
    """
    Load dictionary {EN: MS} .

    Returns
    -------
    result: Dict[str, str]
    """
[4]:
dictionary = malaya.translation.en_ms.dictionary()
INFO:malaya_boilerplate.huggingface:downloading frozen huseinzol05/v23-preprocessing/english-malay-200k.json
[5]:
dictionary.get('chicken')
[5]:
'ayam'
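Since the result is a plain `Dict[str, str]`, ordinary dict idioms apply. The following is a minimal sketch of a word-by-word lookup that keeps the original word when no entry exists. It uses a tiny stand-in dictionary so it runs without downloading anything: `toy_dictionary` and `lookup_words` are hypothetical names, only the `'chicken'` entry comes from the output above, and lowercasing the key is an assumption about how the real dictionary is keyed.

```python
from typing import Dict, List

# Stand-in for malaya.translation.en_ms.dictionary(). Only 'chicken' -> 'ayam'
# is taken from the tutorial output; 'rice' -> 'nasi' is an illustrative entry.
toy_dictionary: Dict[str, str] = {'chicken': 'ayam', 'rice': 'nasi'}


def lookup_words(words: List[str], mapping: Dict[str, str]) -> List[str]:
    """Translate word by word; words without an entry are returned unchanged."""
    # Assumption: keys are lowercase, so the lookup key is lowercased first.
    return [mapping.get(word.lower(), word) for word in words]


result = lookup_words(['Chicken', 'rice', 'Kuala'], toy_dictionary)
print(result)
```

A lookup like this ignores context and word order, so it is only useful for single words or as a fallback, not as a substitute for the translation models.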

List available Transformer models#

[4]:
import warnings
warnings.filterwarnings('default')
[5]:
malaya.translation.en_ms.available_transformer()
/home/husein/dev/malaya/malaya/translation/en_ms.py:159: DeprecationWarning: `malaya.translation.en_ms.available_transformer` is deprecated, use `malaya.translation.en_ms.available_huggingface` instead
  warnings.warn('`malaya.translation.en_ms.available_transformer` is deprecated, use `malaya.translation.en_ms.available_huggingface` instead', DeprecationWarning)
INFO:malaya.translation.en_ms:tested on FLORES200 EN-MS (eng_Latn-zsm_Latn) pair `dev` set, https://github.com/facebookresearch/flores/tree/main/flores200
INFO:malaya.translation.en_ms:for noisy, tested on noisy augmented FLORES200 EN-MS (eng_Latn-zsm_Latn) pair `dev` set, https://github.com/huseinzol05/malay-dataset/tree/master/translation/nllb-noisy-dev-augmentation
[5]:
Model          Size (MB)  Quantized Size (MB)  BLEU       SacreBLEU Verbose                                  SacreBLEU-chrF++-FLORES200  Suggested length
small          42.7       13.4                 39.805387  80.2/63.8/52.8/44.4 (BP = 0.997 ratio = 0.997 ...  64.46                       256
base           234        82.7                 42.210713  86.3/73.3/64.1/56.8 (BP = 0.985 ratio = 0.985 ...  66.28                       256
bigbird        246        63.7                 39.090717  70.5/46.7/32.4/22.9 (BP = 0.989 ratio = 0.989 ...  63.96                       1024
small-bigbird  50.4       13.1                 36.90195   67.0/43.8/30.1/21.0 (BP = 1.000 ratio = 1.028 ...  62.85                       1024
noisy-base     234        82.7                 41.827831  73.1/49.7/35.3/25.4 (BP = 0.985 ratio = 0.985 ...  66.46                       256
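One way to read the table is as a trade-off between model size and translation quality. The sketch below picks the model with the highest chrF++ score that fits a size budget. The numbers are copied from the table above; `models` and `best_model_under` are hypothetical names, not part of Malaya, and the sketch assumes the reported scores are a reasonable proxy for the quantized variants too, which the table does not state.

```python
from typing import Dict, Tuple

# (size MB, quantized size MB, SacreBLEU-chrF++-FLORES200), from the table above.
models: Dict[str, Tuple[float, float, float]] = {
    'small': (42.7, 13.4, 64.46),
    'base': (234.0, 82.7, 66.28),
    'bigbird': (246.0, 63.7, 63.96),
    'small-bigbird': (50.4, 13.1, 62.85),
    'noisy-base': (234.0, 82.7, 66.46),
}


def best_model_under(max_size_mb: float, quantized: bool = False) -> str:
    """Return the highest-chrF++ model whose size fits within the budget."""
    column = 1 if quantized else 0  # which size column to compare against
    candidates = {
        name: row for name, row in models.items() if row[column] <= max_size_mb
    }
    if not candidates:
        raise ValueError(f'no model fits within {max_size_mb} MB')
    return max(candidates, key=lambda name: candidates[name][2])


choice_full = best_model_under(100)
choice_quantized = best_model_under(100, quantized=True)
print(choice_full, choice_quantized)
```

Check that the chosen name is accepted by the loader before passing it on; the listing and the loader's documented values may not match exactly.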

Load Transformer models#

def transformer(model: str = 'base', quantized: bool = False, **kwargs):
    """
    Load Transformer encoder-decoder model to translate EN-to-MS.

    Parameters
    ----------
    model : str, optional (default='base')
        Model architecture supported. Allowed values:

        * ``'small'`` - Transformer SMALL parameters.
        * ``'base'`` - Transformer BASE parameters.
        * ``'bigbird'`` - BigBird BASE parameters.
        * ``'small-bigbird'`` - BigBird SMALL parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result: model
        if `bigbird` in model, return malaya.model.bigbird.Translation
        else, return malaya.model.tf.Translation
    """
[6]:
transformer = malaya.translation.en_ms.transformer()
transformer_small = malaya.translation.en_ms.transformer(model = 'small')
/home/husein/dev/malaya/malaya/translation/en_ms.py:209: DeprecationWarning: `malaya.translation.en_ms.transformer` is deprecated, use `malaya.translation.en_ms.huggingface` instead
  warnings.warn(
INFO:malaya_boilerplate.frozen_graph:running home/husein/.cache/huggingface/hub/models--huseinzol05--translation-en-ms-base/snapshots/a2f02ffbb51f5c2226126d4fa9a02f7aa36d20be using device /device:CPU:0
2022-10-21 12:33:35.533093: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-10-21 12:33:35.537338: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-10-21 12:33:35.537359: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: husein-MS-7D31
2022-10-21 12:33:35.537363: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: husein-MS-7D31
2022-10-21 12:33:35.537427: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
2022-10-21 12:33:35.537449: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 470.141.3
INFO:malaya_boilerplate.frozen_graph:running home/husein/.cache/huggingface/hub/models--huseinzol05--translation-en-ms-small/snapshots/154b07d08054ad5ad65c7dba4e1a5d49762dce85 using device /device:CPU:0

Load Quantized model#

To load the 8-bit quantized model, pass quantized = True; the default is False.

Expect a slight drop in accuracy from the quantized model. It is also not necessarily faster than the normal 32-bit float model; that depends entirely on the machine.

[8]:
quantized_transformer = malaya.translation.en_ms.transformer(quantized = True)
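Because the speed-up depends on the machine, it is worth measuring rather than assuming. The sketch below is a small timing helper that takes any callable with the same shape as `greedy_decoder` (a list of strings in, a list of strings out). `median_seconds` and `fake_decoder` are hypothetical names; `fake_decoder` is only a placeholder so the example runs without loading a model.

```python
import time
from typing import Callable, List


def median_seconds(
    fn: Callable[[List[str]], List[str]], batch: List[str], repeats: int = 5
) -> float:
    """Median wall-clock time of fn(batch) over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(batch)
        timings.append(time.perf_counter() - start)
    timings.sort()
    return timings[len(timings) // 2]


# Placeholder decoder; swap in transformer.greedy_decoder and
# quantized_transformer.greedy_decoder to compare the real models.
def fake_decoder(strings: List[str]) -> List[str]:
    return [s.upper() for s in strings]


elapsed = median_seconds(fake_decoder, ['i am in medical school.'])
print(f'{elapsed:.6f} s')
```

Using the median over a few repeats reduces the effect of one-off costs such as the first call warming up the graph.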

Translate#

Using greedy decoder#

def greedy_decoder(self, strings: List[str]):
    """
    translate list of strings.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: List[str]
    """

Using beam decoder#

def beam_decoder(self, strings: List[str], beam_size: int = 3, temperature: float = 0.5):
    """
    translate list of strings using beam decoder.
    Currently only `noisy` models supported `beam_size` and `temperature` parameters.

    Parameters
    ----------
    strings : List[str]
    beam_size: int, optional (default=3)
    temperature: float, optional (default=0.5)

    Returns
    -------
    result: List[str]
    """
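To illustrate the general difference between the two decoding strategies, here is a toy sketch over a hand-written next-token table. It is not Malaya's implementation: the probabilities in `NEXT` are made up, the functions `greedy` and `beam` are hypothetical, and the sketch leaves out `temperature` and length normalisation entirely.

```python
import math
from typing import Dict, List, Tuple

# Toy next-token probabilities keyed by the previous token (made-up numbers).
NEXT: Dict[str, Dict[str, float]] = {
    '<s>': {'a': 0.6, 'b': 0.4},
    'a': {'x': 0.55, 'y': 0.45},
    'b': {'z': 0.9, 'x': 0.1},
    'x': {'</s>': 1.0},
    'y': {'</s>': 1.0},
    'z': {'</s>': 1.0},
}


def greedy(max_len: int = 5) -> Tuple[List[str], float]:
    """Always take the single most probable next token."""
    tokens, logp = ['<s>'], 0.0
    while tokens[-1] != '</s>' and len(tokens) < max_len:
        choices = NEXT[tokens[-1]]
        best = max(choices, key=choices.get)
        logp += math.log(choices[best])
        tokens.append(best)
    return tokens, logp


def beam(beam_size: int = 3, max_len: int = 5) -> Tuple[List[str], float]:
    """Keep the beam_size best partial sequences at every step."""
    beams: List[Tuple[List[str], float]] = [(['<s>'], 0.0)]
    for _ in range(max_len - 1):
        expanded = []
        for tokens, logp in beams:
            if tokens[-1] == '</s>':
                expanded.append((tokens, logp))  # finished, carry over as-is
                continue
            for tok, p in NEXT[tokens[-1]].items():
                expanded.append((tokens + [tok], logp + math.log(p)))
        beams = sorted(expanded, key=lambda item: item[1], reverse=True)[:beam_size]
        if all(tokens[-1] == '</s>' for tokens, _ in beams):
            break
    return beams[0]


greedy_tokens, greedy_logp = greedy()
beam_tokens, beam_logp = beam(beam_size=2)
print(greedy_tokens, beam_tokens)
```

In this toy table greedy decoding commits to `a` because it is the best first step, while the beam keeps `b` alive long enough to find the sequence with the higher overall probability.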
[10]:
from pprint import pprint
[11]:
# https://www.malaymail.com/news/malaysia/2020/07/01/dr-mahathir-again-claims-anwar-lacks-popularity-with-malays-to-be-pakatans/1880420

string_news1 = 'KUALA LUMPUR, July 1 - Datuk Seri Anwar Ibrahim is not suitable to as the prime minister candidate as he is allegedly not "popular" among the Malays, Tun Dr Mahathir Mohamad claimed. The former prime minister reportedly said the PKR president needs someone like himself in order to acquire support from the Malays and win the election.'
pprint(string_news1)
('KUALA LUMPUR, July 1 - Datuk Seri Anwar Ibrahim is not suitable to as the '
 'prime minister candidate as he is allegedly not "popular" among the Malays, '
 'Tun Dr Mahathir Mohamad claimed. The former prime minister reportedly said '
 'the PKR president needs someone like himself in order to acquire support '
 'from the Malays and win the election.')
[12]:
# https://edition.cnn.com/2020/07/06/politics/new-york-attorney-general-blm/index.html

string_news2 = '(CNN)New York Attorney General Letitia James on Monday ordered the Black Lives Matter Foundation -- which she said is not affiliated with the larger Black Lives Matter movement -- to stop collecting donations in New York. "I ordered the Black Lives Matter Foundation to stop illegally accepting donations that were intended for the #BlackLivesMatter movement. This foundation is not affiliated with the movement, yet it accepted countless donations and deceived goodwill," James tweeted.'
pprint(string_news2)
('(CNN)New York Attorney General Letitia James on Monday ordered the Black '
 'Lives Matter Foundation -- which she said is not affiliated with the larger '
 'Black Lives Matter movement -- to stop collecting donations in New York. "I '
 'ordered the Black Lives Matter Foundation to stop illegally accepting '
 'donations that were intended for the #BlackLivesMatter movement. This '
 'foundation is not affiliated with the movement, yet it accepted countless '
 'donations and deceived goodwill," James tweeted.')
[13]:
# https://www.thestar.com.my/business/business-news/2020/07/04/malaysia-worries-new-eu-food-rules-could-hurt-palm-oil-exports

string_news3 = 'Amongst the wide-ranging initiatives proposed are a sustainable food labelling framework, a reformulation of processed foods, and a sustainability chapter in all EU bilateral trade agreements. The EU also plans to publish a proposal for a legislative framework for sustainable food systems by 2023 to ensure all foods on the EU market become increasingly sustainable.'
pprint(string_news3)
('Amongst the wide-ranging initiatives proposed are a sustainable food '
 'labelling framework, a reformulation of processed foods, and a '
 'sustainability chapter in all EU bilateral trade agreements. The EU also '
 'plans to publish a proposal for a legislative framework for sustainable food '
 'systems by 2023 to ensure all foods on the EU market become increasingly '
 'sustainable.')
[14]:
# https://jamesclear.com/articles

string_article1 = 'This page shares my best articles to read on topics like health, happiness, creativity, productivity and more. The central question that drives my work is, “How can we live better?” To answer that question, I like to write about science-based ways to solve practical problems.'
pprint(string_article1)
('This page shares my best articles to read on topics like health, happiness, '
 'creativity, productivity and more. The central question that drives my work '
 'is, “How can we live better?” To answer that question, I like to write about '
 'science-based ways to solve practical problems.')
[15]:
# https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536

string_article2 = 'Fuzzy matching at scale. From 3.7 hours to 0.2 seconds. How to perform intelligent string matching in a way that can scale to even the biggest data sets. Data in the real world is messy. Dealing with messy data sets is painful and burns through time which could be spent analysing the data itself.'
pprint(string_article2)
('Fuzzy matching at scale. From 3.7 hours to 0.2 seconds. How to perform '
 'intelligent string matching in a way that can scale to even the biggest data '
 'sets. Data in the real world is messy. Dealing with messy data sets is '
 'painful and burns through time which could be spent analysing the data '
 'itself.')
[16]:
random_string1 = 'i am in medical school.'
random_string2 = 'Emmerdale is the debut studio album,songs were not released in the U.S <> These songs were not released in the U.S. edition of said album and were previously unavailable on any U.S. release.'
pprint(random_string2)
('Emmerdale is the debut studio album,songs were not released in the U.S <> '
 'These songs were not released in the U.S. edition of said album and were '
 'previously unavailable on any U.S. release.')

Translate transformer base#

[12]:
%%time

pprint(transformer.greedy_decoder([string_news1, string_news2, string_news3]))
['KUALA LUMPUR 1 Julai - Datuk Seri Anwar Ibrahim tidak sesuai menjadi calon '
 'Perdana Menteri kerana beliau didakwa tidak "popular" dalam kalangan orang '
 'Melayu, Tun Dr Mahathir Mohamad mendakwa, bekas Perdana Menteri itu '
 'dilaporkan berkata Presiden PKR itu memerlukan seseorang seperti dirinya '
 'bagi mendapatkan sokongan daripada orang Melayu dan memenangi pilihan raya.',
 '(CNN) Peguam Negara New York Letitia James pada hari Isnin memerintahkan '
 'Black Lives Matter Foundation - yang menurutnya tidak berafiliasi dengan '
 'gerakan Black Lives Matter yang lebih besar - untuk berhenti mengumpulkan '
 'sumbangan di New York. "Saya memerintahkan Black Lives Matter Foundation '
 'untuk berhenti menerima sumbangan secara haram yang bertujuan untuk gerakan '
 '#BlackLivesMatter. Yayasan ini tidak berafiliasi dengan gerakan itu, namun '
 'ia menerima banyak sumbangan dan muhibah yang ditipu," tweet James.',
 'Di antara inisiatif luas yang diusulkan adalah kerangka pelabelan makanan '
 'yang berkelanjutan, reformulasi makanan yang diproses, dan bab keberlanjutan '
 'dalam semua perjanjian perdagangan dua hala EU. EU juga berencana untuk '
 'menerbitkan proposal untuk kerangka perundangan untuk sistem makanan lestari '
 'pada tahun 2023 untuk memastikan semua makanan di pasar EU menjadi semakin '
 'lestari.']
CPU times: user 24.3 s, sys: 14 s, total: 38.3 s
Wall time: 11.6 s
[13]:
%%time

pprint(transformer.greedy_decoder([string_article1, string_article2]))
['Halaman ini berkongsi artikel terbaik saya untuk dibaca mengenai topik '
 'seperti kesihatan, kebahagiaan, kreativiti, produktiviti dan banyak lagi. '
 'Soalan utama yang mendorong kerja saya adalah, "Bagaimana kita dapat hidup '
 'lebih baik?" Untuk menjawab soalan itu, saya suka menulis mengenai kaedah '
 'berasaskan sains untuk menyelesaikan masalah praktikal.',
 'Pemadanan kabur pada skala. Dari 3.7 jam hingga 0.2 saat. Cara melakukan '
 'pemadanan rentetan pintar dengan cara yang dapat meningkatkan bahkan set '
 'data terbesar. Data di dunia nyata tidak kemas. Berurusan dengan set data '
 'yang tidak kemas menyakitkan dan terbakar sepanjang masa yang dapat '
 'dihabiskan untuk menganalisis data itu sendiri.']
CPU times: user 15.9 s, sys: 9.21 s, total: 25.2 s
Wall time: 6.32 s
[14]:
%%time

pprint(transformer.greedy_decoder([random_string1, random_string2]))
['saya di sekolah perubatan.',
 'Emmerdale adalah album studio debut, lagu-lagu tidak dikeluarkan di A.S <> '
 'Lagu-lagu ini tidak dikeluarkan dalam edisi A.S. album tersebut dan '
 'sebelumnya tidak tersedia pada sebarang pelepasan A.S.']
CPU times: user 9.98 s, sys: 5.52 s, total: 15.5 s
Wall time: 4.23 s

Translate transformer small#

[15]:
%%time

pprint(transformer_small.greedy_decoder([string_news1, string_news2, string_news3]))
['KUALA LUMPUR 1 Julai - Datuk Seri Anwar Ibrahim tidak sesuai kerana calon '
 'perdana menteri kerana didakwa tidak "popular" dalam kalangan orang Melayu, '
 'Tun Dr Mahathir Mohamad mendakwa. Bekas perdana menteri itu dilaporkan '
 'berkata, presiden PKR itu memerlukan seseorang seperti dirinya sendiri untuk '
 'memperoleh sokongan daripada orang Melayu dan memenangi pilihan raya.hari '
 'ini, Datuk Seri Anwar Ibrahim tidak sesuai untuk menjadi calon',
 '(CNN) Peguam Negara New York Letitia James pada hari Isnin memerintahkan '
 'Yayasan Black Lives Matter - yang menurutnya tidak berafiliasi dengan '
 'gerakan Black Lives Matter yang lebih besar - untuk berhenti mengumpulkan '
 'sumbangan di New York. "Saya memerintahkan Yayasan Black Lives Matter untuk '
 'berhenti menerima sumbangan secara haram yang bertujuan untuk gerakan '
 '#BlackLivesMatter. Yayasan ini tidak berafiliasi dengan gerakan itu, namun '
 'ia menerima banyak sumbangan dan muhibah yang menipu," tweet James.',
 'Amongst inisiatif luas yang dicadangkan adalah kerangka kerja kerja kerja '
 'makanan yang berkelanjutan, penyusunan semula makanan yang diproses, dan bab '
 'kelestarian dalam semua perjanjian perdagangan dua hala EU. EU juga '
 'merancang untuk menerbitkan cadangan kerangka perundangan untuk sistem '
 'makanan lestari pada tahun 2023 untuk memastikan semua makanan di pasaran EU '
 'semakin lestari.']
CPU times: user 3.69 s, sys: 773 ms, total: 4.46 s
Wall time: 1.61 s
[16]:
%%time

pprint(transformer_small.greedy_decoder([string_article1, string_article2]))
['Halaman ini berkongsi artikel terbaik saya untuk membaca topik seperti '
 'kesihatan, kebahagiaan, kreativiti, produktiviti dan banyak lagi. Soalan '
 'pusat yang mendorong karya saya adalah, "Bagaimana kita dapat hidup lebih '
 'baik?" Untuk menjawab soalan itu, saya suka menulis mengenai cara berasaskan '
 'sains untuk menyelesaikan masalah praktikal.',
 'Pemadanan Fuzzy pada skala. Dari 3.7 jam hingga 0.2 saat. Cara melakukan '
 'pemadanan rentetan pintar dengan cara yang dapat meningkatkan set data '
 'terbesar bahkan. Data di dunia nyata tidak kemas. Berurusan dengan set data '
 'yang tidak kemas menyakitkan dan terbakar melalui masa yang dapat dihabiskan '
 'untuk menganalisis data itu sendiri.']
CPU times: user 2.45 s, sys: 384 ms, total: 2.84 s
Wall time: 738 ms
[17]:
%%time

pprint(transformer_small.greedy_decoder([random_string1, random_string2]))
['saya berada di sekolah perubatan.',
 'Emmerdale adalah album studio sulung, lagu-lagu tidak dikeluarkan di A.S <> '
 'Lagu-lagu ini tidak dikeluarkan di edisi A.S. yang dikatakan album dan '
 'sebelumnya tidak tersedia di mana-mana pelepasan A.S.']
CPU times: user 1.7 s, sys: 291 ms, total: 1.99 s
Wall time: 535 ms

Compare with Google Translate using googletrans#

Install it with:

pip3 install googletrans==4.0.0rc1
[17]:
from googletrans import Translator

translator = Translator()
[18]:
r = translator.translate(string_news1, src='en', dest = 'ms')
print(r.text)
KUALA LUMPUR, 1 Julai - Datuk Seri Anwar Ibrahim tidak sesuai sebagai calon Perdana Menteri kerana dia tidak "popular" di kalangan orang Melayu, Tun Dr Mahathir Mohamad mendakwa.Bekas Perdana Menteri dilaporkan berkata presiden PKR memerlukan seseorang seperti dirinya untuk memperoleh sokongan daripada orang Melayu dan memenangi pilihan raya.
[19]:
r = translator.translate(string_news2, src='en', dest = 'ms')
print(r.text)
(CNN) Peguam Negara New York, Letitia James pada hari Isnin mengarahkan Yayasan Black Lives Matter - yang dikatakannya tidak bergabung dengan pergerakan Black Lives Matter yang lebih besar - untuk berhenti mengumpul sumbangan di New York."Saya mengarahkan Yayasan Black Lives Matter untuk berhenti menerima sumbangan secara haram yang dimaksudkan untuk gerakan #BlackLivesMatter.
[20]:
r = translator.translate(string_news3, src='en', dest = 'ms')
print(r.text)
Di antara inisiatif yang luas yang dicadangkan adalah rangka kerja pelabelan makanan yang mampan, pembaharuan makanan yang diproses, dan bab kemampanan dalam semua perjanjian perdagangan dua hala EU.EU juga merancang untuk menerbitkan cadangan untuk rangka kerja perundangan untuk sistem makanan lestari menjelang 2023 untuk memastikan semua makanan di pasaran EU menjadi semakin mampan.
[21]:
r = translator.translate(string_article1, src='en', dest = 'ms')
print(r.text)
Halaman ini berkongsi artikel terbaik saya untuk membaca topik seperti kesihatan, kebahagiaan, kreativiti, produktiviti dan banyak lagi.Soalan utama yang mendorong kerja saya adalah, "Bagaimana kita dapat hidup lebih baik?"Untuk menjawab soalan itu, saya ingin menulis tentang cara berasaskan sains untuk menyelesaikan masalah praktikal.
[22]:
r = translator.translate(string_article2, src='en', dest = 'ms')
print(r.text)
Pencocokan kabur pada skala.Dari 3.7 jam hingga 0.2 saat.Bagaimana untuk melakukan padanan rentetan pintar dengan cara yang boleh skala ke set data terbesar.Data di dunia nyata adalah kemas.Berurusan dengan set data berantakan adalah menyakitkan dan terbakar melalui masa yang boleh dibelanjakan menganalisis data itu sendiri.
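To put a rough number on how far two translations differ, a token-level similarity from the standard library can be used. This is a minimal sketch and not a replacement for BLEU or chrF++: `token_similarity` is a hypothetical helper, and the two strings are the base and small model outputs for `random_string1` shown earlier in this tutorial.

```python
from difflib import SequenceMatcher


def token_similarity(first: str, second: str) -> float:
    """Similarity in [0, 1] between two sentences, compared token by token."""
    # Whitespace tokenisation is a simplification; punctuation stays attached.
    return SequenceMatcher(None, first.split(), second.split()).ratio()


base_output = 'saya di sekolah perubatan.'
small_output = 'saya berada di sekolah perubatan.'

score = token_similarity(base_output, small_output)
print(round(score, 3))
```

The same helper can be applied to `r.text` from googletrans and a model output to see how closely they agree on a given sentence; a low score flags sentences worth inspecting by hand.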