MS to JAV HuggingFace#

This tutorial is available as an IPython notebook at Malaya/example/ms-jav-translation-huggingface.

This module was trained on standard language plus augmented local language structures, so proceed with caution.

[1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''
[2]:
%%time

import malaya
import logging

logging.basicConfig(level=logging.INFO)
CPU times: user 3.94 s, sys: 3.32 s, total: 7.26 s
Wall time: 3.1 s

List available HuggingFace models#

[3]:
malaya.translation.ms_jav.available_huggingface()
INFO:malaya.translation.ms_jav:tested on FLORES200 MS-JAV (zsm_Latn-jav_Latn) pair `dev` set, https://github.com/facebookresearch/flores/tree/main/flores200
[3]:
Model                                                                        Size (MB)  BLEU       SacreBLEU Verbose                                  SacreBLEU-chrF++-FLORES200  Suggested length
mesolitica/finetune-translation-austronesian-t5-tiny-standard-bahasa-cased   139        23.796499  59.2/31.6/18.2/10.8 (BP = 0.966 ratio = 0.967 ...  51.21                       512
mesolitica/finetune-translation-austronesian-t5-small-standard-bahasa-cased  242        24.599989  58.3/31.6/18.5/11.2 (BP = 0.990 ratio = 0.990 ...  51.65                       512
mesolitica/finetune-translation-austronesian-t5-base-standard-bahasa-cased   892        24.642363  60.1/32.7/19.1/11.5 (BP = 0.961 ratio = 0.961 ...  51.91                       512
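The scores above can also be compared programmatically, for example to pick the highest-BLEU model. A minimal sketch using the numbers from the table (this is plain Python over hand-copied values, not a Malaya API):

```python
# BLEU scores and sizes copied from the table above.
models = {
    'mesolitica/finetune-translation-austronesian-t5-tiny-standard-bahasa-cased': {'size_mb': 139, 'bleu': 23.796499},
    'mesolitica/finetune-translation-austronesian-t5-small-standard-bahasa-cased': {'size_mb': 242, 'bleu': 24.599989},
    'mesolitica/finetune-translation-austronesian-t5-base-standard-bahasa-cased': {'size_mb': 892, 'bleu': 24.642363},
}

# Highest BLEU overall.
best = max(models, key=lambda name: models[name]['bleu'])

# Best BLEU-per-MB trade-off, if memory matters more than raw quality.
most_efficient = max(models, key=lambda name: models[name]['bleu'] / models[name]['size_mb'])
```

Note the base model gains only ~0.04 BLEU over the small model while being ~3.7x larger, so the small model is a reasonable default.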

Load Transformer models#

def huggingface(
    model: str = 'mesolitica/finetune-translation-austronesian-t5-small-standard-bahasa-cased',
    force_check: bool = True,
    **kwargs,
):
    """
    Load HuggingFace model to translate MS-to-JAV.

    Parameters
    ----------
    model: str, optional (default='mesolitica/finetune-translation-austronesian-t5-small-standard-bahasa-cased')
        Check available models at `malaya.translation.ms_jav.available_huggingface()`.

    Returns
    -------
    result: malaya.torch_model.huggingface.Generator
    """
[4]:
transformer_huggingface = malaya.translation.ms_jav.huggingface()

Translate#

def generate(self, strings: List[str], **kwargs):
    """
    Generate texts from the input.

    Parameters
    ----------
    strings : List[str]
    **kwargs: vector arguments pass to huggingface `generate` method.
        Read more at https://huggingface.co/docs/transformers/main_classes/text_generation

    Returns
    -------
    result: List[str]
    """

For better results, always split by end of sentences.
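For instance, a long paragraph can be broken into sentences before translation. A naive regex-based splitter is sketched below (Malaya ships its own text tokenizers, so treat this only as an illustration of the idea):

```python
import re

def split_sentences(text):
    # Split after sentence-ending punctuation followed by whitespace.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

paragraph = 'Saya suka makan nasi. Dia pergi ke pasar! Awak nak ikut?'
sentences = split_sentences(paragraph)
# each sentence can then be passed to `generate` individually
```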

[5]:
from pprint import pprint
[6]:
text = 'saya tak suka ayam goreng dan itik'
[7]:
%%time

pprint(transformer_huggingface.generate([text], max_length = 1000))
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
['Aku ora seneng karo pitik goreng lan bebek']
CPU times: user 2.29 s, sys: 0 ns, total: 2.29 s
Wall time: 198 ms