JAV to MS HuggingFace#
This tutorial is available as an IPython notebook at Malaya/example/jav-ms-translation-huggingface.
This module was trained on standard language and augmented local language structures, so proceed with caution.
[1]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''
[2]:
%%time
import malaya
import logging
logging.basicConfig(level=logging.INFO)
CPU times: user 4.1 s, sys: 3.32 s, total: 7.42 s
Wall time: 3.35 s
List available HuggingFace models#
[3]:
malaya.translation.jav_ms.available_huggingface()
INFO:malaya.translation.jav_ms:tested on FLORES200 JAV-MS (jav_Latn-zsm_Latn) pair `dev` set, https://github.com/facebookresearch/flores/tree/main/flores200
[3]:
| Model | Size (MB) | BLEU | SacreBLEU Verbose | SacreBLEU-chrF++-FLORES200 | Suggested length |
|---|---|---|---|---|---|
| mesolitica/finetune-translation-austronesian-t5-tiny-standard-bahasa-cased | 139 | 23.797628 | 58.2/31.1/18.1/10.8 (BP = 0.977 ratio = 0.977 ... | 50.65 | 512 |
| mesolitica/finetune-translation-austronesian-t5-small-standard-bahasa-cased | 242 | 25.244377 | 57.7/31.9/19.0/11.6 (BP = 1.000 ratio = 1.022 ... | 52.58 | 512 |
| mesolitica/finetune-translation-austronesian-t5-base-standard-bahasa-cased | 892 | 25.772897 | 58.9/32.6/19.6/12.1 (BP = 0.992 ratio = 0.992 ... | 52.21 | 512 |
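If `available_huggingface()` returns a pandas DataFrame (an assumption; the return type may differ across Malaya versions), you can also inspect the table programmatically, for example to pick the checkpoint with the best BLEU:

# assumption: available_huggingface() returns a pandas DataFrame indexed by model name
df = malaya.translation.jav_ms.available_huggingface()
# pick the checkpoint with the highest BLEU on the FLORES200 JAV-MS dev set
best = df['BLEU'].astype(float).idxmax()
print(best)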
Load Transformer models#
def huggingface(
model: str = 'mesolitica/finetune-translation-austronesian-t5-small-standard-bahasa-cased',
force_check: bool = True,
**kwargs,
):
"""
Load HuggingFace model to translate JAV-to-MS.
Parameters
----------
model: str, optional (default='mesolitica/finetune-translation-austronesian-t5-small-standard-bahasa-cased')
Check available models at `malaya.translation.jav_ms.available_huggingface()`.
Returns
-------
result: malaya.torch_model.huggingface.Generator
"""
[4]:
transformer_huggingface = malaya.translation.jav_ms.huggingface()
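Any model name from the table above can be passed explicitly via the `model` argument. A sketch loading the larger base checkpoint; the default small model is usually a reasonable size/quality trade-off:

# sketch: load the base checkpoint listed in the table instead of the default small one
transformer_base = malaya.translation.jav_ms.huggingface(
    model = 'mesolitica/finetune-translation-austronesian-t5-base-standard-bahasa-cased'
)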
Translate#
def generate(self, strings: List[str], **kwargs):
"""
Generate texts from the input.
Parameters
----------
strings : List[str]
**kwargs: keyword arguments passed to the HuggingFace `generate` method.
Read more at https://huggingface.co/docs/transformers/main_classes/text_generation
Returns
-------
result: List[str]
"""
For better results, always split long input into individual sentences before translating.
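As a minimal sketch of that advice, a naive regex split on sentence-ending punctuation is often enough; the splitting code below is not a Malaya helper, and the second Javanese sentence is made up purely for illustration:

import re

paragraph = 'Aku ora seneng lele lan pitik goreng. Aku luwih seneng sate.'
# naive split on ., ! or ? followed by whitespace; adjust for your own text
sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', paragraph) if s.strip()]
print(transformer_huggingface.generate(sentences, max_length = 1000))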
[5]:
from pprint import pprint
[6]:
text = 'Aku ora seneng lele lan pitik goreng'
[7]:
%%time
pprint(transformer_huggingface.generate([text],
max_length = 1000))
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
['Saya tidak suka lele dan ayam goreng']
CPU times: user 1.08 s, sys: 0 ns, total: 1.08 s
Wall time: 95.1 ms
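Because `**kwargs` are forwarded to the HuggingFace `generate` method, standard decoding options can be passed through as well. A sketch using beam search, with parameter names taken from the HuggingFace text generation API:

# sketch: beam search decoding, kwargs forwarded to HuggingFace `generate`
pprint(transformer_huggingface.generate([text],
                                        max_length = 1000,
                                        num_beams = 5,
                                        early_stopping = True))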