{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# MS to EN HuggingFace" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/ms-en-translation-huggingface](https://github.com/huseinzol05/Malaya/tree/master/example/ms-en-translation-huggingface).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module was trained on standard language and augmented local language structures; proceed with caution.\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "os.environ['CUDA_VISIBLE_DEVICES'] = ''" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.81 s, sys: 3.46 s, total: 7.27 s\n", "Wall time: 3.17 s\n" ] } ], "source": [ "%%time\n", "\n", "import malaya\n", "import logging\n", "\n", "logging.basicConfig(level=logging.INFO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available HuggingFace models" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.translation.ms_en:tested on FLORES200 EN-MS (eng_Latn-zsm_Latn) pair `dev` set, https://github.com/facebookresearch/flores/tree/main/flores200\n", "INFO:malaya.translation.ms_en:for noisy, tested on noisy twitter google translation, https://huggingface.co/datasets/mesolitica/augmentation-test-set\n" ] }, { "data": { "text/html": [ "
\n", "
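<table>\n", "
<thead><tr><th></th><th>Size (MB)</th><th>BLEU</th><th>SacreBLEU Verbose</th><th>SacreBLEU-chrF++-FLORES200</th><th>Suggested length</th></tr></thead>\n", "
<tbody>\n", "
<tr><th>mesolitica/finetune-translation-t5-tiny-standard-bahasa-cased-v2</th><td>139</td><td>37.260485</td><td>68.3/44.1/30.5/21.4 (BP = 0.995 ratio = 0.995 ...</td><td>61.29</td><td>256</td></tr>\n", "
<tr><th>mesolitica/finetune-translation-t5-small-standard-bahasa-cased-v2</th><td>242</td><td>42.010218</td><td>71.7/49.0/35.6/26.1 (BP = 0.989 ratio = 0.989 ...</td><td>64.67</td><td>256</td></tr>\n", "
<tr><th>mesolitica/finetune-translation-t5-base-standard-bahasa-cased-v2</th><td>892</td><td>43.408853</td><td>72.3/50.5/37.1/27.7 (BP = 0.987 ratio = 0.987 ...</td><td>65.44</td><td>256</td></tr>\n", "
<tr><th>mesolitica/finetune-noisy-translation-t5-tiny-bahasa-cased-v3</th><td>139</td><td>60.000967</td><td>77.9/63.9/54.6/47.7 (BP = 1.000 ratio = 1.036 ...</td><td>None</td><td>256</td></tr>\n", "
<tr><th>mesolitica/finetune-noisy-translation-t5-small-bahasa-cased-v3</th><td>242</td><td>64.062582</td><td>80.1/67.7/59.1/52.5 (BP = 1.000 ratio = 1.042 ...</td><td>None</td><td>256</td></tr>\n", "
<tr><th>mesolitica/finetune-noisy-translation-t5-base-bahasa-cased-v2</th><td>892</td><td>64.583819</td><td>80.2/68.1/59.8/53.2 (BP = 1.000 ratio = 1.048 ...</td><td>None</td><td>256</td></tr>\n", "
</tbody>\n", "
</table>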
\n", "
" ], "text/plain": [ " Size (MB) BLEU \\\n", "mesolitica/finetune-translation-t5-tiny-standar... 139 37.260485 \n", "mesolitica/finetune-translation-t5-small-standa... 242 42.010218 \n", "mesolitica/finetune-translation-t5-base-standar... 892 43.408853 \n", "mesolitica/finetune-noisy-translation-t5-tiny-b... 139 60.000967 \n", "mesolitica/finetune-noisy-translation-t5-small-... 242 64.062582 \n", "mesolitica/finetune-noisy-translation-t5-base-b... 892 64.583819 \n", "\n", " SacreBLEU Verbose \\\n", "mesolitica/finetune-translation-t5-tiny-standar... 68.3/44.1/30.5/21.4 (BP = 0.995 ratio = 0.995 ... \n", "mesolitica/finetune-translation-t5-small-standa... 71.7/49.0/35.6/26.1 (BP = 0.989 ratio = 0.989 ... \n", "mesolitica/finetune-translation-t5-base-standar... 72.3/50.5/37.1/27.7 (BP = 0.987 ratio = 0.987 ... \n", "mesolitica/finetune-noisy-translation-t5-tiny-b... 77.9/63.9/54.6/47.7 (BP = 1.000 ratio = 1.036 ... \n", "mesolitica/finetune-noisy-translation-t5-small-... 80.1/67.7/59.1/52.5 (BP = 1.000 ratio = 1.042 ... \n", "mesolitica/finetune-noisy-translation-t5-base-b... 80.2/68.1/59.8/53.2 (BP = 1.000 ratio = 1.048 ... \n", "\n", " SacreBLEU-chrF++-FLORES200 \\\n", "mesolitica/finetune-translation-t5-tiny-standar... 61.29 \n", "mesolitica/finetune-translation-t5-small-standa... 64.67 \n", "mesolitica/finetune-translation-t5-base-standar... 65.44 \n", "mesolitica/finetune-noisy-translation-t5-tiny-b... None \n", "mesolitica/finetune-noisy-translation-t5-small-... None \n", "mesolitica/finetune-noisy-translation-t5-base-b... None \n", "\n", " Suggested length \n", "mesolitica/finetune-translation-t5-tiny-standar... 256 \n", "mesolitica/finetune-translation-t5-small-standa... 256 \n", "mesolitica/finetune-translation-t5-base-standar... 256 \n", "mesolitica/finetune-noisy-translation-t5-tiny-b... 256 \n", "mesolitica/finetune-noisy-translation-t5-small-... 256 \n", "mesolitica/finetune-noisy-translation-t5-base-b... 
256 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.translation.ms_en.available_huggingface()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Transformer models\n", "\n", "```python\n", "def huggingface(\n", " model: str = 'mesolitica/finetune-translation-t5-small-standard-bahasa-cased-v2',\n", " force_check: bool = True,\n", " **kwargs,\n", "):\n", " \"\"\"\n", " Load HuggingFace model to translate MS-to-EN.\n", "\n", " Parameters\n", " ----------\n", " model: str, optional (default='mesolitica/finetune-translation-t5-small-standard-bahasa-cased-v2')\n", " Check available models at `malaya.translation.ms_en.available_huggingface()`.\n", " force_check: bool, optional (default=True)\n", " Force check model one of malaya model.\n", " Set to False if you have your own huggingface model.\n", "\n", " Returns\n", " -------\n", " result: malaya.torch_model.huggingface.Generator\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "a9a0bb51ba0f4daaa4aa81af87d2dec3", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading: 0%| | 0.00/233M [00:00