{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# IND to MS HuggingFace" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/ind-ms-translation-huggingface](https://github.com/huseinzol05/Malaya/tree/master/example/ind-ms-translation-huggingface).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module was trained on standard language and augmented local language structures; proceed with caution.\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "os.environ['CUDA_VISIBLE_DEVICES'] = ''" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.75 s, sys: 3.58 s, total: 7.33 s\n", "Wall time: 3.06 s\n" ] } ], "source": [ "%%time\n", "\n", "import malaya\n", "import logging\n", "\n", "logging.basicConfig(level=logging.INFO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available HuggingFace models" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.translation.ind_ms:tested on FLORES200 IND-MS (ind_Latn-zsm_Latn) pair `dev` set, https://github.com/facebookresearch/flores/tree/main/flores200\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr>\n",
"      <th></th>\n",
"      <th>Size (MB)</th>\n",
"      <th>BLEU</th>\n",
"      <th>SacreBLEU Verbose</th>\n",
"      <th>SacreBLEU-chrF++-FLORES200</th>\n",
"      <th>Suggested length</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>mesolitica/finetune-translation-austronesian-t5-tiny-standard-bahasa-cased</th>\n",
"      <td>139</td>\n",
"      <td>30.277471</td>\n",
"      <td>64.2/38.0/24.1/15.6 (BP = 0.978 ratio = 0.978 ...</td>\n",
"      <td>57.38</td>\n",
"      <td>256</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>mesolitica/finetune-translation-austronesian-t5-small-standard-bahasa-cased</th>\n",
"      <td>242</td>\n",
"      <td>30.24359</td>\n",
"      <td>61.1/36.9/23.8/15.6 (BP = 1.000 ratio = 1.052 ...</td>\n",
"      <td>58.43</td>\n",
"      <td>512</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>mesolitica/finetune-translation-austronesian-t5-base-standard-bahasa-cased</th>\n",
"      <td>892</td>\n",
"      <td>31.494673</td>\n",
"      <td>64.1/38.8/25.1/16.5 (BP = 0.989 ratio = 0.990 ...</td>\n",
"      <td>58.1</td>\n",
"      <td>512</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Size (MB) BLEU \\\n", "mesolitica/finetune-translation-austronesian-t5... 139 30.277471 \n", "mesolitica/finetune-translation-austronesian-t5... 242 30.24359 \n", "mesolitica/finetune-translation-austronesian-t5... 892 31.494673 \n", "\n", " SacreBLEU Verbose \\\n", "mesolitica/finetune-translation-austronesian-t5... 64.2/38.0/24.1/15.6 (BP = 0.978 ratio = 0.978 ... \n", "mesolitica/finetune-translation-austronesian-t5... 61.1/36.9/23.8/15.6 (BP = 1.000 ratio = 1.052 ... \n", "mesolitica/finetune-translation-austronesian-t5... 64.1/38.8/25.1/16.5 (BP = 0.989 ratio = 0.990 ... \n", "\n", " SacreBLEU-chrF++-FLORES200 \\\n", "mesolitica/finetune-translation-austronesian-t5... 57.38 \n", "mesolitica/finetune-translation-austronesian-t5... 58.43 \n", "mesolitica/finetune-translation-austronesian-t5... 58.1 \n", "\n", " Suggested length \n", "mesolitica/finetune-translation-austronesian-t5... 256 \n", "mesolitica/finetune-translation-austronesian-t5... 512 \n", "mesolitica/finetune-translation-austronesian-t5... 
512 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.translation.ind_ms.available_huggingface()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Transformer models\n", "\n", "```python\n", "def huggingface(\n", "    model: str = 'mesolitica/finetune-translation-austronesian-t5-small-standard-bahasa-cased',\n", "    force_check: bool = True,\n", "    **kwargs,\n", "):\n", "    \"\"\"\n", "    Load HuggingFace model to translate IND-to-MS.\n", "\n", "    Parameters\n", "    ----------\n", "    model: str, optional (default='mesolitica/finetune-translation-austronesian-t5-small-standard-bahasa-cased')\n", "        Check available models at `malaya.translation.ind_ms.available_huggingface()`.\n", "\n", "    Returns\n", "    -------\n", "    result: malaya.torch_model.huggingface.Generator\n", "    \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "transformer_huggingface = malaya.translation.ind_ms.huggingface()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Translate\n", "\n", "```python\n", "def generate(self, strings: List[str], **kwargs):\n", "    \"\"\"\n", "    Generate texts from the input.\n", "\n", "    Parameters\n", "    ----------\n", "    strings : List[str]\n", "    **kwargs: keyword arguments passed to the HuggingFace `generate` method.\n", "        Read more at https://huggingface.co/docs/transformers/main_classes/text_generation\n", "\n", "    Returns\n", "    -------\n", "    result: List[str]\n", "    \"\"\"\n", "```\n", "\n", "**For better results, always split the text at sentence boundaries before translating**." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from pprint import pprint" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "news = 'Jakarta Pusat, Kominfo - Presiden Joko Widodo menerima surat kepercayaan dari sebelas duta besar luar biasa dan berkuasa penuh (LBBP) negara-negara sahabat. 
Penyerahan surat kepercayaan tersebut digelar di Ruang Kredensial, Istana Merdeka, Jakarta Pusat, Senin (20/02/2023). Prosesi acara penyerahan surat kepercayaan dimulai dengan diperdengarkannya lagu kebangsaan dari masing-masing negara sahabat setelah duta besar tiba di Istana Merdeka. Adapun kesebelas duta besar negara sahabat yang diterima oleh Presiden'\n", "text = 'Saya tidak suka ikan keli dan ayam goreng'" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Jakarta Pusat, Kominfo - Presiden Joko Widodo menerima surat kepercayaan '\n", " 'dari sebelas duta luar biasa dan penuh kuasa (LBBP) negara sahabat '\n", " 'Penyerahan surat kepercayaan itu digelar di Ruang Kredensial, Istana '\n", " 'Merdeka, Jakarta Pusat, Senin (20/02/2023) Prosesi penyerahan surat '\n", " 'kepercayaan dimulai dengan diperdengarkannya lagu kebangsaan dari '\n", " 'masing-masing negara sahabat setelah duta besar tiba di Istana Merdeka '\n", " 'Adapun kesebelas duta negara sahabat yang diterima Presiden.',\n", " 'Saya tidak suka lele dan ayam goreng']\n", "CPU times: user 13.3 s, sys: 1.45 ms, total: 13.3 s\n", "Wall time: 1.13 s\n" ] } ], "source": [ "%%time\n", "\n", "pprint(transformer_huggingface.generate([news, text], max_length=1000))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Compare with Google Translate using googletrans\n", "\n", "Install it with:\n", "\n", "```bash\n", "pip3 install googletrans==4.0.0rc1\n", "```" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "from googletrans import Translator\n", "\n", "translator = Translator()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "strings = [news, text]" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "JAKARTA PUSAT, Kominfo - Presiden 
Joko Widodo menerima surat kepercayaan dari sebelas duta hebat dan kuasa penuh (LBBP) negara -negara yang mesra.Penyerahan surat amanah telah diadakan di Bilik Kredensial, Istana Merdeka, Jakarta Tengah, Isnin (02/20/2023).Perarakan penyerahan kepercayaan bermula dengan bermain lagu kebangsaan dari setiap negara kawan selepas duta besar tiba di Istana Merdeka.Bagi sebelas duta negara yang mesra yang diterima oleh presiden\n", "Saya tidak suka ikan keli dan ayam goreng\n" ] } ], "source": [ "for t in strings:\n", "    r = translator.translate(t, src='id', dest='ms')\n", "    print(r.text)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }