{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# EN to MS Noisy HuggingFace" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/noisy-en-ms-translation-huggingface](https://github.com/huseinzol05/Malaya/tree/master/example/noisy-en-ms-translation-huggingface).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module was trained on standard language and augmented local language structures; proceed with caution.\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.27 s, sys: 3.58 s, total: 6.85 s\n", "Wall time: 2.26 s\n" ] } ], "source": [ "%%time\n", "\n", "import malaya\n", "import logging\n", "\n", "logging.basicConfig(level=logging.INFO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available HuggingFace models" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.translation.en_ms:tested on FLORES200 EN-MS (eng_Latn-zsm_Latn) pair `dev` set, https://github.com/facebookresearch/flores/tree/main/flores200\n", "INFO:malaya.translation.en_ms:for noisy, tested on noisy twitter google translation, https://huggingface.co/datasets/mesolitica/augmentation-test-set\n" ] }, { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Size (MB)</th><th>BLEU</th><th>SacreBLEU Verbose</th><th>SacreBLEU-chrF++-FLORES200</th><th>Suggested length</th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th>mesolitica/finetune-translation-t5-super-super-tiny-standard-bahasa-cased</th><td>23.3</td><td>36.290743</td><td>71.2/46.0/30.9/21.0 (BP = 0.950 ratio = 0.951 ...</td><td>61.89</td><td>256</td></tr>\n",
"    <tr><th>mesolitica/finetune-translation-t5-super-tiny-standard-bahasa-cased</th><td>50.7</td><td>39.188342</td><td>72.6/48.3/33.5/23.6 (BP = 0.960 ratio = 0.961 ...</td><td>64.03</td><td>256</td></tr>\n",
"    <tr><th>mesolitica/finetune-translation-t5-tiny-standard-bahasa-cased</th><td>139</td><td>41.625536</td><td>73.4/50.1/35.7/25.7 (BP = 0.971 ratio = 0.972 ...</td><td>65.7</td><td>256</td></tr>\n",
"    <tr><th>mesolitica/finetune-translation-t5-small-standard-bahasa-cased</th><td>242</td><td>43.937298</td><td>74.9/52.2/37.9/27.7 (BP = 0.976 ratio = 0.977 ...</td><td>67.43</td><td>256</td></tr>\n",
"    <tr><th>mesolitica/finetune-translation-t5-base-standard-bahasa-cased</th><td>892</td><td>44.173559</td><td>74.7/52.3/38.0/28.0 (BP = 0.979 ratio = 0.979 ...</td><td>67.6</td><td>256</td></tr>\n",
"    <tr><th>mesolitica/finetune-noisy-translation-t5-tiny-bahasa-cased</th><td>139</td><td>41.036414</td><td>72.9/49.2/34.8/25.0 (BP = 0.977 ratio = 0.977 ...</td><td>65.58</td><td>256</td></tr>\n",
"    <tr><th>mesolitica/finetune-noisy-translation-t5-small-bahasa-cased</th><td>242</td><td>41.15794</td><td>72.2/48.8/34.5/24.8 (BP = 0.988 ratio = 0.988 ...</td><td>65.51</td><td>256</td></tr>\n",
"    <tr><th>mesolitica/finetune-noisy-translation-t5-base-bahasa-cased</th><td>892</td><td>41.827831</td><td>73.4/50.1/35.7/25.8 (BP = 0.982 ratio = 0.982 ...</td><td>66.51</td><td>256</td></tr>\n",
"    <tr><th>mesolitica/finetune-noisy-translation-t5-tiny-bahasa-cased-v2</th><td>139</td><td>60.000967</td><td>77.9/63.9/54.6/47.7 (BP = 1.000 ratio = 1.036 ...</td><td>None</td><td>256</td></tr>\n",
"    <tr><th>mesolitica/finetune-noisy-translation-t5-small-bahasa-cased-v4</th><td>242</td><td>64.062582</td><td>80.1/67.7/59.1/52.5 (BP = 1.000 ratio = 1.042 ...</td><td>None</td><td>256</td></tr>\n",
"    <tr><th>mesolitica/finetune-noisy-translation-t5-base-bahasa-cased-v2</th><td>892</td><td>64.583819</td><td>80.2/68.1/59.8/53.2 (BP = 1.000 ratio = 1.048 ...</td><td>None</td><td>256</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Size (MB) BLEU \\\n", "mesolitica/finetune-translation-t5-super-super-... 23.3 36.290743 \n", "mesolitica/finetune-translation-t5-super-tiny-s... 50.7 39.188342 \n", "mesolitica/finetune-translation-t5-tiny-standar... 139 41.625536 \n", "mesolitica/finetune-translation-t5-small-standa... 242 43.937298 \n", "mesolitica/finetune-translation-t5-base-standar... 892 44.173559 \n", "mesolitica/finetune-noisy-translation-t5-tiny-b... 139 41.036414 \n", "mesolitica/finetune-noisy-translation-t5-small-... 242 41.15794 \n", "mesolitica/finetune-noisy-translation-t5-base-b... 892 41.827831 \n", "mesolitica/finetune-noisy-translation-t5-tiny-b... 139 60.000967 \n", "mesolitica/finetune-noisy-translation-t5-small-... 242 64.062582 \n", "mesolitica/finetune-noisy-translation-t5-base-b... 892 64.583819 \n", "\n", " SacreBLEU Verbose \\\n", "mesolitica/finetune-translation-t5-super-super-... 71.2/46.0/30.9/21.0 (BP = 0.950 ratio = 0.951 ... \n", "mesolitica/finetune-translation-t5-super-tiny-s... 72.6/48.3/33.5/23.6 (BP = 0.960 ratio = 0.961 ... \n", "mesolitica/finetune-translation-t5-tiny-standar... 73.4/50.1/35.7/25.7 (BP = 0.971 ratio = 0.972 ... \n", "mesolitica/finetune-translation-t5-small-standa... 74.9/52.2/37.9/27.7 (BP = 0.976 ratio = 0.977 ... \n", "mesolitica/finetune-translation-t5-base-standar... 74.7/52.3/38.0/28.0 (BP = 0.979 ratio = 0.979 ... \n", "mesolitica/finetune-noisy-translation-t5-tiny-b... 72.9/49.2/34.8/25.0 (BP = 0.977 ratio = 0.977 ... \n", "mesolitica/finetune-noisy-translation-t5-small-... 72.2/48.8/34.5/24.8 (BP = 0.988 ratio = 0.988 ... \n", "mesolitica/finetune-noisy-translation-t5-base-b... 73.4/50.1/35.7/25.8 (BP = 0.982 ratio = 0.982 ... \n", "mesolitica/finetune-noisy-translation-t5-tiny-b... 77.9/63.9/54.6/47.7 (BP = 1.000 ratio = 1.036 ... \n", "mesolitica/finetune-noisy-translation-t5-small-... 80.1/67.7/59.1/52.5 (BP = 1.000 ratio = 1.042 ... \n", "mesolitica/finetune-noisy-translation-t5-base-b... 
80.2/68.1/59.8/53.2 (BP = 1.000 ratio = 1.048 ... \n", "\n", " SacreBLEU-chrF++-FLORES200 \\\n", "mesolitica/finetune-translation-t5-super-super-... 61.89 \n", "mesolitica/finetune-translation-t5-super-tiny-s... 64.03 \n", "mesolitica/finetune-translation-t5-tiny-standar... 65.7 \n", "mesolitica/finetune-translation-t5-small-standa... 67.43 \n", "mesolitica/finetune-translation-t5-base-standar... 67.6 \n", "mesolitica/finetune-noisy-translation-t5-tiny-b... 65.58 \n", "mesolitica/finetune-noisy-translation-t5-small-... 65.51 \n", "mesolitica/finetune-noisy-translation-t5-base-b... 66.51 \n", "mesolitica/finetune-noisy-translation-t5-tiny-b... None \n", "mesolitica/finetune-noisy-translation-t5-small-... None \n", "mesolitica/finetune-noisy-translation-t5-base-b... None \n", "\n", " Suggested length \n", "mesolitica/finetune-translation-t5-super-super-... 256 \n", "mesolitica/finetune-translation-t5-super-tiny-s... 256 \n", "mesolitica/finetune-translation-t5-tiny-standar... 256 \n", "mesolitica/finetune-translation-t5-small-standa... 256 \n", "mesolitica/finetune-translation-t5-base-standar... 256 \n", "mesolitica/finetune-noisy-translation-t5-tiny-b... 256 \n", "mesolitica/finetune-noisy-translation-t5-small-... 256 \n", "mesolitica/finetune-noisy-translation-t5-base-b... 256 \n", "mesolitica/finetune-noisy-translation-t5-tiny-b... 256 \n", "mesolitica/finetune-noisy-translation-t5-small-... 256 \n", "mesolitica/finetune-noisy-translation-t5-base-b... 
256 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.translation.en_ms.available_huggingface()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Transformer models\n", "\n", "```python\n", "def huggingface(\n", " model: str = 'mesolitica/finetune-translation-t5-small-standard-bahasa-cased',\n", " force_check: bool = True,\n", " **kwargs,\n", "):\n", " \"\"\"\n", " Load HuggingFace model to translate EN-to-MS.\n", "\n", " Parameters\n", " ----------\n", " model: str, optional (default='mesolitica/finetune-translation-t5-small-standard-bahasa-cased')\n", " Check available models at `malaya.translation.en_ms.available_huggingface()`.\n", " force_check: bool, optional (default=True)\n", " Force check model one of malaya model.\n", " Set to False if you have your own huggingface model.\n", "\n", " Returns\n", " -------\n", " result: malaya.torch_model.huggingface.Generator\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "transformer = malaya.translation.en_ms.huggingface()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "transformer_noisy = malaya.translation.en_ms.huggingface(model = 'mesolitica/finetune-noisy-translation-t5-small-bahasa-cased-v4')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Translate\n", "\n", "```python\n", "def generate(self, strings: List[str], **kwargs):\n", " \"\"\"\n", " Generate texts from the input.\n", "\n", " Parameters\n", " ----------\n", " strings : List[str]\n", " **kwargs: vector arguments pass to huggingface `generate` method.\n", "\n", " Returns\n", " -------\n", " result: List[str]\n", " \"\"\"\n", "```\n", "\n", "**For better results, always split by end of sentences**." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from pprint import pprint" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('KUALA LUMPUR, July 1 - Datuk Seri Anwar Ibrahim is not suitable to as the '\n", " 'prime minister candidate as he is allegedly not \"popular\" among the Malays, '\n", " 'Tun Dr Mahathir Mohamad claimed. The former prime minister reportedly said '\n", " 'the PKR president needs someone like himself in order to acquire support '\n", " 'from the Malays and win the election.')\n" ] } ], "source": [ "# https://www.malaymail.com/news/malaysia/2020/07/01/dr-mahathir-again-claims-anwar-lacks-popularity-with-malays-to-be-pakatans/1880420\n", "\n", "string_news1 = 'KUALA LUMPUR, July 1 - Datuk Seri Anwar Ibrahim is not suitable to as the prime minister candidate as he is allegedly not \"popular\" among the Malays, Tun Dr Mahathir Mohamad claimed. The former prime minister reportedly said the PKR president needs someone like himself in order to acquire support from the Malays and win the election.'\n", "pprint(string_news1)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('(CNN)New York Attorney General Letitia James on Monday ordered the Black '\n", " 'Lives Matter Foundation -- which she said is not affiliated with the larger '\n", " 'Black Lives Matter movement -- to stop collecting donations in New York. \"I '\n", " 'ordered the Black Lives Matter Foundation to stop illegally accepting '\n", " 'donations that were intended for the #BlackLivesMatter movement. 
This '\n", " 'foundation is not affiliated with the movement, yet it accepted countless '\n", " 'donations and deceived goodwill,\" James tweeted.')\n" ] } ], "source": [ "# https://edition.cnn.com/2020/07/06/politics/new-york-attorney-general-blm/index.html\n", "\n", "string_news2 = '(CNN)New York Attorney General Letitia James on Monday ordered the Black Lives Matter Foundation -- which she said is not affiliated with the larger Black Lives Matter movement -- to stop collecting donations in New York. \"I ordered the Black Lives Matter Foundation to stop illegally accepting donations that were intended for the #BlackLivesMatter movement. This foundation is not affiliated with the movement, yet it accepted countless donations and deceived goodwill,\" James tweeted.'\n", "pprint(string_news2)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('Amongst the wide-ranging initiatives proposed are a sustainable food '\n", " 'labelling framework, a reformulation of processed foods, and a '\n", " 'sustainability chapter in all EU bilateral trade agreements. The EU also '\n", " 'plans to publish a proposal for a legislative framework for sustainable food '\n", " 'systems by 2023 to ensure all foods on the EU market become increasingly '\n", " 'sustainable.')\n" ] } ], "source": [ "# https://www.thestar.com.my/business/business-news/2020/07/04/malaysia-worries-new-eu-food-rules-could-hurt-palm-oil-exports\n", "\n", "string_news3 = 'Amongst the wide-ranging initiatives proposed are a sustainable food labelling framework, a reformulation of processed foods, and a sustainability chapter in all EU bilateral trade agreements. 
The EU also plans to publish a proposal for a legislative framework for sustainable food systems by 2023 to ensure all foods on the EU market become increasingly sustainable.'\n", "pprint(string_news3)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('This page shares my best articles to read on topics like health, happiness, '\n", " 'creativity, productivity and more. The central question that drives my work '\n", " 'is, “How can we live better?” To answer that question, I like to write about '\n", " 'science-based ways to solve practical problems.')\n" ] } ], "source": [ "# https://jamesclear.com/articles\n", "\n", "string_article1 = 'This page shares my best articles to read on topics like health, happiness, creativity, productivity and more. The central question that drives my work is, “How can we live better?” To answer that question, I like to write about science-based ways to solve practical problems.'\n", "pprint(string_article1)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "['KUALA LUMPUR, 1 Julai - Datuk Seri Anwar Ibrahim tidak sesuai sebagai calon '\n", " 'perdana menteri kerana kononnya tidak \"popular\" di kalangan orang Melayu, '\n", " 'Tun Dr Mahathir Mohamad mendakwa. 
Bekas perdana menteri dilaporkan berkata '\n", " 'presiden PKR memerlukan seseorang seperti dirinya demi memperoleh sokongan '\n", " 'daripada orang Melayu dan memenangi pilihan raya.',\n", " '(CNN) Peguam Negara New York Letitia James pada hari Isnin mengarahkan '\n", " 'Yayasan Black Lives Matter -- yang katanya tidak bergabung dengan gerakan '\n", " 'Black Lives Matter yang lebih besar -- untuk berhenti mengutip sumbangan di '\n", " 'New York. \"Saya mengarahkan Yayasan Black Lives Matter untuk berhenti secara '\n", " 'haram menerima sumbangan yang ditujukan untuk gerakan #BlackLivesMatter. '\n", " 'Yayasan ini tidak bergabung dengan gerakan tersebut, namun menerima '\n", " 'sumbangan yang tidak terkira dan tertipu muhibah,\" James tweet.',\n", " 'Antara inisiatif yang luas dicadang adalah rangka label makanan lestari, '\n", " 'reformasi makanan yang diproses, dan bab kelestarian dalam semua perjanjian '\n", " 'perdagangan dua hala EU. EU juga bercadang untuk menerbitkan cadangan rangka '\n", " 'perundangan bagi sistem makanan lestari menjelang 2023 bagi memastikan semua '\n", " 'makanan di pasaran EU menjadi semakin mampan.']\n", "CPU times: user 15.9 s, sys: 2.17 ms, total: 15.9 s\n", "Wall time: 1.35 s\n" ] } ], "source": [ "%%time\n", "\n", "pprint(transformer_noisy.generate([string_news1, string_news2, string_news3],\n", " max_length = 1000))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Compare results using local language structure" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "strings = [\n", " 'u ni, talk properly lah',\n", " \"just attended my cousin's wedding. pelik jugak dia buat majlis biasa2 je sebab her lifestyle looks lavish. then i found out they're going on a 3 weeks honeymoon. smart decision 👍\",\n", " 'Me after seeing this video: mm dapnya burger benjo extra mayo',\n", " 'Hi guys! I noticed semalam & harini dah ramai yang dapat cookies ni kan. 
So harini i nak share some post mortem of our first batch:',\n", "]" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# do not forget to inject this initial prompt!\n", "\n", "transformer_noisy._initial_text = 'terjemah pasar Melayu ke Melayu: '" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 5.92 s, sys: 820 µs, total: 5.92 s\n", "Wall time: 499 ms\n" ] }, { "data": { "text/plain": [ "['ini awak, bercakap dengan betul',\n", " 'baru menghadiri majlis perkahwinan sepupu saya. Peliknya dia buat majlis biasa-biasa saja sebab gaya hidup dia nampak mewah. kemudian saya mendapat tahu mereka sedang berbulan madu selama 3 minggu. keputusan yang bijak ',\n", " 'Selepas menonton video ini: burger bejo tambahan mayo memang sedap',\n", " 'Hai kawan-kawan! Saya perasan semalam & hari ini ramai yang dapat biskut kan? Jadi hari ini saya ingin berkongsi beberapa post mortem batch pertama kami:']" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "transformer_noisy.generate(strings, max_length = 1000)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 7.32 s, sys: 4.04 ms, total: 7.32 s\n", "Wall time: 616 ms\n" ] }, { "data": { "text/plain": [ "['u ni, bercakap dengan betul',\n", " 'baru sahaja menghadiri majlis perkahwinan sepupu saya. pelik jugak dia buat majlis biasa2 je kerana gaya hidupnya kelihatan mewah. maka saya mendapat tahu bahawa mereka akan berbulan madu selama 3 minggu. 
keputusan pintar ',\n", " 'Saya selepas melihat video ini: mm dapnya burger benjo extra mayo',\n", " 'Hai kawan-kawan! Saya perhatikan semalam & harini dah banyak yang dapat kuki ni kan. Jadi harini saya ingin berkongsi beberapa post mortem kumpulan pertama kami:']" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "transformer.generate(strings, max_length = 1000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Compare with Google Translate using googletrans\n", "\n", "Install it with:\n", "\n", "```bash\n", "pip3 install googletrans==4.0.0rc1\n", "```" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "from googletrans import Translator\n", "\n", "translator = Translator()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "u ni, bercakap dengan betul lah\n", "Baru sahaja menghadiri majlis perkahwinan sepupu saya.Pelik Jugak Dia Buat Majlis Biasa2 Je Sebab Gaya Hidupnya kelihatan mewah.Kemudian saya dapati mereka akan berbulan madu selama 3 minggu.Keputusan Pintar 👍\n", "Saya setelah melihat video ini: mm dapnya burger benjo tambahan mayo\n", "Hai semua!Saya perhatikan Semalam & Harini Dah Ramai Yang Dapate Cookies Ni Kan.Jadi harini i nak berkongsi beberapa bedah siasat kumpulan pertama kami:\n" ] } ], "source": [ "for t in strings:\n", " r = translator.translate(t, src='en', dest = 'ms')\n", " print(r.text)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { 
"delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }