{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Jawi-to-Rumi" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/jawi-rumi](https://github.com/huseinzol05/Malaya/tree/master/example/jawi-rumi).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module was trained heavily on news and Wikipedia datasets.\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Explanation\n", "\n", "Originally from https://www.ejawi.net/converterV2.php?go=rumi able to convert Rumi to Jawi using heuristic method. So Malaya convert from heuristic and map it using deep learning model by inverse the dataset.\n", "\n", "`چوميل` -> `comel`" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import logging\n", "\n", "logging.basicConfig(level=logging.INFO)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:numexpr.utils:NumExpr defaulting to 8 threads.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 5.71 s, sys: 1.06 s, total: 6.77 s\n", "Wall time: 7.65 s\n" ] } ], "source": [ "%%time\n", "import malaya" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available Transformer model" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.jawi_rumi:tested on first 10k Jawi-Rumi test set, dataset at https://huggingface.co/datasets/mesolitica/rumi-jawi\n" ] }, { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>Size (MB)</th><th>Quantized Size (MB)</th><th>CER</th><th>WER</th><th>Suggested length</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>small</th><td>42.7</td><td>13.1</td><td>0.004477</td><td>0.013642</td><td>256.0</td></tr>\n", "<tr><th>base</th><td>234.0</td><td>63.8</td><td>0.000764</td><td>0.003042</td><td>256.0</td></tr>\n", "</tbody>\n", "</table>
" ], "text/plain": [ " Size (MB) Quantized Size (MB) CER WER Suggested length\n", "small 42.7 13.1 0.004477 0.013642 256.0\n", "base 234.0 63.8 0.000764 0.003042 256.0" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.jawi_rumi.available_transformer()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Transformer model\n", "\n", "```python\n", "def transformer(model='base', quantized: bool = False, **kwargs):\n", " \"\"\"\n", " Load transformer encoder-decoder model to convert jawi to rumi.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='base')\n", " Model architecture supported. Allowed values:\n", "\n", " * ``'small'`` - Transformer SMALL parameters.\n", " * ``'base'`` - Transformer BASE parameters.\n", "\n", " quantized : bool, optional (default=False)\n", " if True, will load 8-bit quantized model.\n", " Quantized model not necessary faster, totally depends on the machine.\n", "\n", " Returns\n", " -------\n", " result: malaya.model.tf.JawiRumi class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "model = malaya.jawi_rumi.transformer()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Quantized model\n", "\n", "To load 8-bit quantized model, simply pass `quantized = True`, default is `False`.\n", "\n", "We can expect slightly accuracy drop from quantized model, and not necessary faster than normal 32-bit float model, totally depends on machine." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "quantized_model = malaya.jawi_rumi.transformer(quantized = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Predict using greedy decoder\n", "\n", "```python\n", "def greedy_decoder(self, strings: List[str]):\n", " \"\"\"\n", " Convert list of jawi strings to rumi strings.\n", " 'ايسو بيل تنب دباوا ك كابينيت - صيفالدين' -> 'isu bil tnb dibawa ke kabinet - saifuddin'\n", "\n", " Parameters\n", " ----------\n", " strings : List[str]\n", "\n", " Returns\n", " -------\n", " result: List[str]\n", " \"\"\"\n", " return self._greedy_decoder(strings)\n", "```" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "strings = ['د لابوان ابفچ.',\n", " 'سبلومڽ لبيه باڽق جوملهڽ.',\n", " 'دان ممرلوكن ڤمبلاان.',\n", " 'يڠ لاين.',\n", " 'كريتا ڤروندا درڤد بالاي ڤوليس باچوق.',\n", " 'سلڤس ٢٨ ڤوسيڠن رونديڠن دان ١٨ مشوارت منتري سلاما كيرا-كيرا توجوه تاهون، رونديڠن ايت',\n", " 'ڤنجاڬ ڤرلو فهم دان اد علمو اوروس ورڬ امس، ايلق ڽاڽوق لبيه تروق.',\n", " 'ڬوندڠ اداله تيدق بنر، كات كمنترين ڤرتانين دان ايندوستري اساس تاني ﴿موا﴾.',\n", " 'بلياو ﴿ازهم﴾ داتڠ ك فام ڤد خميس لڤس برجومڤا دڠن ستياءوسها اڬوڠ فام ﴿ستوارت راماليڠام﴾ سلڤس ايت كلوار دڠن كڽاتأن',\n", " 'يڠ توروت حاضر، تيمبالن ڤردان منتري، داتوق سري در وان عزيزه وان اسماعيل دان منتري كابينيت.']" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['di labuan ibfc.',\n", " 'sebelumnya lebih banyak jumlahnya.',\n", " 'dan memerlukan pembelaan. 
dan memerlukan pembelaan.',\n", " 'yang lain.',\n", " 'kereta peronda daripada balai polis bachok.',\n", " 'selepas 28 pusingan rundingan dan 18 mesyuarat menteri selama kira-kira tujuh tahun, rundingan itu',\n", " 'penjaga perlu faham dan ada ilmu urus warga emas, elak nyanyuk lebih teruk.',\n", " 'gondang adalah tidak benar, kata kementerian pertanian dan industri asas tani (moa).',\n", " 'beliau (izham) datang ke fam pada khamis lepas berjumpa dengan setiausaha agung fam (stuart ramalingam) selepas itu keluar dengan kenyataan',\n", " 'yang turut hadir, timbalan perdana menteri, datuk seri dr wan azizah wan ismail dan menteri kabinet.']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.greedy_decoder(strings)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "['di labuan ibfc.',\n", " 'sebelumnya lebih banyak jumlahnya.',\n", " 'dan memerlukan pembelaan. dan memerlukan pembelaan.',\n", " 'yang lain.',\n", " 'kereta peronda daripada balai polis bachok.',\n", " 'selepas 28 pusingan rundingan dan 18 mesyuarat menteri selama kira-kira tujuh tahun, rundingan itu',\n", " 'penjaga perlu faham dan ada ilmu urus warga emas, elak nyanyuk lebih teruk.',\n", " 'gondang adalah tidak benar, kata kementerian pertanian dan industri asas tani (moa).',\n", " 'beliau (izham) datang ke fam pada khamis lepas berjumpa dengan setiausaha agung fam (stuart ramalingam) selepas itu keluar dengan kenyataan',\n", " 'yang turut hadir, timbalan perdana menteri, datuk seri dr wan azizah wan ismail dan menteri kabinet.']" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "quantized_model.greedy_decoder(strings)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "string = 'ساي سوك ماكن ايم'" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { 
"text/plain": [ "['saya suka makan ayam']" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.greedy_decoder([string])" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['saya suka makan ayam']" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "quantized_model.greedy_decoder([string])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }