{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Compare LM on Spelling Correction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/compare-lm-spelling-correction](https://github.com/huseinzol05/Malaya/tree/master/example/compare-lm-spelling-correction).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Malaya got 3 different LM models,\n", "\n", "1. KenLM\n", "2. GPT2\n", "3. Masked LM\n", "\n", "So we are going to compare the spelling correction results." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "os.environ['CUDA_VISIBLE_DEVICES'] = ''\n", "os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.\n", " warn(\"The installed version of bitsandbytes was compiled without GPU support. \"\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n", "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n" ] } ], "source": [ "import malaya" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# some text examples copied from Twitter\n", "\n", "string1 = 'krajaan patut bagi pencen awal skt kpd warga emas supaya emosi'\n", "string2 = 'Husein ska mkn aym dkat kampng Jawa'\n", "string3 = 'Melayu malas ni narration dia sama je macam men are trash. True to some, false to some.'\n", "string4 = 'Tapi tak pikir ke bahaya perpetuate myths camtu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tau pulak marah. Your kids will be victims of that too.'\n", "string5 = 'DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as i am edging towards retirement in 4-5 years time after a career of being an Engineer, Project Manager, General Manager'\n", "string6 = 'blh bntg dlm kls nlp sy, nnti intch'\n", "string7 = 'mulakn slh org boleh ,bila geng tuh kena slhkn jgk xboleh trima .. pelik'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load probability model\n", "\n", "```python\n", "def load(\n", " language_model=None,\n", " sentence_piece: bool = False,\n", " stemmer=None,\n", " **kwargs,\n", "):\n", " \"\"\"\n", " Load a Probability Spell Corrector.\n", "\n", " Parameters\n", " ----------\n", " language_model: Callable, optional (default=None)\n", " If not None, must an object with `score` method.\n", " sentence_piece: bool, optional (default=False)\n", " if True, reduce possible augmentation states using sentence piece.\n", " stemmer: Callable, optional (default=None)\n", " a Callable object, must have `stem_word` method.\n", "\n", " Returns\n", " -------\n", " result: model\n", " List of model classes:\n", "\n", " * if passed `language_model` will return `malaya.spelling_correction.probability.ProbabilityLM`.\n", " * else will return `malaya.spelling_correction.probability.Probability`.\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kenlm = malaya.language_model.kenlm()\n", "kenlm" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'mesolitica/bert-base-standard-bahasa-cased': {'Size (MB)': 310},\n", " 'mesolitica/bert-tiny-standard-bahasa-cased': {'Size (MB)': 66.1},\n", " 'mesolitica/roberta-base-standard-bahasa-cased': {'Size (MB)': 443},\n", " 'mesolitica/roberta-tiny-standard-bahasa-cased': {'Size (MB)': 66.1}}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.language_model.available_mlm" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`, it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again. You will see the new `added_tokens_decoder` attribute that will store the relevant information.\n" ] } ], "source": [ "bert_base = malaya.language_model.mlm(model = 'mesolitica/bert-base-standard-bahasa-cased')\n", "roberta_base = malaya.language_model.mlm(model = 'mesolitica/roberta-base-standard-bahasa-cased')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'mesolitica/gpt2-117m-bahasa-cased': {'Size (MB)': 454}}" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.language_model.available_gpt2" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 32001. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc\n" ] } ], "source": [ "gpt2 = malaya.language_model.gpt2(model = 'mesolitica/gpt2-117m-bahasa-cased')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "model_kenlm = malaya.spelling_correction.probability.load(language_model = kenlm)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "model_bert_base = malaya.spelling_correction.probability.load(language_model = bert_base)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "model_roberta_base = malaya.spelling_correction.probability.load(language_model = roberta_base)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "model_gpt2 = malaya.spelling_correction.probability.load(language_model = gpt2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### To correct a sentence\n", "\n", "```python\n", "def correct_text(\n", " self,\n", " text: str,\n", " lookback: int = 3,\n", " lookforward: int = 3,\n", "):\n", " \"\"\"\n", " Correct all the words within a text, returning the corrected text.\n", "\n", " Parameters\n", " ----------\n", " text: str\n", " lookback: int, optional (default=3)\n", " N words on the left hand side.\n", " if put -1, will take all words on the left hand side.\n", " longer left hand side will take longer to compute.\n", " lookforward: int, optional (default=3)\n", " N words on the right hand side.\n", " if put -1, will take all words on the right hand side.\n", " longer right hand side will take longer to compute.\n", "\n", " Returns\n", " -------\n", " result: str\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "strings = [string1, string2, string3, string4, string5, string6, string7]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "tokenizer = malaya.tokenizer.Tokenizer()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "original: krajaan patut bagi pencen awal skt kpd warga emas supaya emosi\n", "corrected: kerajaan patut bagi pencen awal sikit kpd warga emas supaya emosi\n", "\n", "original: Husein ska mkn aym dkat kampng Jawa\n", "corrected: Husin ska makan ayam dekat kampung Jawa\n", "\n", "original: Melayu malas ni narration dia sama je macam men are trash. True to some, false to some.\n", "corrected: Melayu malas ni narration dia sama je macam men are trash . True to some , false to some .\n", "\n", "original: Tapi tak pikir ke bahaya perpetuate myths camtu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tau pulak marah. Your kids will be victims of that too.\n", "corrected: Tapi tak pikir ke bahaya perpetuate myths camtu . Nanti kalau ada hiring discrimination despite your good qualifications because of your race tau pulak marah . Your kids will be victims of that too .\n", "\n", "original: DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as i am edging towards retirement in 4-5 years time after a career of being an Engineer, Project Manager, General Manager\n", "corrected: DrM cerita Melayu malas semenjak saya kat University ( early 1980s ) and now has i am edging towards retirement ini 4 - 5 years time after a career of being ini Engineer , Project Manager , General Manager\n", "\n", "original: blh bntg dlm kls nlp sy, nnti intch\n", "corrected: blh bintang dlm kelas nlp saya , nnti intch\n", "\n", "original: mulakn slh org boleh ,bila geng tuh kena slhkn jgk xboleh trima .. pelik\n", "corrected: mulakan slh org boleh , bila geng tuh kena salahkan jgk xboleh trima . . pelik\n", "\n" ] } ], "source": [ "for s in strings:\n", " tokenized = tokenizer.tokenize(s)\n", " print('original:', s)\n", " print('corrected:', model_kenlm.correct_text(' '.join(tokenized)))\n", " print()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "original: krajaan patut bagi pencen awal skt kpd warga emas supaya emosi\n", "corrected: kerajaan patut bagi pencen awal sikit kpd warga emas supaya emosi\n", "\n", "original: Husein ska mkn aym dkat kampng Jawa\n", "corrected: Husin ska mkn ayam dekat kampung Jawa\n", "\n", "original: Melayu malas ni narration dia sama je macam men are trash. True to some, false to some.\n", "corrected: Melayu malas ni narration dia sama je macam men are trash . True to some , false to some .\n", "\n", "original: Tapi tak pikir ke bahaya perpetuate myths camtu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tau pulak marah. Your kids will be victims of that too.\n", "corrected: Tapi tak pikir ke bahaya perpetuate myths camtu . Nanti kalau ada hiring discrimination despite your good qualifications because of your race tau pulak marah . Your kids will be victims of that too .\n", "\n", "original: DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as i am edging towards retirement in 4-5 years time after a career of being an Engineer, Project Manager, General Manager\n", "corrected: DrM cerita Melayu malas semenjak saya kat University ( early 1980s ) and now has i am edging towards retirement vin 4 - 5 years time after a career of being ini Engineer , Project Manager , General Manager\n", "\n", "original: blh bntg dlm kls nlp sy, nnti intch\n", "corrected: blh bantang dlm kelas nlp sya , nnti intch\n", "\n", "original: mulakn slh org boleh ,bila geng tuh kena slhkn jgk xboleh trima .. pelik\n", "corrected: mulakan slh org boleh , bila geng tuh kena salahkan jgk xboleh trima . . pelik\n", "\n" ] } ], "source": [ "for s in strings:\n", " tokenized = tokenizer.tokenize(s)\n", " print('original:', s)\n", " print('corrected:', model_bert_base.correct_text(' '.join(tokenized)))\n", " print()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "original: krajaan patut bagi pencen awal skt kpd warga emas supaya emosi\n", "corrected: kerjaan patut bagi pencen awal sikit kpd warga emas supaya emosi\n", "\n", "original: Husein ska mkn aym dkat kampng Jawa\n", "corrected: Hussein ska mkn ayam dekat kampung Jawa\n", "\n", "original: Melayu malas ni narration dia sama je macam men are trash. True to some, false to some.\n", "corrected: Melayu malas ni narration dia sama je macam men are trash . True to some , false to some .\n", "\n", "original: Tapi tak pikir ke bahaya perpetuate myths camtu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tau pulak marah. Your kids will be victims of that too.\n", "corrected: Tapi tak pikir ke bahaya perpetuate myths camtu . Nanti kalau ada hiring discrimination despite your good qualifications because of your race tau pulak marah . Your kids will be victims of that too .\n", "\n", "original: DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as i am edging towards retirement in 4-5 years time after a career of being an Engineer, Project Manager, General Manager\n", "corrected: DrM cerita Melayu malas semenjak saya kat University ( early 1980s ) and now has i am edging towards retirement ini 4 - 5 years time after a career of being ini Engineer , Project Manager , General Manager\n", "\n", "original: blh bntg dlm kls nlp sy, nnti intch\n", "corrected: blh bentang dlm kelas nlp saya , nnti intch\n", "\n", "original: mulakn slh org boleh ,bila geng tuh kena slhkn jgk xboleh trima .. pelik\n", "corrected: mulakan slh org boleh , bila geng tuh kena salahkan jgk xboleh trima . . pelik\n", "\n" ] } ], "source": [ "for s in strings:\n", " tokenized = tokenizer.tokenize(s)\n", " print('original:', s)\n", " print('corrected:', model_roberta_base.correct_text(' '.join(tokenized)))\n", " print()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "original: krajaan patut bagi pencen awal skt kpd warga emas supaya emosi\n", "corrected: kerajaan patut bagi pencen awal sikit kpd warga emas supaya emosi\n", "\n", "original: Husein ska mkn aym dkat kampng Jawa\n", "corrected: Husen ska mkn ayam dekat kampung Jawa\n", "\n", "original: Melayu malas ni narration dia sama je macam men are trash. True to some, false to some.\n", "corrected: Melayu malas ni narration dia sama je macam men are trash . True to some , false to some .\n", "\n", "original: Tapi tak pikir ke bahaya perpetuate myths camtu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tau pulak marah. Your kids will be victims of that too.\n", "corrected: Tapi tak pikir ke bahaya perpetuate myths camtu . Nanti kalau ada hiring discrimination despite your good qualifications because of your race tau pulak marah . Your kids will be victims of that too .\n", "\n", "original: DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as i am edging towards retirement in 4-5 years time after a career of being an Engineer, Project Manager, General Manager\n", "corrected: DrM cerita Melayu malas semenjak saya kat University ( early 1980s ) and now has i am edging towards retirement ini 4 - 5 years time after a career of being ane Engineer , Project Manager , General Manager\n", "\n", "original: blh bntg dlm kls nlp sy, nnti intch\n", "corrected: blh binatang dlm kelas nlp saya , nnti intch\n", "\n", "original: mulakn slh org boleh ,bila geng tuh kena slhkn jgk xboleh trima .. pelik\n", "corrected: mulakan slh org boleh , bila geng tuh kena salahkan jgk xboleh trima . . pelik\n", "\n" ] } ], "source": [ "for s in strings:\n", " tokenized = tokenizer.tokenize(s)\n", " print('original:', s)\n", " print('corrected:', model_gpt2.correct_text(' '.join(tokenized)))\n", " print()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }