{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Spelling Correction using probability LM" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/spelling-correction-probability-lm](https://github.com/huseinzol05/Malaya/tree/master/example/spelling-correction-probability-lm).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This spelling correction extends the functionality of the Peter Norvig's spell-corrector in http://norvig.com/spell-correct.html with KenLM language model.\n", "\n", "And improve it using some algorithms from Normalization of noisy texts in Malaysian online reviews,\n", "https://www.researchgate.net/publication/287050449_Normalization_of_noisy_texts_in_Malaysian_online_reviews\n", "\n", "Also added custom vowels augmentation." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "os.environ['CUDA_VISIBLE_DEVICES'] = ''\n", "os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import logging\n", "\n", "logging.basicConfig(level=logging.INFO)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.\n", " warn(\"The installed version of bitsandbytes was compiled without GPU support. 
\"\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmphlkljyxm\n", "INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmphlkljyxm/_remote_module_non_scriptable.py\n", "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n", "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n" ] } ], "source": [ "import malaya" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# some text examples copied from Twitter\n", "\n", "string1 = 'krajaan patut bagi pencen awal skt kpd warga emas supaya emosi'\n", "string2 = 'Husein ska mkn aym dkat kampng Jawa'\n", "string3 = 'Melayu malas ni narration dia sama je macam men are trash. True to some, false to some.'\n", "string4 = 'Tapi tak pikir ke bahaya perpetuate myths camtu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tau pulak marah. Your kids will be victims of that too.'\n", "string5 = 'DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as i am edging towards retirement in 4-5 years time after a career of being an Engineer, Project Manager, General Manager'\n", "string6 = 'blh bntg dlm kls nlp sy, nnti intch'\n", "string7 = 'mulakn slh org boleh ,bila geng tuh kena slhkn jgk xboleh trima .. 
pelik'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load probability model\n", "\n", "```python\n", "def load(\n", "    language_model=None,\n", "    sentence_piece: bool = False,\n", "    stemmer=None,\n", "    **kwargs,\n", "):\n", "    \"\"\"\n", "    Load a Probability Spell Corrector.\n", "\n", "    Parameters\n", "    ----------\n", "    language_model: Callable, optional (default=None)\n", "        If not None, must be an object with a `score` method.\n", "    sentence_piece: bool, optional (default=False)\n", "        if True, reduce possible augmentation states using sentence piece.\n", "    stemmer: Callable, optional (default=None)\n", "        a Callable object, must have a `stem_word` method.\n", "\n", "    Returns\n", "    -------\n", "    result: model\n", "        List of model classes:\n", "\n", "        * if passed `language_model` will return `malaya.spelling_correction.probability.ProbabilityLM`.\n", "        * else will return `malaya.spelling_correction.probability.Probability`.\n", "    \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lm = malaya.language_model.kenlm()\n", "lm" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya_boilerplate.huggingface:downloading frozen huseinzol05/v27-preprocessing/bm_1grams.json\n" ] } ], "source": [ "model = malaya.spelling_correction.probability.load(language_model = lm)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### List the generated pool of candidate words\n", "\n", "```python\n", "def edit_candidates(self, word):\n", "    \"\"\"\n", "    Generate candidates given a word.\n", "\n", "    Parameters\n", "    ----------\n", "    word: str\n", "\n", "    Returns\n", "    -------\n", "    result: List[str]\n", "    \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { 
"text/plain": [ "['mahathir']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.edit_candidates('mhthir')" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['sumbing',\n", " 'sombong',\n", " 'sembing',\n", " 'simbang',\n", " 'sambang',\n", " 'sumbang',\n", " 'sembang',\n", " 'sambong',\n", " 'sembung',\n", " 'sembong',\n", " 'sambung']" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.edit_candidates('smbng')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### To correct a word\n", "\n", "```python\n", "def correct(\n", " self,\n", " word: str,\n", " string: List[str],\n", " index: int = -1,\n", " lookback: int = 3,\n", " lookforward: int = 3,\n", "):\n", " \"\"\"\n", " Correct a word within a text, returning the corrected word.\n", "\n", " Parameters\n", " ----------\n", " word: str\n", " string: str\n", " Entire string, `word` must a word inside `string`.\n", " index: int, optional (default=-1)\n", " index of word in the string, if -1, will try to use `string.index(word)`.\n", " lookback: int, optional (default=3)\n", " N left hand side words.\n", " lookforward: int, optional (default=3)\n", " N right hand side words.\n", "\n", " Returns\n", " -------\n", " result: str\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'kpd'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "splitted = string1.split()\n", "model.correct('kpd', splitted)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "'kerajaan'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.correct('krajaan', splitted)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, 
"outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 6.05 ms, sys: 0 ns, total: 6.05 ms\n", "Wall time: 5.92 ms\n" ] }, { "data": { "text/plain": [ "'sikit'" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "model.correct('skt', splitted, )" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4.25 ms, sys: 341 µs, total: 4.59 ms\n", "Wall time: 4.43 ms\n" ] }, { "data": { "text/plain": [ "'sikit'" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "model.correct('skt', splitted, lookback = -1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### To correct a sentence\n", "\n", "```python\n", "def correct_text(\n", " self,\n", " text: str,\n", " lookback: int = 3,\n", " lookforward: int = 3,\n", "):\n", " \"\"\"\n", " Correct all the words within a text, returning the corrected text.\n", "\n", " Parameters\n", " ----------\n", " text: str\n", " lookback: int, optional (default=3)\n", " N words on the left hand side.\n", " if put -1, will take all words on the left hand side.\n", " longer left hand side will take longer to compute.\n", " lookforward: int, optional (default=3)\n", " N words on the right hand side.\n", " if put -1, will take all words on the right hand side.\n", " longer right hand side will take longer to compute.\n", "\n", " Returns\n", " -------\n", " result: str\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'kerajaan patut bagi pencen awal sikit kpd warga emas supaya emosi'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.correct_text(string1)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "tokenizer = 
malaya.tokenizer.Tokenizer()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Husin ska makan ayam dekat kampung Jawa'" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenized = tokenizer.tokenize(string2)\n", "model.correct_text(' '.join(tokenized))" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Melayu malas ni narration dia sama je macam men are trash . True to some , false to some .'" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenized = tokenizer.tokenize(string3)\n", "model.correct_text(' '.join(tokenized))" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Tapi tak pikir ke bahaya perpetuate myths camtu . Nanti kalau ada hiring discrimination despite your good qualifications because of your race tau pulak marah . 
Your kids will be victims of that too .'" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenized = tokenizer.tokenize(string4)\n", "model.correct_text(' '.join(tokenized))" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'DrM cerita Melayu malas semenjak saya kat University ( early 1980s ) and now has i am edging towards retirement ini 4 - 5 years time after a career of being ini Engineer , Project Manager , General Manager'" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenized = tokenizer.tokenize(string5)\n", "model.correct_text(' '.join(tokenized))" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'blh bintang dlm kelas nlp saya , nnti intch'" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenized = tokenizer.tokenize(string6)\n", "model.correct_text(' '.join(tokenized))" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'mulakan slh org boleh , bila geng tuh kena salahkan jgk xboleh trima . . pelik'" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenized = tokenizer.tokenize(string7)\n", "model.correct_text(' '.join(tokenized))" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "s = 'mulakn slh org boleh ,bila geng tuh kena slhkn jgk xboleh trima .. pelik , dia slhkn org bole hri2 crta sakau then bila kna bls balik xdpt jwb ,kata mcm biasa slh (parti sampah) 🤣🤣🤣 jgn mulakn dlu slhkn org kalau xboleh trima bila kna bls balik 🤣🤣🤣'" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'mulakan slh org boleh , bila geng tuh kena salahkan jgk xboleh trima . . 
pelik , dia salahkan org bole hri2 cerita sakau then bila kena bilas balik xdpt jwb , kata mcm biasa slh ( parti sampah ) 🤣 🤣 🤣 jgn mulakan dlu salahkan org kalau xboleh trima bila kena bilas balik 🤣 🤣 🤣'" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenized = tokenizer.tokenize(s)\n", "model.correct_text(' '.join(tokenized))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load stemmer for probability model\n", "\n", "By default, affixes (kata imbuhan) are captured using a naive regex pattern that does not understand word structure. The problem with this approach is that many rules have to be hardcoded, so we can instead use a better stemmer model such as `malaya.stem.huggingface()`." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya_boilerplate.huggingface:downloading frozen mesolitica/stem-lstm-512/model.pt\n", "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n" ] } ], "source": [ "stemmer = malaya.stem.huggingface()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya_boilerplate.huggingface:downloading frozen huseinzol05/v27-preprocessing/bm_1grams.json\n" ] } ], "source": [ "model_stemmer = malaya.spelling_correction.probability.load(language_model = lm, stemmer = stemmer)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "spaces_between_special_tokens is deprecated and will be removed in transformers v5. It was adding spaces between `added_tokens`, not special tokens, and does not exist in our fast implementation. 
Future tokenizers will handle the decoding process on a per-model rule.\n" ] }, { "data": { "text/plain": [ "'mulakan slh org boleh , bila geng tuh kena salahkan jgk xboleh trima . . pelik'" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenized = tokenizer.tokenize(string7)\n", "model_stemmer.correct_text(' '.join(tokenized))" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "s = 'mulakn slh org boleh ,bila geng tuh kena slhkn jgk xboleh trima .. pelik , dia slhkn org bole hri2 crta sakau then bila kna bls balik xdpt jwb ,kata mcm biasa slh (parti sampah) 🤣🤣🤣 jgn mulakn dlu slhkn org kalau xboleh trima bila kna bls balik 🤣🤣🤣'" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'mulakan slh org boleh , bila geng tuh kena salahkan jgk xboleh trima . . pelik , dia salahkan org bole hri2 cerita sakau then bila kena bilas balik xdpt jwb , kata mcm biasa slh ( parti sampah ) 🤣 🤣 🤣 jgn mulakan dlu salahkan org kalau xboleh trima bila kena bilas balik 🤣 🤣 🤣'" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenized = tokenizer.tokenize(s)\n", "model_stemmer.correct_text(' '.join(tokenized))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { 
"delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }