{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Spelling Correction using encoder Transformer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/spelling-correction-encoder-transformer](https://github.com/huseinzol05/Malaya/tree/master/example/spelling-correction-encoder-transformer).\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import logging\n", "\n", "logging.basicConfig(level=logging.INFO)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:numexpr.utils:NumExpr defaulting to 8 threads.\n" ] } ], "source": [ "import malaya" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# some text examples copied from Twitter\n", "\n", "string1 = 'krajaan patut bagi pencen awal skt kpd warga emas supaya emosi'\n", "string2 = 'Husein ska mkn aym dkat kampng Jawa'\n", "string3 = 'Melayu malas ni narration dia sama je macam men are trash. True to some, false to some.'\n", "string4 = 'Tapi tak pikir ke bahaya perpetuate myths camtu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tau pulak marah. Your kids will be victims of that too.'\n", "string5 = 'DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as i am edging towards retirement in 4-5 years time after a career of being an Engineer, Project Manager, General Manager'\n", "string6 = 'blh bntg dlm kls nlp sy, nnti intch'\n", "string7 = 'mulakn slh org boleh ,bila geng tuh kena slhkn jgk xboleh trima .. pelik'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Encoder transformer speller\n", "\n", "This spelling correction is a transformer based, improvement version of `malaya.spelling_correction.probability.Probability`. 
The problem with `malaya.spelling_correction.probability.Probability` is that it naively picks the word with the highest probability based on public sentences (wiki, news and social media), without understanding the actual context. For example,\n", "\n", "```python\n", "string = 'krajaan patut bagi pencen awal skt kpd warga emas supaya emosi'\n", "prob_corrector = malaya.spelling_correction.probability.load()\n", "prob_corrector.correct_text(string)\n", "-> 'kerajaan patut bagi pencen awal sakit kepada warga emas supaya emosi'\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It should have replaced `skt` with `sikit`, a common word people use on social media to give a little bit of attention to `pencen`. So, to fix that, we can use a Transformer model! \n", "\n", "**Right now the transformer speller only supports `BERT`, `ALBERT` and `ELECTRA`**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```python\n", "def encoder(model, sentence_piece: bool = False, **kwargs):\n", " \"\"\"\n", " Load a Transformer Encoder Spell Corrector. 
Right now only supported BERT, ALBERT and ELECTRA.\n", "\n", " Parameters\n", " ----------\n", " sentence_piece: bool, optional (default=False)\n", " if True, reduce possible augmentation states using sentence piece.\n", "\n", " Returns\n", " -------\n", " result: malaya.spelling_correction.transformer.Transformer class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya_boilerplate.huggingface:downloading frozen huseinzol05/v34-pretrained-model/electra-base.tar.gz\n", "WARNING:tensorflow:From /Users/huseinzolkepli/Documents/malaya/malaya/transformers/electra/modeling.py:242: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use keras.layers.Dense instead.\n", "WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_core/python/layers/core.py:187: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Please use `layer.__call__` method instead.\n", "WARNING:tensorflow:From /Users/huseinzolkepli/Documents/malaya/malaya/transformers/sampling.py:26: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use tf.where in 2.0, which has the same broadcast rule as np.where\n", "WARNING:tensorflow:From /Users/huseinzolkepli/Documents/malaya/malaya/transformers/electra/__init__.py:120: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use `tf.random.categorical` instead.\n", "INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/electra-model/base/electra-base/model.ckpt\n" ] } ], "source": [ "model = malaya.transformer.load(model = 
'electra')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya_boilerplate.huggingface:downloading frozen huseinzol05/v27-preprocessing/sp10m.cased.v4.vocab\n", "INFO:malaya_boilerplate.huggingface:downloading frozen huseinzol05/v27-preprocessing/sp10m.cased.v4.model\n", "INFO:malaya_boilerplate.huggingface:downloading frozen huseinzol05/v27-preprocessing/bm_1grams.json\n" ] } ], "source": [ "transformer_corrector = malaya.spelling_correction.transformer.encoder(model, sentence_piece = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### To correct a word\n", "\n", "```python\n", "def correct(\n", " self,\n", " word: str,\n", " string: List[str],\n", " index: int = -1,\n", " lookback: int = 5,\n", " lookforward: int = 5,\n", " batch_size: int = 20,\n", "):\n", " \"\"\"\n", " Correct a word within a text, returning the corrected word.\n", "\n", " Parameters\n", " ----------\n", " word: str\n", " string: List[str]\n", " Tokenized string, `word` must be a word inside `string`.\n", " index: int, optional (default=-1)\n", " index of word in the string, if -1, will try to use `string.index(word)`.\n", " lookback: int, optional (default=5)\n", " N words on the left hand side.\n", " if put -1, will take all words on the left hand side.\n", " longer left hand side will take longer to compute.\n", " lookforward: int, optional (default=5)\n", " N words on the right hand side.\n", " if put -1, will take all words on the right hand side.\n", " longer right hand side will take longer to compute.\n", " batch_size: int, optional (default=20)\n", " batch size to insert into model.\n", "\n", " Returns\n", " -------\n", " result: str\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'kepada'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "splitted 
= string1.split()\n", "transformer_corrector.correct('kpd', splitted)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'kerajaan'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transformer_corrector.correct('krajaan', splitted)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 19.2 s, sys: 837 ms, total: 20 s\n", "Wall time: 3.59 s\n" ] }, { "data": { "text/plain": [ "'sikit'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "transformer_corrector.correct('skt', splitted)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 18.8 s, sys: 917 ms, total: 19.7 s\n", "Wall time: 3.78 s\n" ] }, { "data": { "text/plain": [ "'sikit'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "transformer_corrector.correct('skt', splitted, lookback = -1)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 12.6 s, sys: 588 ms, total: 13.2 s\n", "Wall time: 2.37 s\n" ] }, { "data": { "text/plain": [ "'sikit'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "transformer_corrector.correct('skt', splitted, lookback = 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### To correct a sentence\n", "\n", "```python\n", "def correct_text(\n", " self,\n", " text: str,\n", " lookback: int = 5,\n", " lookforward: int = 5,\n", " batch_size: int = 20\n", "):\n", " \"\"\"\n", " Correct all the words within a text, returning the corrected text.\n", "\n", " Parameters\n", " ----------\n", " text: 
str\n", " lookback: int, optional (default=5)\n", " N words on the left hand side.\n", " if put -1, will take all words on the left hand side.\n", " longer left hand side will take longer to compute.\n", " lookforward: int, optional (default=5)\n", " N words on the right hand side.\n", " if put -1, will take all words on the right hand side.\n", " longer right hand side will take longer to compute.\n", " batch_size: int, optional(default=20)\n", " batch size to insert into model.\n", "\n", " Returns\n", " -------\n", " result: str\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'kerajaan patut bagi pencen awal sikit kepada warga emas supaya emosi'" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transformer_corrector.correct_text(string1)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "tokenizer = malaya.tokenizer.Tokenizer()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Husein ska mkn aym dkat kampng Jawa'" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string2" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Husein suka mkn ayam dikota kampung Jawa'" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenized = tokenizer.tokenize(string2)\n", "transformer_corrector.correct_text(' '.join(tokenized))" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Melayu malas ini narration dia sama sahaja macam men are trash . 
True to some , false to some .'" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenized = tokenizer.tokenize(string3)\n", "transformer_corrector.correct_text(' '.join(tokenized))" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'DrM cerita Melayu malas semenjak saya kat University ( early 1980s ) and now as i am edging towards retirement in 4 - 5 years time after a career of being an Engineer , Project Manager , General Manager'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenized = tokenizer.tokenize(string5)\n", "transformer_corrector.correct_text(' '.join(tokenized))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'boleh buntong dalam kelas nlp saye , nanti intch'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenized = tokenizer.tokenize(string6)\n", "transformer_corrector.correct_text(' '.join(tokenized))" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'mulakan salah orang boleh , bila geng itu kena salahkan juga xboleh terima . . 
pelik'" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenized = tokenizer.tokenize(string7)\n", "transformer_corrector.correct_text(' '.join(tokenized))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }