{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial is available as an IPython notebook at [Malaya/example/preprocessing](https://github.com/huseinzol05/Malaya/tree/master/example/preprocessing)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 6.56 s, sys: 1.39 s, total: 7.95 s\n", "Wall time: 9.68 s\n" ] } ], "source": [ "%%time\n", "import malaya" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Available rules\n", "\n", "Social media text from Twitter, Facebook and Instagram is very noisy, and we want to clean it as much as possible so that our models can understand sentence structure better. Malaya standardizes text preprocessing as follows:\n", "\n", "1. Malaya can replace special words with tokens to reduce the curse of dimensionality, e.g. `rm10k` becomes `<money>`.\n", "2. Malaya can put tags around special words, e.g. `#drmahathir` becomes `<hashtag> drmahathir </hashtag>`.\n", "3. Malaya can expand English contractions.\n", "4. Malaya can translate EN words into MS words; this requires a translator callable.\n", "5. Stemming and lemmatizing; this requires a stemmer callable.\n", "6. Normalizing elongated words; this requires a Malaya speller.\n", "7. Expanding hashtags, e.g. `#drmahathir` becomes `dr mahathir`; this requires a segmentation callable.\n", "8. Malaya can add emoji tags if a `demoji` object is provided.\n", "\n", "#### normalize\n", "\n", "Supported `normalize` values:\n", "\n", "1. hashtag\n", "2. cashtag\n", "3. tag\n", "4. user\n", "5. emphasis\n", "6. censored\n", "7. acronym\n", "8. eastern_emoticons\n", "9. rest_emoticons\n", "10. emoji\n", "11. quotes\n", "12. percent\n", "13. repeat_puncts\n", "14. money\n", "15. email\n", "16. phone\n", "17. number\n", "18. allcaps\n", "19. url\n", "20. date\n", "21. time\n", "\n", "You can check the full list of supported values with `malaya.preprocessing.get_normalize()`.\n", "\n", "For example, if you set `money` and `number` and the input string is `RM10k`, the output is `<money>`.\n", "\n", "#### annotate\n", "\n", "Supported `annotate` values:\n", "\n", "1. hashtag\n", "2. allcaps\n", "3. elongated\n", "4. repeated\n", "5. emphasis\n", "6. 
censored\n", "\n", "For example, if you set `hashtag` and the input string is `#drmahathir`, the output is `<hashtag> drmahathir </hashtag>`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "string_1 = 'CANT WAIT for the new season of #mahathirmohamad \\(^o^)/!!! #davidlynch #tvseries :))), TAAAK SAAABAAR!!!'\n", "string_2 = 'kecewanya #johndoe movie and it suuuuucks!!! WASTED RM10... rm10 #badmovies :/'\n", "string_3 = \"@husein: can't wait for the Nov 9 #Sentiment talks! YAAAAAAY !!! :-D http://sentimentsymposium.com/.\"\n", "string_4 = 'aahhh, malasnye nak pegi keje harini #mondayblues'\n", "string_5 = '#drmahathir #najibrazak #1malaysia #mahathirnajib'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Preprocessing Interface\n", "\n", "```python\n", "def preprocessing(\n", "    normalize: List[str] = [\n", "        'url',\n", "        'email',\n", "        'percent',\n", "        'money',\n", "        'phone',\n", "        'user',\n", "        'time',\n", "        'date',\n", "        'number',\n", "    ],\n", "    annotate: List[str] = [\n", "        'allcaps',\n", "        'elongated',\n", "        'repeated',\n", "        'emphasis',\n", "        'censored',\n", "        'hashtag',\n", "    ],\n", "    lowercase: bool = True,\n", "    fix_unidecode: bool = True,\n", "    expand_english_contractions: bool = True,\n", "    translator: Callable = None,\n", "    segmenter: Callable = None,\n", "    stemmer: Callable = None,\n", "    speller: Callable = None,\n", "    demoji: Callable = None,\n", "    **kwargs,\n", "):\n", "    \"\"\"\n", "    Load Preprocessing class.\n", "\n", "    Parameters\n", "    ----------\n", "    normalize: List[str], optional (default=['url', 'email', 'percent', 'money', 'phone', 'user', 'time', 'date', 'number'])\n", "        normalizing tokens, can check all supported normalizing at `malaya.preprocessing.get_normalize()`.\n", "    annotate: List[str], optional (default=['hashtag', 'allcaps', 'elongated', 'repeated', 'emphasis', 'censored'])\n", "        annotate tokens,\n", "        only accept ['hashtag', 'allcaps', 'elongated', 'repeated', 'emphasis', 'censored'].\n", "    lowercase: bool, optional (default=True)\n", "    fix_unidecode: bool, optional (default=True)\n", "        fix unidecode using `ftfy.fix_text`.\n", "    expand_english_contractions: bool, optional (default=True)\n", "        expand english contractions.\n", "    translator: Callable, optional (default=None)\n", "        function to translate EN word to MS word.\n", "    segmenter: Callable, optional (default=None)\n", "        function to segmentize word.\n", "        If provided, it will expand hashtags, #mondayblues == monday blues\n", "    stemmer: Callable, optional (default=None)\n", "        function to stem word.\n", "    speller: object\n", "        spelling correction object, need to have a method `correct` or `normalize_elongated`.\n", "    demoji: object\n", "        demoji object, need to have a method `demoji`.\n", "\n", "    Returns\n", "    -------\n", "    result : malaya.preprocessing.Preprocessing class\n", "    \"\"\"\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load default parameters\n", "\n", "The default parameters are able to translate most English words to Bahasa Malaysia." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 211 ms, sys: 3.78 ms, total: 215 ms\n", "Wall time: 217 ms\n" ] } ], "source": [ "%%time\n", "preprocessing = malaya.preprocessing.preprocessing()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.85 ms, sys: 55 µs, total: 2.91 ms\n", "Wall time: 2.95 ms\n" ] }, { "data": { "text/plain": [ "' tak boleh wait untuk the new season of mahathirmohamad \\\\(^o^)/ ! davidlynch tvseries , taak saabaar ! 
'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_1))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 532 µs, sys: 6 µs, total: 538 µs\n", "Wall time: 552 µs\n" ] }, { "data": { "text/plain": [ "'kecewanya johndoe movie and it suucks ! wasted . badmovies '" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_2))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 595 µs, sys: 9 µs, total: 604 µs\n", "Wall time: 619 µs\n" ] }, { "data": { "text/plain": [ "' : can not wait untuk the sentiment talks ! yaay ! :-d '" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_3))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 452 µs, sys: 13 µs, total: 465 µs\n", "Wall time: 491 µs\n" ] }, { "data": { "text/plain": [ "' aahh , malasnye nak pergi kerja hari ini mondayblues '" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_4))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 396 µs, sys: 1e+03 ns, total: 397 µs\n", "Wall time: 401 µs\n" ] }, { "data": { "text/plain": [ "' drmahathir najibrazak 1 malaysia mahathirnajib '" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_5))" ] }, { "cell_type": "markdown", "metadata": 
{}, "source": [ "### Load default parameters with spelling correction to normalize elongated words.\n", "\n", "The outputs above still contain elongated words such as `taak` and `saabaar`, which are not the original words, so we can use spelling correction to normalize them." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "corrector = malaya.spell.probability()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 181 µs, sys: 19 µs, total: 200 µs\n", "Wall time: 220 µs\n" ] } ], "source": [ "%%time\n", "preprocessing = malaya.preprocessing.preprocessing(speller = corrector)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 909 µs, sys: 16 µs, total: 925 µs\n", "Wall time: 937 µs\n" ] }, { "data": { "text/plain": [ "' tak boleh wait untuk the new season of mahathirmohamad \\\\(^o^)/ ! davidlynch tvseries , tidak sabar ! '" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_1))" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 701 µs, sys: 9 µs, total: 710 µs\n", "Wall time: 719 µs\n" ] }, { "data": { "text/plain": [ "'kecewanya johndoe movie and it sucks ! wasted . badmovies '" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_2))" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 560 µs, sys: 7 µs, total: 567 µs\n", "Wall time: 575 µs\n" ] }, { "data": { "text/plain": [ "' : can not wait untuk the sentiment talks ! yay ! 
:-d '" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_3))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 456 µs, sys: 1 µs, total: 457 µs\n", "Wall time: 463 µs\n" ] }, { "data": { "text/plain": [ "' ah , malasnye nak pergi kerja hari ini mondayblues '" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_4))" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 353 µs, sys: 1 µs, total: 354 µs\n", "Wall time: 357 µs\n" ] }, { "data": { "text/plain": [ "' drmahathir najibrazak 1 malaysia mahathirnajib '" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load default parameters with segmenter to expand hashtags.\n", "\n", "In the outputs above, hashtags such as `drmahathir` and `najibrazak` stay unsegmented. We want to expand them into `dr mahathir` and `najib razak`." 
] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "segmenter = malaya.segmentation.transformer(model = 'small', quantized = True)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'hello suka'" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "segmenter_func = lambda x: segmenter.greedy_decoder([x])[0]\n", "segmenter_func('hellosuka')" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 179 µs, sys: 6 µs, total: 185 µs\n", "Wall time: 194 µs\n" ] } ], "source": [ "%%time\n", "preprocessing = malaya.preprocessing.preprocessing(segmenter = segmenter_func)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 336 ms, sys: 65.9 ms, total: 402 ms\n", "Wall time: 170 ms\n" ] }, { "data": { "text/plain": [ "' tak boleh wait untuk the new season of mahathir mohamad \\\\(^o^)/ ! davidlynch tv series , taak saabaar ! '" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_1))" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 189 ms, sys: 36.4 ms, total: 225 ms\n", "Wall time: 86.9 ms\n" ] }, { "data": { "text/plain": [ "'kecewanya johndoe movie and it suucks ! wasted . 
bad movies '" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_2))" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 82.2 ms, sys: 20.4 ms, total: 103 ms\n", "Wall time: 45.9 ms\n" ] }, { "data": { "text/plain": [ "' : can not wait untuk the sentiment talks ! yaay ! :-d '" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_3))" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 133 ms, sys: 29.9 ms, total: 163 ms\n", "Wall time: 69.1 ms\n" ] }, { "data": { "text/plain": [ "' aahh , malasnye nak pergi kerja hari ini mondayblues '" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_4))" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 373 ms, sys: 73.7 ms, total: 447 ms\n", "Wall time: 177 ms\n" ] }, { "data": { "text/plain": [ "' dr mahathir najib razak 1 malaysia mahathir najib '" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load default parameters with stemming and lemmatization" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "sastrawi = malaya.stem.sastrawi()\n", "stemmer_func = lambda x: sastrawi.stem(x)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'suka'" ] }, "execution_count": 25, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "stemmer_func('sukakan')" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 210 µs, sys: 11 µs, total: 221 µs\n", "Wall time: 227 µs\n" ] } ], "source": [ "%%time\n", "preprocessing = malaya.preprocessing.preprocessing(stemmer = stemmer_func)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 9.15 ms, sys: 190 µs, total: 9.34 ms\n", "Wall time: 9.39 ms\n" ] }, { "data": { "text/plain": [ "' tak boleh wait untuk the new season of mahathirmohamad o davidlynch tvseries taak saabaar '" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_1))" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.85 ms, sys: 1e+03 ns, total: 1.85 ms\n", "Wall time: 1.86 ms\n" ] }, { "data": { "text/plain": [ "'kecewa johndoe movie and it suucks wasted badmovies '" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_2))" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.66 ms, sys: 1e+03 ns, total: 1.66 ms\n", "Wall time: 1.67 ms\n" ] }, { "data": { "text/plain": [ "' can not wait untuk the sentiment talks yaay -d '" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_3))" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['',\n", " 'can',\n", " 'not',\n", " 'wait',\n", " 'untuk',\n", " 
'the',\n", " '',\n", " '',\n", " 'sentiment',\n", " '',\n", " 'talks',\n", " '',\n", " '',\n", " 'yaay',\n", " '',\n", " '',\n", " '',\n", " '',\n", " '-d',\n", " '']" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "preprocessing.process(string_3)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.78 ms, sys: 15 µs, total: 1.8 ms\n", "Wall time: 1.82 ms\n" ] }, { "data": { "text/plain": [ "' aahh malasnye nak pergi kerja hari ini mondayblues '" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_4))" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.01 ms, sys: 11 µs, total: 2.02 ms\n", "Wall time: 2.03 ms\n" ] }, { "data": { "text/plain": [ "' drmahathir najibrazak 1 malaysia mahathirnajib '" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_5))" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.18 ms, sys: 20 µs, total: 2.2 ms\n", "Wall time: 2.23 ms\n" ] }, { "data": { "text/plain": [ "'saya sini jalan pergi ke putrajaya masjidbesi '" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process('saya disini berjalan pergi ke putrajaya, #masjidbesi'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load translation" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('kesakitan', 'aduh')" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"en_ms_vocab = malaya.translation.en_ms.dictionary()\n", "translator = lambda x: en_ms_vocab.get(x, x)\n", "translator('pain'), translator('aduh')" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 121 µs, sys: 10 µs, total: 131 µs\n", "Wall time: 135 µs\n" ] } ], "source": [ "%%time\n", "preprocessing = malaya.preprocessing.preprocessing(translator = translator)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.09 ms, sys: 63 µs, total: 1.15 ms\n", "Wall time: 1.51 ms\n" ] }, { "data": { "text/plain": [ "' tak boleh tunggu untuk yang baru musim daripada mahathirmohamad \\\\(^o^)/ ! davidlynch tvseries , taak saabaar ! '" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_1))" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 550 µs, sys: 8 µs, total: 558 µs\n", "Wall time: 563 µs\n" ] }, { "data": { "text/plain": [ "'kecewanya johndoe filem dan ia suucks ! dibazirkan . badmovies '" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_2))" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 690 µs, sys: 22 µs, total: 712 µs\n", "Wall time: 759 µs\n" ] }, { "data": { "text/plain": [ "' : boleh tidak tunggu untuk yang sentimen talks ! yaay ! 
:-d '" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Use Neural Machine Translation\n", "\n", "The problem with a dictionary-based translator is that if a word does not exist in the dictionary, it is not translated:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('love', 'them', 'kesakitan')" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "translator('love'), translator('them'), translator('pain')" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "nmt = malaya.translation.en_ms.transformer(model = 'small')\n", "nmt_func = lambda x: nmt.greedy_decoder([x])[0]" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('cinta', 'mereka', 'kesakitan')" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nmt_func('love'), nmt_func('them'), nmt_func('pain')" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 112 µs, sys: 4 µs, total: 116 µs\n", "Wall time: 119 µs\n" ] } ], "source": [ "%%time\n", "preprocessing = malaya.preprocessing.preprocessing(translator = nmt_func)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 277 ms, sys: 27.5 ms, total: 305 ms\n", "Wall time: 194 ms\n" ] }, { "data": { "text/plain": [ "' tak boleh tunggu untuk baru musim mahathirmohamad \\\\(^o^)/ ! davidlynch tvseries , taak abaar ! 
'" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_1))" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 149 ms, sys: 14.7 ms, total: 163 ms\n", "Wall time: 105 ms\n" ] }, { "data": { "text/plain": [ "'kecewanya johndoe filem dan ia bernasib baik ! disia . badmovie '" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_2))" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 130 ms, sys: 10.6 ms, total: 141 ms\n", "Wall time: 97.9 ms\n" ] }, { "data": { "text/plain": [ "' : boleh tidak tunggu untuk sentimen ceramah ! yaay ! :-d '" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_3))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }