{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial is available as an IPython notebook at [Malaya/example/preprocessing](https://github.com/huseinzol05/Malaya/tree/master/example/preprocessing)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.74 s, sys: 3.78 s, total: 6.52 s\n", "Wall time: 1.95 s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n", "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n" ] } ], "source": [ "%%time\n", "import malaya" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Available rules\n", "\n", "Social media text from Twitter, Facebook and Instagram is very noisy, so we want to clean it as much as possible to help our models understand sentence structure. Malaya standardizes text preprocessing as follows:\n", "\n", "1. Malaya can replace special words with tokens to reduce the curse of dimensionality, e.g. `rm10k` becomes `<money>`.\n", "2. Malaya can put tags around special words, e.g. `#drmahathir` becomes `<hashtag> drmahathir </hashtag>`.\n", "3. Malaya can expand English contractions.\n", "4. Malaya can expand hashtags, e.g. `#drmahathir` becomes `dr mahathir`; this requires a segmentation callable.\n", "5. Malaya can add emoji tags if a `demoji` object is provided.\n", "\n", "#### normalize\n", "\n", "Supported `normalize` values:\n", "\n", "1. hashtag\n", "2. cashtag\n", "3. tag\n", "4. user\n", "5. emphasis\n", "6. censored\n", "7. acronym\n", "8. eastern_emoticons\n", "9. rest_emoticons\n", "10. emoji\n", "11. quotes\n", "12. percent\n", "13. repeat_puncts\n", "14. money\n", "15. email\n", "16. phone\n", "17. number\n", "18. allcaps\n", "19. url\n", "20. date\n", "21. 
time\n", "\n", "You can list all supported values with `malaya.preprocessing.get_normalize()`.\n", "\n", "For example, if you set `money` and `number` and the input string is `RM10k`, the output is `<money>`.\n", "\n", "#### annotate\n", "\n", "Supported `annotate` values:\n", "\n", "1. hashtag\n", "2. allcaps\n", "3. elongated\n", "4. repeated\n", "5. emphasis\n", "6. censored\n", "\n", "For example, if you set `hashtag` and the input string is `#drmahathir`, the output is `<hashtag> drmahathir </hashtag>`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "string_1 = 'CANT WAIT for the new season of #mahathirmohamad \\(^o^)/!!! #davidlynch #tvseries :))), TAAAK SAAABAAR!!!'\n", "string_2 = 'kecewanya #johndoe movie and it suuuuucks!!! WASTED RM10... rm10 #badmovies :/'\n", "string_3 = \"@husein: can't wait for the Nov 9 #Sentiment talks! YAAAAAAY !!! :-D http://sentimentsymposium.com/.\"\n", "string_4 = 'aahhh, malasnye nak pegi keje harini #mondayblues'\n", "string_5 = '#drmahathir #najibrazak #1malaysia #mahathirnajib'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Preprocessing Interface\n", "\n", "```python\n", "def preprocessing(\n", "    normalize: List[str] = [\n", "        'url',\n", "        'email',\n", "        'percent',\n", "        'money',\n", "        'phone',\n", "        'user',\n", "        'time',\n", "        'date',\n", "        'number',\n", "    ],\n", "    annotate: List[str] = [\n", "        'allcaps',\n", "        'elongated',\n", "        'repeated',\n", "        'emphasis',\n", "        'censored',\n", "        'hashtag',\n", "    ],\n", "    lowercase: bool = True,\n", "    fix_unidecode: bool = True,\n", "    expand_english_contractions: bool = True,\n", "    segmenter: Callable = None,\n", "    demoji: Callable = None,\n", "    **kwargs,\n", "):\n", "    \"\"\"\n", "    Load Preprocessing class.\n", "\n", "    Parameters\n", "    ----------\n", "    normalize: List[str], optional (default=['url', 'email', 'percent', 'money', 'phone', 'user', 'time', 'date', 'number'])\n", "        normalizing tokens, can check all supported normalizing at `malaya.preprocessing.get_normalize()`.\n", "    annotate: List[str], optional (default=['hashtag', 'allcaps', 'elongated', 'repeated', 'emphasis', 'censored'])\n", "        annotate tokens,\n", "        only accept ['hashtag', 'allcaps', 'elongated', 'repeated', 'emphasis', 'censored'].\n", "    lowercase: bool, optional (default=True)\n", "    fix_unidecode: bool, optional (default=True)\n", "        fix unidecode using `ftfy.fix_text`.\n", "    expand_english_contractions: bool, optional (default=True)\n", "        expand english contractions.\n", "    segmenter: Callable, optional (default=None)\n", "        function to segment a word.\n", "        If provided, it will expand hashtags, #mondayblues == monday blues\n", "    demoji: object\n", "        demoji object, need to have a method `demoji`.\n", "\n", "    Returns\n", "    -------\n", "    result : malaya.preprocessing.Preprocessing class\n", "    \"\"\"\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load default parameters\n", "\n", "The default parameters also translate some common English words to Bahasa Malaysia; in the outputs below, `for` becomes `untuk`."
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 15.3 ms, sys: 0 ns, total: 15.3 ms\n", "Wall time: 15.2 ms\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/dev/malaya/malaya/preprocessing.py:41: FutureWarning: Possible nested set at position 42\n", " k.lower(): re.compile(_expressions[k]) for k, v in _expressions.items()\n", "/home/husein/dev/malaya/malaya/preprocessing.py:41: FutureWarning: Possible nested set at position 3\n", " k.lower(): re.compile(_expressions[k]) for k, v in _expressions.items()\n" ] } ], "source": [ "%%time\n", "preprocessing = malaya.preprocessing.preprocessing()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.3 ms, sys: 0 ns, total: 1.3 ms\n", "Wall time: 1.31 ms\n" ] }, { "data": { "text/plain": [ "' CANT WAIT untuk the new season of #mahathirmohamad \\\\(^o^)/ ! #davidlynch #tvseries :))) , TAAAK SAAABAAR ! '" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_1))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 279 µs, sys: 0 ns, total: 279 µs\n", "Wall time: 280 µs\n" ] }, { "data": { "text/plain": [ "'kecewanya #johndoe movie and it suuuuucks ! WASTED . #badmovies :/'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_2))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 253 µs, sys: 0 ns, total: 253 µs\n", "Wall time: 254 µs\n" ] }, { "data": { "text/plain": [ "' : can not wait untuk the #Sentiment talks ! YAAAAAAY ! 
:-D '" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_3))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 160 µs, sys: 0 ns, total: 160 µs\n", "Wall time: 161 µs\n" ] }, { "data": { "text/plain": [ "'aahhh , malasnye nak pergi kerja hari ini #mondayblues '" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_4))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 200 µs, sys: 0 ns, total: 200 µs\n", "Wall time: 201 µs\n" ] }, { "data": { "text/plain": [ "' #drmahathir #najibrazak #1malaysia #mahathirnajib '" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load default parameters with segmenter to expand hashtags\n", "\n", "Without a segmenter, hashtags such as `#drmahathir` and `#najibrazak` stay as single tokens. Passing a segmenter expands them to `dr mahathir` and `najib razak`." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`, it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again. You will see the new `added_tokens_decoder` attribute that will store the relevant information.\n", "You are using the default legacy behaviour of the . If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. 
If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565\n" ] } ], "source": [ "segmenter = malaya.segmentation.huggingface()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "spaces_between_special_tokens is deprecated and will be removed in transformers v5. It was adding spaces between `added_tokens`, not special tokens, and does not exist in our fast implementation. Future tokenizers will handle the decoding process on a per-model rule.\n" ] }, { "data": { "text/plain": [ "'hello suka'" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "segmenter_func = lambda x: segmenter.generate([x], max_length = 100)[0]\n", "segmenter_func('hellosuka')" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 135 µs, sys: 122 µs, total: 257 µs\n", "Wall time: 263 µs\n" ] } ], "source": [ "%%time\n", "preprocessing = malaya.preprocessing.preprocessing(segmenter = segmenter_func)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.29 s, sys: 4.45 ms, total: 1.3 s\n", "Wall time: 116 ms\n" ] }, { "data": { "text/plain": [ "' CANT WAIT untuk the new season of mahathir mohamad \\\\(^o^)/ ! david lynch tv series :))) , TAAAK SAAABAAR ! 
'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_1))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.14 s, sys: 3.57 ms, total: 1.15 s\n", "Wall time: 102 ms\n" ] }, { "data": { "text/plain": [ "'kecewanya john doe movie and it suuuuucks ! WASTED . bad movies :/'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_2))" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 460 ms, sys: 0 ns, total: 460 ms\n", "Wall time: 43.7 ms\n" ] }, { "data": { "text/plain": [ "' : can not wait untuk the Sentiment talks ! YAAAAAAY ! :-D '" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_3))" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 544 ms, sys: 0 ns, total: 544 ms\n", "Wall time: 49.3 ms\n" ] }, { "data": { "text/plain": [ "'aahhh , malasnye nak pergi kerja hari ini isnin blues '" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_4))" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.75 s, sys: 0 ns, total: 1.75 s\n", "Wall time: 154 ms\n" ] }, { "data": { "text/plain": [ "' dr mahathir najib razak 1malaysia mahathir najib '" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "' '.join(preprocessing.process(string_5))" ] } 
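, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load custom normalize and annotate rules\n", "\n", "The `normalize` and `annotate` lists in the interface above can also be narrowed. The cell below is a minimal sketch, assuming both arguments accept any subset of the supported values listed earlier. The variable name `custom_preprocessing` and the input string are made up for illustration, and no output is shown because the cell has not been executed here." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import malaya\n", "\n", "# Sketch only: keep money/number normalization and hashtag annotation,\n", "# and pass nothing else in either list.\n", "custom_preprocessing = malaya.preprocessing.preprocessing(\n", "    normalize = ['money', 'number'],\n", "    annotate = ['hashtag'],\n", ")\n", "\n", "# Join the processed tokens into one string, as in the cells above.\n", "' '.join(custom_preprocessing.process('WASTED RM10 on #badmovies'))" ] }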
], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }