\n",
"\n",
"This tutorial is available as an IPython notebook at [Malaya/example/preprocessing](https://github.com/huseinzol05/Malaya/tree/master/example/preprocessing).\n",
" \n",
"
"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 3.5 s, sys: 2.3 s, total: 5.8 s\n",
"Wall time: 2.98 s\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/ubuntu/dev/malaya/malaya/tokenizer.py:202: FutureWarning: Possible nested set at position 3361\n",
" self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n",
"/home/ubuntu/dev/malaya/malaya/tokenizer.py:202: FutureWarning: Possible nested set at position 3879\n",
" self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n"
]
}
],
"source": [
"%%time\n",
"import malaya"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Available rules\n",
"\n",
"We know that social media texts from Twitter, Facebook and Instagram are very noisy and we want to clean as much as possible to make our machines understand the structure of sentence much better. In Malaya, we standardize our text preprocessing,\n",
"\n",
"1. Malaya can replace special words into tokens to reduce dimension curse. `rm10k` become ``.\n",
"3. Malaya can put tags for special words, `#drmahathir` become ` drmahathir `.\n",
"4. Malaya can expand english contractions.\n",
"5. Expand hashtags, `#drmahathir` become `dr mahathir`, required a segmentation callable.\n",
"6. Malaya can put emoji tags if provide `demoji` object.\n",
"\n",
"#### normalize\n",
"\n",
"Supported `normalize`,\n",
"\n",
"1. hashtag\n",
"2. cashtag\n",
"3. tag\n",
"4. user\n",
"5. emphasis\n",
"6. censored\n",
"7. acronym\n",
"8. eastern_emoticons\n",
"9. rest_emoticons\n",
"10. emoji\n",
"11. quotes\n",
"12. percent\n",
"13. repeat_puncts\n",
"14. money\n",
"15. email\n",
"16. phone\n",
"17. number\n",
"18. allcaps\n",
"19. url\n",
"20. date\n",
"21. time\n",
"\n",
"You can check all supported list at `malaya.preprocessing.get_normalize()`.\n",
"\n",
"Example, if you set `money` and `number`, and input string is `RM10k`, the output is ``.\n",
"\n",
"#### annotate\n",
"\n",
"Supported `annotate`,\n",
"\n",
"1. hashtag\n",
"2. allcaps\n",
"3. elongated\n",
"4. repeated\n",
"5. emphasis\n",
"6. censored\n",
"\n",
"Example, if you set `hashtag`, and input string is `#drmahathir`, the output is ` drmahathir `."
]
},
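{
"cell_type": "markdown",
"metadata": {},
"source": [
"The annotation step can be pictured as wrapping a matched word in an opening and a closing tag. The sketch below does this for hashtags and all-caps words only. The tag spellings (`<hashtag>`, `<allcaps>`), the patterns and the helper `annotate_tokens` are assumptions for illustration, not Malaya's implementation.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"# Hypothetical patterns for two of the annotate rules.\n",
"HASHTAG = re.compile(r'#(\\w+)')\n",
"ALLCAPS = re.compile(r'\\b[A-Z]{2,}\\b')\n",
"\n",
"def annotate_tokens(text):\n",
"    # Wrap hashtags first, dropping the leading '#'.\n",
"    text = HASHTAG.sub(r'<hashtag> \\1 </hashtag>', text)\n",
"    # Then wrap words written entirely in capital letters.\n",
"    return ALLCAPS.sub(lambda m: '<allcaps> {} </allcaps>'.format(m.group(0)), text)\n",
"\n",
"print(annotate_tokens('WASTED #drmahathir'))\n",
"```"
]
},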
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"string_1 = 'CANT WAIT for the new season of #mahathirmohamad \(^o^)/!!! #davidlynch #tvseries :))), TAAAK SAAABAAR!!!'\n",
"string_2 = 'kecewanya #johndoe movie and it suuuuucks!!! WASTED RM10... rm10 #badmovies :/'\n",
"string_3 = \"@husein: can't wait for the Nov 9 #Sentiment talks! YAAAAAAY !!! :-D http://sentimentsymposium.com/.\"\n",
"string_4 = 'aahhh, malasnye nak pegi keje harini #mondayblues'\n",
"string_5 = '#drmahathir #najibrazak #1malaysia #mahathirnajib'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Preprocessing Interface\n",
"\n",
"```python\n",
"def preprocessing(\n",
" normalize: List[str] = [\n",
" 'url',\n",
" 'email',\n",
" 'percent',\n",
" 'money',\n",
" 'phone',\n",
" 'user',\n",
" 'time',\n",
" 'date',\n",
" 'number',\n",
" ],\n",
" annotate: List[str] = [\n",
" 'allcaps',\n",
" 'elongated',\n",
" 'repeated',\n",
" 'emphasis',\n",
" 'censored',\n",
" 'hashtag',\n",
" ],\n",
" lowercase: bool = True,\n",
" fix_unidecode: bool = True,\n",
" expand_english_contractions: bool = True,\n",
" segmenter: Callable = None,\n",
" demoji: Callable = None,\n",
" **kwargs,\n",
"):\n",
" \"\"\"\n",
" Load Preprocessing class.\n",
"\n",
" Parameters\n",
" ----------\n",
" normalize: List[str], optional (default=['url', 'email', 'percent', 'money', 'phone', 'user', 'time', 'date', 'number'])\n",
" normalizing tokens, can check all supported normalizing at `malaya.preprocessing.get_normalize()`.\n",
" annotate: List[str], optional (default=['hashtag', 'allcaps', 'elongated', 'repeated', 'emphasis', 'censored'])\n",
" annonate tokens ,\n",
" only accept ['hashtag', 'allcaps', 'elongated', 'repeated', 'emphasis', 'censored'].\n",
" lowercase: bool, optional (default=True)\n",
" fix_unidecode: bool, optional (default=True)\n",
" fix unidecode using `ftfy.fix_text`.\n",
" expand_english_contractions: bool, optional (default=True)\n",
" expand english contractions.\n",
" segmenter: Callable, optional (default=None)\n",
" function to segmentize word.\n",
" If provide, it will expand hashtags, #mondayblues == monday blues\n",
" demoji: object\n",
" demoji object, need to have a method `demoji`.\n",
"\n",
" Returns\n",
" -------\n",
" result : malaya.preprocessing.Preprocessing class\n",
" \"\"\"\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load default paramaters\n",
"\n",
"default parameters able to translate most of english to bahasa malaysia."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 28.3 ms, sys: 68 µs, total: 28.4 ms\n",
"Wall time: 28.2 ms\n"
]
}
],
"source": [
"%%time\n",
"preprocessing = malaya.preprocessing.preprocessing()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 576 µs, sys: 1.45 ms, total: 2.02 ms\n",
"Wall time: 2.03 ms\n"
]
},
{
"data": {
"text/plain": [
"' CANT WAIT untuk the new season of #mahathirmohamad \\\\(^o^)/ ! #davidlynch #tvseries , TAAAK SAAABAAR ! '"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"' '.join(preprocessing.process(string_1))"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 466 µs, sys: 0 ns, total: 466 µs\n",
"Wall time: 470 µs\n"
]
},
{
"data": {
"text/plain": [
"'kecewanya #johndoe movie and it suuuuucks ! WASTED . #badmovies '"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"' '.join(preprocessing.process(string_2))"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 408 µs, sys: 0 ns, total: 408 µs\n",
"Wall time: 412 µs\n"
]
},
{
"data": {
"text/plain": [
"' : can not wait untuk the #Sentiment talks ! YAAAAAAY ! '"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"' '.join(preprocessing.process(string_3))"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 214 µs, sys: 0 ns, total: 214 µs\n",
"Wall time: 217 µs\n"
]
},
{
"data": {
"text/plain": [
"'aahhh , malasnye nak pergi kerja hari ini #mondayblues '"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"' '.join(preprocessing.process(string_4))"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 152 µs, sys: 90 µs, total: 242 µs\n",
"Wall time: 244 µs\n"
]
},
{
"data": {
"text/plain": [
"' #drmahathir #najibrazak #1malaysia #mahathirnajib '"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"' '.join(preprocessing.process(string_5))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load default paramaters with segmenter to expand hashtags.\n",
"\n",
"We saw ` drmahathir najibrazak `, we want to expand to become `dr mahathir` and `najib razak`."
]
},
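{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `segmenter` argument is a callable that takes a single word and returns it with spaces inserted. Before loading a model, that contract can be illustrated with a small dictionary-based segmenter. The vocabulary `VOCAB` and the function `simple_segmenter` are hypothetical; this is plain dynamic-programming word segmentation, not the method used by Malaya's transformer models.\n",
"\n",
"```python\n",
"# Hypothetical toy vocabulary; a real segmenter would use a trained model.\n",
"VOCAB = {'dr', 'mahathir', 'najib', 'razak', 'monday', 'blues'}\n",
"\n",
"def simple_segmenter(word):\n",
"    # best[i] holds a list of vocabulary words covering word[:i], or None.\n",
"    best = [None] * (len(word) + 1)\n",
"    best[0] = []\n",
"    for end in range(1, len(word) + 1):\n",
"        for start in range(end):\n",
"            piece = word[start:end]\n",
"            if best[start] is not None and piece in VOCAB:\n",
"                candidate = best[start] + [piece]\n",
"                # Prefer the split that uses the fewest words.\n",
"                if best[end] is None or len(candidate) < len(best[end]):\n",
"                    best[end] = candidate\n",
"    # Fall back to the unsegmented word when no full split exists.\n",
"    return ' '.join(best[-1]) if best[-1] else word\n",
"\n",
"print(simple_segmenter('drmahathir'))  # dr mahathir\n",
"```\n",
"\n",
"A callable like this could be passed as `malaya.preprocessing.preprocessing(segmenter = simple_segmenter)`, although a toy vocabulary will leave most real hashtags untouched."
]
},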
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"segmenter = malaya.segmentation.transformer(model = 'small', quantized = True)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2022-09-01 15:22:19.953149: I tensorflow/stream_executor/cuda/cuda_blas.cc:1760] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.\n"
]
},
{
"data": {
"text/plain": [
"'hello suka'"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"segmenter_func = lambda x: segmenter.greedy_decoder([x])[0]\n",
"segmenter_func('hellosuka')"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 69 µs, sys: 0 ns, total: 69 µs\n",
"Wall time: 73 µs\n"
]
}
],
"source": [
"%%time\n",
"preprocessing = malaya.preprocessing.preprocessing(segmenter = segmenter_func)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 113 ms, sys: 0 ns, total: 113 ms\n",
"Wall time: 166 ms\n"
]
},
{
"data": {
"text/plain": [
"' CANT WAIT untuk the new season of mahathir mohamad \\\\(^o^)/ ! davidlynch tv series , TAAAK SAAABAAR ! '"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"' '.join(preprocessing.process(string_1))"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 62 ms, sys: 3.12 ms, total: 65.2 ms\n",
"Wall time: 90.7 ms\n"
]
},
{
"data": {
"text/plain": [
"'kecewanya johndoe movie and it suuuuucks ! WASTED . bad movies '"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"' '.join(preprocessing.process(string_2))"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 27.7 ms, sys: 3.83 ms, total: 31.5 ms\n",
"Wall time: 45.1 ms\n"
]
},
{
"data": {
"text/plain": [
"' : can not wait untuk the Sentiment talks ! YAAAAAAY ! '"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"' '.join(preprocessing.process(string_3))"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 37.9 ms, sys: 6.61 ms, total: 44.6 ms\n",
"Wall time: 66.3 ms\n"
]
},
{
"data": {
"text/plain": [
"'aahhh , malasnye nak pergi kerja hari ini mondayblues '"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"' '.join(preprocessing.process(string_4))"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 123 ms, sys: 6.86 ms, total: 129 ms\n",
"Wall time: 182 ms\n"
]
},
{
"data": {
"text/plain": [
"' dr mahathir najib razak 1malaysia mahathir najib '"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"' '.join(preprocessing.process(string_5))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
},
"varInspector": {
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"delete_cmd_postfix": "",
"delete_cmd_prefix": "del ",
"library": "var_list.py",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"delete_cmd_postfix": ") ",
"delete_cmd_prefix": "rm(",
"library": "var_list.r",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}