{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Stemmer and Lemmatization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/stemmer](https://github.com/huseinzol05/Malaya/tree/master/example/stemmer).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module only trained on standard language structure, so it is not save to use it for local language structure.\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "os.environ['CUDA_VISIBLE_DEVICES'] = ''" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.\n", " warn(\"The installed version of bitsandbytes was compiled without GPU support. \"\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32\n", "CPU times: user 3.11 s, sys: 2.59 s, total: 5.7 s\n", "Wall time: 2.96 s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n", "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n" ] } ], "source": [ "%%time\n", "import malaya" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "string = 'Benda yg SALAH ni, jgn lah didebatkan. Yg SALAH xkan jadi betul. Ingat tu. Mcm mana kesat sekalipun org sampaikan mesej, dan memang benda tu salah, diam je. Xyah nk tunjuk kau open sangat nk tegur cara org lain berdakwah'\n", "another_string = 'melayu bodoh, dah la gay, sokong lgbt lagi, memang tak guna, http://twitter.com @kesedihan rm15'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use Naive stemmer\n", "\n", "Simply use regex pattern to do stemming. This method not able to lemmatize.\n", "\n", "```python\n", "def naive():\n", " \"\"\"\n", " Load stemming model using startswith and endswith naively using regex patterns.\n", "\n", " Returns\n", " -------\n", " result : malaya.stem.NAIVE class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "naive = malaya.stem.naive()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Stem and lemmatization\n", "\n", "```python\n", "def stem(self, string: str):\n", " \"\"\"\n", " Stem a string using Regex pattern.\n", "\n", " Parameters\n", " ----------\n", " string : str\n", "\n", " Returns\n", " -------\n", " result: str\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/dev/malaya/malaya/model/stem.py:28: FutureWarning: Possible nested set at position 3\n", " or re.findall(_expressions['ic'], word.lower())\n" ] }, { "data": { "text/plain": [ "'saya yerukan'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "naive.stem('saya menyerukanlah')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'arik'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "naive.stem('menarik')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'slh'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "naive.stem('slhlah')" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Benda yg SALAH ni , jgn lah debat . Yg SALAH x jadi betul . Ingat tu . Mcm mana sat kalipun org sampai sej , dan ang benda tu sa , am je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "naive.stem(string)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'layu bodoh , dah la gay , sokong lgbt lagi , ang tak guna , http://twitter.com @kesedihan rm15'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "naive.stem(another_string)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use Sastrawi stemmer\n", "\n", "Malaya also included interface for https://pypi.org/project/PySastrawi/. We use it for internal purpose. To use it, simply,\n", "\n", "```python\n", "def sastrawi():\n", " \"\"\"\n", " Load stemming model using Sastrawi, this also include lemmatization.\n", "\n", " Returns\n", " -------\n", " result: malaya.stem.SASTRAWI class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "sastrawi = malaya.stem.sastrawi()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Stem and lemmatization\n", "\n", "```python\n", "def stem(self, string: str):\n", " \"\"\"\n", " Stem a string using Sastrawi, this also include lemmatization.\n", "\n", " Parameters\n", " ----------\n", " string : str\n", "\n", " Returns\n", " -------\n", " result: str\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'saya seru'" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sastrawi.stem('saya menyerukanlah')" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'slhlah'" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sastrawi.stem('slhlah')" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sastrawi.stem(string)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'melayu bodoh , dah la gay , sokong lgbt lagi , memang tak guna , http://twitter.com @kesedihan rm15'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sastrawi.stem(another_string)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available HuggingFace models" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'mesolitica/stem-lstm-512': {'Size (MB)': 35.2,\n", " 'hidden size': 512,\n", " 'CER': 0.02549779186652238,\n", " 'WER': 0.05448552235248484},\n", " 'mesolitica/stem-gru-bahdanau-1024': {'Size (MB)': 83.7,\n", " 'vocab size': 1000,\n", " 'hidden size': 1024,\n", " 'CER': 0.07082863511793107,\n", " 'WER': 0.11684768403456935}}" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.stem.available_huggingface" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Trained on train set and tested on test set, https://github.com/huseinzol05/malay-dataset/tree/master/normalization/stemmer\n" ] } ], "source": [ "print(malaya.stem.info)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use HuggingFace model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```python\n", "def huggingface(\n", " model: str = 'mesolitica/stem-lstm-512',\n", " force_check: bool = True,\n", " **kwargs,\n", "):\n", " \"\"\"\n", " Load HuggingFace model to stem and lemmatization.\n", "\n", " Parameters\n", " ----------\n", " model: str, optional (default='mesolitica/stem-lstm-512')\n", " Check available models at `malaya.stem.available_huggingface`.\n", " force_check: bool, optional (default=True)\n", " Force check model one of malaya model.\n", " Set to False if you have your own huggingface model.\n", "\n", " Returns\n", " -------\n", " result: malaya.torch_model.rnn.Stem\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n" ] } ], "source": [ "model = malaya.stem.huggingface()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Stem and lemmatization\n", "\n", "```python\n", "def stem(self, string: str, beam_search: bool = True):\n", " \"\"\"\n", " Stem a string, this also include lemmatization.\n", "\n", " Parameters\n", " ----------\n", " string : str\n", " beam_search : bool, (optional=True)\n", " If True, use beam search decoder, else use greedy decoder.\n", "\n", " Returns\n", " -------\n", " result: str\n", " \"\"\"\n", "```\n", "\n", "If want to speed up the inference, set `beam_search = False`." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "spaces_between_special_tokens is deprecated and will be removed in transformers v5. It was adding spaces between `added_tokens`, not special tokens, and does not exist in our fast implementation. Future tokenizers will handle the decoding process on a per-model rule.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 7.48 s, sys: 98 ms, total: 7.58 s\n", "Wall time: 712 ms\n" ] }, { "data": { "text/plain": [ "'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana sat sekal ogr sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "model.stem(string)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'melayu bodoh , dah la gay , sokong lgbt lagi , memang tak guna , http://twitter.com @kesedihan rm15'" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.stem(another_string)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'saya seru'" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.stem('saya menyerukanlah')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sensitive towards local language structure\n", "\n", "Let us compare stemming results using Facebook comments." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "string1 = 'mulakn slh org boleh ,bila geng tuh kena slhkn jgk xboleh trima .. pelik , dia slhkn org bole hri2 crta sakau then bila kna bls balik xdpt jwb ,kata mcm biasa slh (parti sampah) 🤣🤣🤣 jgn mulakn dlu slhkn org kalau xboleh trima bila kna bls balik 🤣🤣🤣'" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "string2 = 'berehatlh najib.. sudah2 lh tu.. jgn buat rakyat hilang kepercyaan tu pda system kehakiman negara.. klu btl x slh kenapa x dibuktikan semasa sblm rayuan.. sudah lah tu kami dh letih dengan drama korang. ok'" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'mula slh org boleh , bila geng tuh kena slh jgk xboleh trima . . pelik , dia slh org bole hri crta sakau then bila kna bls balik xdpt jwb , kata mcm biasa slh ( parti sampah ) 🤣 🤣 🤣 jgn mula dlu slh org kalau xboleh trima bila kna bls balik 🤣 🤣 🤣'" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.stem(string1)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'mulakn slh org boleh , bila geng tuh kena slhkn jgk xboleh trima . . pelik , dia slhkn org bole hri2 crta sakau then bila kna bls balik xdpt jwb , kata mcm biasa slh ( parti sampah ) 🤣 🤣 🤣 jgn mulakn dlu slhkn org kalau xboleh trima bila kna bls balik 🤣 🤣 🤣'" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sastrawi.stem(string1)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'mul slh org boleh , bila geng tuh na sl jgk xboleh trima . . lik , a sl org bole hri2 crta sakau then bila kna bls balik xdpt jwb , kata mcm biasa slh ( parti sampah ) 🤣 🤣 🤣 jgn mul dlu sl org kalau xboleh trima bila kna bls balik 🤣 🤣 🤣'" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "naive.stem(string1)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'rehat najib . . sudah lh tu . . jgn buat rakyat hilang percya tu pda system hakim negara . . klu btl xd slh napa x bukti semasa sblm rayu . . sudah lah tu kami dh letih dengan drama korang . ok'" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.stem(string2)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'berehatlh najib . . sudah2 lh tu . . jgn buat rakyat hilang kepercyaan tu pda system hakim negara . . klu btl x slh kenapa x bukti masa sblm rayu . . sudah lah tu kami dh letih dengan drama korang . ok'" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sastrawi.stem(string2)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'eha najib . . sudah2 lh tu . . jgn buat rakyat hilang percya tu pda system hakim negara . . klu btl x slh napa x bukti masa sblm rayu . . sudah lah tu kami dh letih deng drama korang . ok'" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "naive.stem(string2)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }