{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Stemmer and Lemmatization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/stemmer](https://github.com/huseinzol05/Malaya/tree/master/example/stemmer).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module only trained on standard language structure, so it is not save to use it for local language structure.\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4.81 s, sys: 652 ms, total: 5.47 s\n", "Wall time: 4.44 s\n" ] } ], "source": [ "%%time\n", "import malaya" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "string = 'Benda yg SALAH ni, jgn lah didebatkan. Yg SALAH xkan jadi betul. Ingat tu. Mcm mana kesat sekalipun org sampaikan mesej, dan memang benda tu salah, diam je. Xyah nk tunjuk kau open sangat nk tegur cara org lain berdakwah'\n", "another_string = 'melayu bodoh, dah la gay, sokong lgbt lagi, memang tak guna, http://twitter.com'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use deep learning model\n", "\n", "Load LSTM + Bahdanau Attention stemming model, this also include lemmatization.\n", "\n", "If you are using Tensorflow 2, make sure Tensorflow Addons already installed,\n", "\n", "```bash\n", "pip install tensorflow-addons U\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```python\n", "def deep_model(quantized: bool = False, **kwargs):\n", " \"\"\"\n", " Load LSTM + Bahdanau Attention stemming model, this also include lemmatization.\n", " Original size 41.6MB, quantized size 10.6MB .\n", "\n", " Parameters\n", " ----------\n", " quantized : bool, optional (default=False)\n", " if True, will load 8-bit quantized model. \n", " Quantized model not necessary faster, totally depends on the machine.\n", "\n", " Returns\n", " -------\n", " result: malaya.stem.DEEP_STEMMER class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": true }, "outputs": [], "source": [ "model = malaya.stem.deep_model()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Quantized model\n", "\n", "To load 8-bit quantized model, simply pass `quantized = True`, default is `False`.\n", "\n", "We can expect slightly accuracy drop from quantized model, and not necessary faster than normal 32-bit float model, totally depends on machine." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:root:Load quantized model will cause accuracy drop.\n" ] } ], "source": [ "quantized_model = malaya.stem.deep_model(quantized = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Stem and lemmatization\n", "\n", "```python\n", "def stem(self, string: str, beam_search: bool = True):\n", " \"\"\"\n", " Stem a string, this also include lemmatization.\n", "\n", " Parameters\n", " ----------\n", " string : str\n", " beam_search : bool, (optional=True)\n", " If True, use beam search decoder, else use greedy decoder.\n", "\n", " Returns\n", " -------\n", " result: str\n", " \"\"\"\n", "```\n", "\n", "If want to speed up the inference, set `beam_search = False`." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.22 s, sys: 305 ms, total: 1.52 s\n", "Wall time: 540 ms\n" ] }, { "data": { "text/plain": [ "'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "model.stem(string)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 285 ms, sys: 102 ms, total: 387 ms\n", "Wall time: 289 ms\n" ] }, { "data": { "text/plain": [ "'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "model.stem(string, beam_search = False)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.29 s, sys: 230 ms, total: 1.52 s\n", "Wall time: 573 ms\n" ] }, { "data": { "text/plain": [ "'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "quantized_model.stem(string)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 331 ms, sys: 105 ms, total: 436 ms\n", "Wall time: 329 ms\n" ] }, { "data": { "text/plain": [ "'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "quantized_model.stem(string, beam_search = False)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'layu bodoh , dah la gay , sokong lgbt lagi , memang tak guna , http://twitter.com'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.stem(another_string)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'layu bodoh , dah la gay , sokong lgbt lagi , memang tak guna , http://twitter.com'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "quantized_model.stem(another_string)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'saya seru'" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.stem('saya menyerukanlah')" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'saya seru'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "quantized_model.stem('saya menyerukanlah')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use Sastrawi stemmer\n", "\n", "Malaya also included interface for [Sastrawi stemmer](https://pypi.org/project/PySastrawi/). We use it for internal purpose. To use it, simply,\n", "\n", "```python\n", "def sastrawi():\n", " \"\"\"\n", " Load stemming model using Sastrawi, this also include lemmatization.\n", "\n", " Returns\n", " -------\n", " result: malaya.stem.SASTRAWI class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "sastrawi = malaya.stem.sastrawi()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'saya seru'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sastrawi.stem('saya menyerukanlah')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'tarik'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sastrawi.stem('menarik')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'melayu bodoh dah la gay sokong lgbt lagi memang tak guna http twitter com'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sastrawi.stem(another_string)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But it not able to maintain words like url, hashtag, money, datetime and user mention." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use Naive stemmer\n", "\n", "Simply use regex pattern to do stemming. This method not able to lemmatize.\n", "\n", "```python\n", "def naive():\n", " \"\"\"\n", " Load stemming model using startswith and endswith naively using regex patterns.\n", "\n", " Returns\n", " -------\n", " result : malaya.stem.NAIVE class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "naive = malaya.stem.naive()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'saya yerukan'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "naive.stem('saya menyerukanlah')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'arik'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "naive.stem('menarik')" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'layu bodoh , dah la gay , sokong lgbt lagi , ang tak guna , http://twitter.com'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "naive.stem(another_string)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }