Stemmer and Lemmatization#

This tutorial is available as an IPython notebook at Malaya/example/stemmer.

This module is only trained on standard language structure, so it is not safe to use it on local (non-standard) language structure.

[1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''
[2]:
import logging

logging.basicConfig(level=logging.INFO)
[3]:
%%time
import malaya
CPU times: user 3.01 s, sys: 2.5 s, total: 5.51 s
Wall time: 2.53 s
/home/ubuntu/dev/malaya/malaya/tokenizer.py:202: FutureWarning: Possible nested set at position 3361
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/ubuntu/dev/malaya/malaya/tokenizer.py:202: FutureWarning: Possible nested set at position 3879
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
[4]:
string = 'Benda yg SALAH ni, jgn lah didebatkan. Yg SALAH xkan jadi betul. Ingat tu. Mcm mana kesat sekalipun org sampaikan mesej, dan memang benda tu salah, diam je. Xyah nk tunjuk kau open sangat nk tegur cara org lain berdakwah'
another_string = 'melayu bodoh, dah la gay, sokong lgbt lagi, memang tak guna, http://twitter.com @kesedihan rm15'

Use Naive stemmer#

This simply uses regex patterns to do stemming. This method is not able to lemmatize.

def naive():
    """
    Load stemming model using startswith and endswith naively using regex patterns.

    Returns
    -------
    result : malaya.stem.NAIVE class
    """
[5]:
naive = malaya.stem.naive()

Stemming and lemmatization#

def stem(self, string: str):
    """
    Stem a string using Regex pattern.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: str
    """
[6]:
naive.stem('saya menyerukanlah')
[6]:
'saya yerukan'
[7]:
naive.stem('menarik')
[7]:
'arik'
[8]:
naive.stem('slhlah')
[8]:
'slh'
[9]:
naive.stem(string)
[9]:
'Benda yg SALAH ni , jgn lah debat . Yg SALAH x jadi betul . Ingat tu . Mcm mana sat kalipun org sampai sej , dan ang benda tu sa , am je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'
[10]:
naive.stem(another_string)
[10]:
'layu bodoh , dah la gay , sokong lgbt lagi , ang tak guna , http://twitter.com @kesedihan rm15'
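
To see why the naive method over-stems, here is a minimal sketch of dictionary-free affix stripping. The affix lists and the naive_stem_word helper are illustrative only, not Malaya's actual implementation, although they happen to reproduce the over-stemmed forms above.

import re

# Illustrative affix lists only; Malaya's regex patterns are more extensive.
PREFIXES = ['men', 'mem', 'me', 'ber', 'ter', 'di', 'ke', 'se', 'pe']
SUFFIXES = ['lah', 'kah', 'kan', 'nya', 'an', 'i']

def naive_stem_word(word):
    # Strip the first matching prefix and suffix without any dictionary lookup,
    # so genuine roots can be damaged (e.g. 'tarik' inside 'menarik').
    for prefix in PREFIXES:
        if re.match(r'{}.'.format(prefix), word):
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if re.search(r'.{}$'.format(suffix), word):
            word = word[:-len(suffix)]
            break
    return word

print(naive_stem_word('menarik'))        # 'arik'
print(naive_stem_word('menyerukanlah'))  # 'yerukan'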

Use Sastrawi stemmer#

Malaya also includes an interface for https://pypi.org/project/PySastrawi/, which we use for internal purposes. To use it, simply,

def sastrawi():
    """
Load stemming model using Sastrawi, this also includes lemmatization.

    Returns
    -------
    result: malaya.stem.SASTRAWI class
    """
[11]:
sastrawi = malaya.stem.sastrawi()

Stemming and lemmatization#

def stem(self, string: str):
    """
Stem a string using Sastrawi, this also includes lemmatization.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: str
    """
[12]:
sastrawi.stem('saya menyerukanlah')
[12]:
'saya seru'
[13]:
sastrawi.stem('slhlah')
[13]:
'slhlah'
[14]:
sastrawi.stem(string)
[14]:
'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'
[15]:
sastrawi.stem(another_string)
[15]:
'melayu bodoh , dah la gay , sokong lgbt lagi , memang tak guna , http://twitter.com @kesedihan rm15'
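
The Sastrawi results above can also be reproduced by calling PySastrawi directly (pip install PySastrawi); a minimal sketch assuming the StemmerFactory API documented on the PyPI page:

from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

# Standard PySastrawi usage; malaya.stem.sastrawi exposes the same kind of stem() interface.
factory = StemmerFactory()
stemmer = factory.create_stemmer()
print(stemmer.stem('saya menyerukanlah'))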

List available deep learning models#

[16]:
malaya.stem.available_deep_model()
INFO:malaya.stem:trained on 90% dataset, tested on another 10% test set, dataset at https://github.com/huseinzol05/malay-dataset/tree/master/normalization/stemmer
INFO:malaya.stem:`base` tested on non-noisy dataset, while `noisy` tested on noisy dataset.
[16]:
       Size (MB)  Quantized Size (MB)       CER       WER
base        13.6                 3.64  0.021438  0.043996
noisy       28.5                 7.30  0.021388  0.049527
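
CER and WER above are character and word error rates on the held-out 10% test set. A minimal sketch of how such edit-distance-based rates are typically computed, using a plain Levenshtein distance rather than Malaya's evaluation code:

def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two sequences.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def cer(predicted, reference):
    # Character error rate: character edits divided by reference length.
    return levenshtein(list(predicted), list(reference)) / max(len(reference), 1)

def wer(predicted, reference):
    # Word error rate: the same idea at the token level.
    return levenshtein(predicted.split(), reference.split()) / max(len(reference.split()), 1)

print(cer('saya serukan', 'saya seru'), wer('saya serukan', 'saya seru'))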

Use deep learning model#

Load the LSTM + Bahdanau Attention stemming model; this also includes lemmatization.

If you are using Tensorflow 2, make sure Tensorflow Addons is already installed,

pip install tensorflow-addons -U

Check the Tensorflow versions compatible with Tensorflow Addons at https://github.com/tensorflow/addons/releases
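
As a quick sanity check before loading the model, you can confirm Tensorflow Addons imports cleanly against your Tensorflow build; a small sketch, nothing Malaya-specific:

# Both packages expose __version__; pair them using the release table linked above.
try:
    import tensorflow as tf
    import tensorflow_addons as tfa
    print('tensorflow:', tf.__version__, '| tensorflow-addons:', tfa.__version__)
except ImportError as exc:
    print('tensorflow-addons missing or incompatible:', exc)
    print('try: pip install -U tensorflow-addons')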

def deep_model(model: str = 'base', quantized: bool = False, **kwargs):
    """
    Load LSTM + Bahdanau Attention stemming model, BPE level (YouTokenToMe 1000 vocab size).
This model also includes lemmatization.

    Parameters
    ----------
    model : str, optional (default='base')
        Model architecture supported. Allowed values:

        * ``'base'`` - trained on default dataset.
        * ``'noisy'`` - trained on default and augmentation dataset.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
Quantized model is not necessarily faster, it totally depends on the machine.

    Returns
    -------
    result: malaya.stem.DeepStemmer class
    """
[17]:
model = malaya.stem.deep_model(model = 'base')
model_noisy = malaya.stem.deep_model(model = 'noisy')
INFO:malaya_boilerplate.frozen_graph:running home/ubuntu/.cache/huggingface/hub using device /device:CPU:0
2022-09-01 21:21:32.807241: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-01 21:21:32.811084: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-09-01 21:21:32.811102: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: huseincomel-desktop
2022-09-01 21:21:32.811106: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: huseincomel-desktop
2022-09-01 21:21:32.811160: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
2022-09-01 21:21:32.811177: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 470.141.3
INFO:malaya_boilerplate.frozen_graph:running home/ubuntu/.cache/huggingface/hub using device /device:CPU:0

Load Quantized model#

To load an 8-bit quantized model, simply pass quantized = True; the default is False.

We can expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it totally depends on the machine (a quick timing sketch follows the loading cell below).

[18]:
quantized_model = malaya.stem.deep_model(model = 'base', quantized = True)
quantized_model_noisy = malaya.stem.deep_model(model = 'noisy', quantized = True)
WARNING:malaya_boilerplate.huggingface:Load quantized model will cause accuracy drop.
INFO:malaya_boilerplate.frozen_graph:running home/ubuntu/.cache/huggingface/hub using device /device:CPU:0
WARNING:malaya_boilerplate.huggingface:Load quantized model will cause accuracy drop.
INFO:malaya_boilerplate.frozen_graph:running home/ubuntu/.cache/huggingface/hub using device /device:CPU:0
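
Because quantized speed depends on the machine, it is worth timing both variants yourself; a minimal sketch using the models loaded above (the bench helper is illustrative, and timings will vary):

import time

def bench(stemmer, text, n = 5):
    # Average wall-clock time of stem() over a few runs.
    start = time.perf_counter()
    for _ in range(n):
        stemmer.stem(text)
    return (time.perf_counter() - start) / n

print('float32  :', bench(model, string))
print('quantized:', bench(quantized_model, string))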

Stemming and lemmatization#

def stem(self, string: str, beam_search: bool = True):
    """
Stem a string, this also includes lemmatization.

    Parameters
    ----------
    string : str
    beam_search : bool, (optional=True)
        If True, use beam search decoder, else use greedy decoder.

    Returns
    -------
    result: str
    """

If you want to speed up inference, set beam_search = False.

[19]:
%%time

model.stem(string)
CPU times: user 257 ms, sys: 24.3 ms, total: 282 ms
Wall time: 218 ms
[19]:
'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'
[20]:
%%time

model_noisy.stem(string)
CPU times: user 369 ms, sys: 12.9 ms, total: 382 ms
Wall time: 300 ms
[20]:
'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekali org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'
[21]:
%%time

model.stem(string, beam_search = False)
CPU times: user 110 ms, sys: 0 ns, total: 110 ms
Wall time: 70.7 ms
[21]:
'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'
[22]:
%%time

quantized_model.stem(string)
CPU times: user 197 ms, sys: 24.5 ms, total: 221 ms
Wall time: 157 ms
[22]:
'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'
[23]:
%%time

quantized_model.stem(string, beam_search = False)
CPU times: user 95.9 ms, sys: 6.02 ms, total: 102 ms
Wall time: 61.2 ms
[23]:
'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'
[24]:
model.stem(another_string)
[24]:
'melayu bodoh , dah la gay , sokong lgbt lagi , memang tak guna , http://twitter.com @kesedihan rm15'
[25]:
model_noisy.stem(another_string)
[25]:
'melayu bodoh , dah la gay , sokong lgbt lagi , memang tak guna , http://twitter.com @kesedihan rm15'
[26]:
quantized_model.stem(another_string)
[26]:
'melayu bodoh , dah la gay , sokong lgbt lagi , memang tak guna , http://twitter.com @kesedihan rm15'
[27]:
model.stem('saya menyerukanlah')
[27]:
'saya seru'
[28]:
model_noisy.stem('saya menyerukanlah')
[28]:
'saya seru'
[29]:
quantized_model.stem('saya menyerukanlah')
[29]:
'saya seru'

Sensitivity to local language structure#

Let us compare stemming results using Facebook comments.

[30]:
string1 = 'mulakn slh org boleh ,bila geng tuh kena slhkn jgk xboleh trima .. pelik , dia slhkn org bole hri2 crta sakau then bila kna bls balik xdpt jwb ,kata mcm biasa slh (parti sampah) 🤣🤣🤣 jgn mulakn dlu slhkn org kalau xboleh trima bila kna bls balik 🤣🤣🤣'
[31]:
string2 = 'berehatlh najib.. sudah2 lh tu.. jgn buat rakyat hilang kepercyaan tu pda system kehakiman negara.. klu btl x slh kenapa x dibuktikan semasa sblm rayuan.. sudah lah tu kami dh letih dengan drama korang. ok'
[32]:
model.stem(string1)
[32]:
'mulakn slh org boleh , bila geng tuh kena slhkn jgk xboleh trima . . pelik , dia slhkn org bole hri crta sakau then bila kna bls balik xdpt jwb , kata mcm biasa slh ( parti sampah ) 🤣 🤣 🤣 jgn mulakn dlu slhkn org kalau xboleh trima bila kna bls balik 🤣 🤣 🤣'
[33]:
model_noisy.stem(string1)
[33]:
'mula slh org boleh , bila geng tuh kena slh jgk xboleh trima . . pelik , dia slh org bole hri crta sakau then bila kna bls balik xdpt jwb , kata mcm biasa slh ( parti sampah ) 🤣 🤣 🤣 jgn mula dlu slh org kalau xboleh trima bila kna bls balik 🤣 🤣 🤣'
[34]:
sastrawi.stem(string1)
[34]:
'mulakn slh org boleh , bila geng tuh kena slhkn jgk xboleh trima . . pelik , dia slhkn org bole hri2 crta sakau then bila kna bls balik xdpt jwb , kata mcm biasa slh ( parti sampah ) 🤣 🤣 🤣 jgn mulakn dlu slhkn org kalau xboleh trima bila kna bls balik 🤣 🤣 🤣'
[35]:
naive.stem(string1)
[35]:
'mulakn slh org boleh , bila geng tuh na slhkn jgk xboleh trima . . lik , a slhkn org bole hri2 crta sakau then bila kna bls balik xdpt jwb , kata mcm biasa slh ( parti sampah ) 🤣 🤣 🤣 jgn mulakn dlu slhkn org kalau xboleh trima bila kna bls balik 🤣 🤣 🤣'
[36]:
model.stem(string2)
[36]:
'berehatlh najib . . sudah-l lh tu . . jgn buat rakyat hilang percya tu pda system hakim negara . . klu btl x slh kenapa x bukti masa sblm rayu . . sudah lah tu kami dh letih dengan drama korang . ok'
[37]:
model_noisy.stem(string2)
[37]:
'rehat najib . . sudahn lh tu . . jgn buat rakyat hilang kepercyaan tu pda system hakim negara . . klu btl x slh kenapa x bukti semasa sblm rayu . . sudah lah tu kami dh letih dengan drama korang . ok'
[38]:
sastrawi.stem(string2)
[38]:
'berehatlh najib . . sudah2 lh tu . . jgn buat rakyat hilang kepercyaan tu pda system hakim negara . . klu btl x slh kenapa x bukti masa sblm rayu . . sudah lah tu kami dh letih dengan drama korang . ok'
[39]:
naive.stem(string2)
[39]:
'eha najib . . sudah2 lh tu . . jgn buat rakyat hilang percya tu pda system hakim negara . . klu btl x slh napa x bukti masa sblm rayu . . sudah lah tu kami dh letih deng drama korang . ok'
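
To line the four stemmers up side by side on these local-language comments, a simple loop over the models loaded above:

# Print every stemmer's output for both Facebook comments.
stemmers = {'naive': naive, 'sastrawi': sastrawi, 'deep base': model, 'deep noisy': model_noisy}
for text in [string1, string2]:
    for name, stemmer in stemmers.items():
        print('{:<10}: {}'.format(name, stemmer.stem(text)))
    print()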