Stemmer and Lemmatization

This tutorial is available as an IPython notebook at Malaya/example/stemmer.

This module was trained only on standard language structure, so it is not safe to use on local language structure.

[1]:
%%time
import malaya
CPU times: user 4.81 s, sys: 652 ms, total: 5.47 s
Wall time: 4.44 s
[2]:
string = 'Benda yg SALAH ni, jgn lah didebatkan. Yg SALAH xkan jadi betul. Ingat tu. Mcm mana kesat sekalipun org sampaikan mesej, dan memang benda tu salah, diam je. Xyah nk tunjuk kau open sangat nk tegur cara org lain berdakwah'
another_string = 'melayu bodoh, dah la gay, sokong lgbt lagi, memang tak guna, http://twitter.com'

Use deep learning model

Load the LSTM + Bahdanau Attention stemming model; it also performs lemmatization.

def deep_model(quantized: bool = False, **kwargs):
    """
    Load LSTM + Bahdanau Attention stemming model; this also includes lemmatization.
    Original size 41.6MB, quantized size 10.6MB.

    Parameters
    ----------
    quantized : bool, optional (default=False)
        If True, will load the 8-bit quantized model.
        The quantized model is not necessarily faster; this depends entirely on the machine.

    Returns
    -------
    result: malaya.stem.DEEP_STEMMER class
    """
[9]:
model = malaya.stem.deep_model()

Load Quantized model

To load the 8-bit quantized model, simply pass quantized = True; the default is False.

We can expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; this depends entirely on the machine.

[8]:
quantized_model = malaya.stem.deep_model(quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
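
As a generic illustration of why an 8-bit model can lose a little accuracy, the sketch below rounds a few float weights to signed 8-bit integers with one shared scale and then restores them. This is an assumption-laden toy: the symmetric scheme, the function names `quantize` and `dequantize`, and the sample weights are all hypothetical and are not taken from Malaya, whose actual quantization procedure is not described here. Storing 8 bits instead of 32 per weight is also consistent with the roughly 4x size difference quoted in the docstring (41.6MB vs 10.6MB).

```python
def quantize(weights, bits=8):
    """Map floats to signed integers using one shared scale (symmetric scheme)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    # Fall back to 1.0 so an all-zero weight list does not divide by zero.
    scale = max(abs(w) for w in weights) / qmax or 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [v * scale for v in q]


# Hypothetical weights, chosen only for the demonstration.
weights = [0.8131, -0.2417, 0.0042, -1.27]
q, scale = quantize(weights)
restored = dequantize(q, scale)

# Rounding error per weight; each is at most half a quantization step.
errors = [abs(a - b) for a, b in zip(weights, restored)]
print(q, scale, max(errors))
```

Each restored weight differs from the original by at most half a quantization step, which is the kind of small perturbation behind the warning above.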

Stem and lemmatization

def stem(self, string: str, beam_search: bool = True):
    """
    Stem a string; this also includes lemmatization.

    Parameters
    ----------
    string : str
    beam_search : bool, optional (default=True)
        If True, use beam search decoder, else use greedy decoder.

    Returns
    -------
    result: str
    """

To speed up inference, set beam_search = False, which uses the greedy decoder instead of the beam search decoder.

[4]:
%%time

model.stem(string)
CPU times: user 1.22 s, sys: 305 ms, total: 1.52 s
Wall time: 540 ms
[4]:
'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'
[5]:
%%time

model.stem(string, beam_search = False)
CPU times: user 285 ms, sys: 102 ms, total: 387 ms
Wall time: 289 ms
[5]:
'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'
[6]:
%%time

quantized_model.stem(string)
CPU times: user 1.29 s, sys: 230 ms, total: 1.52 s
Wall time: 573 ms
[6]:
'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'
[7]:
%%time

quantized_model.stem(string, beam_search = False)
CPU times: user 331 ms, sys: 105 ms, total: 436 ms
Wall time: 329 ms
[7]:
'Benda yg SALAH ni , jgn lah debat . Yg SALAH xkan jadi betul . Ingat tu . Mcm mana kesat sekalipun org sampai mesej , dan memang benda tu salah , diam je . Xyah nk tunjuk kau open sangat nk tegur cara org lain dakwah'
[8]:
model.stem(another_string)
[8]:
'layu bodoh , dah la gay , sokong lgbt lagi , memang tak guna , http://twitter.com'
[9]:
quantized_model.stem(another_string)
[9]:
'layu bodoh , dah la gay , sokong lgbt lagi , memang tak guna , http://twitter.com'
[11]:
model.stem('saya menyerukanlah')
[11]:
'saya seru'
[10]:
quantized_model.stem('saya menyerukanlah')
[10]:
'saya seru'
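
The `beam_search` flag used in this section switches between two decoding strategies. The toy sketch below shows the difference on a hand-written probability table: greedy decoding commits to the single best token at each step, while beam search keeps several partial sequences alive and can end up with a more probable result at the cost of scoring more candidates. The table `NEXT`, the function names, and the beam width are hypothetical and chosen for illustration; this is not Malaya's decoder.

```python
import math

# Hypothetical next-token probabilities, keyed by the prefix decoded so far.
# A real model would compute these with a neural network.
NEXT = {
    (): {'a': 0.6, 'b': 0.4},
    ('a',): {'x': 0.55, 'y': 0.45},
    ('b',): {'x': 0.9, 'y': 0.1},
}


def greedy_decode(steps=2):
    """Pick the single most probable token at every step."""
    seq, logp = (), 0.0
    for _ in range(steps):
        token, p = max(NEXT[seq].items(), key=lambda kv: kv[1])
        seq, logp = seq + (token,), logp + math.log(p)
    return seq, logp


def beam_decode(steps=2, beam_width=2):
    """Keep the `beam_width` best partial sequences at every step."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = [
            (seq + (token,), logp + math.log(p))
            for seq, logp in beams
            for token, p in NEXT[seq].items()
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]


greedy_seq, greedy_logp = greedy_decode()
beam_seq, beam_logp = beam_decode()
print(greedy_seq, math.exp(greedy_logp))
print(beam_seq, math.exp(beam_logp))
```

In this table greedy decoding starts with 'a' because it looks best locally, whereas beam search also explores 'b' and finds a sequence with higher overall probability. Beam search scores more candidates per step, which is in line with the longer wall times measured for `beam_search = True` above.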

Use Sastrawi stemmer

Malaya also includes an interface to the Sastrawi stemmer, which we use for internal purposes. To use it:

def sastrawi():
    """
    Load stemming model using Sastrawi; this also includes lemmatization.

    Returns
    -------
    result: malaya.stem.SASTRAWI class
    """
[3]:
sastrawi = malaya.stem.sastrawi()
[4]:
sastrawi.stem('saya menyerukanlah')
[4]:
'saya seru'
[5]:
sastrawi.stem('menarik')
[5]:
'tarik'
[6]:
sastrawi.stem(another_string)
[6]:
'melayu bodoh dah la gay sokong lgbt lagi memang tak guna http twitter com'

However, it is not able to preserve tokens such as URLs, hashtags, money, datetimes and user mentions; the URL in the output above was split into 'http twitter com'.
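
One possible workaround, sketched under stated assumptions, is to stem token by token and copy protected tokens through unchanged. The helper `stem_preserving`, the `PROTECTED` pattern, and the stand-in `fake_stem` are hypothetical names introduced for this example and are not part of Malaya or Sastrawi. The pattern covers only URLs, hashtags and user mentions; money and datetime expressions would need their own rules.

```python
import re

# Tokens to keep verbatim: URLs, hashtags and user mentions.
# Money and datetime expressions are not covered by this sketch.
PROTECTED = re.compile(r'(https?://\S+|#\w+|@\w+)')


def stem_preserving(text, stem_fn):
    """Stem whitespace-separated tokens, copying protected tokens unchanged."""
    out = []
    for token in text.split():
        if PROTECTED.fullmatch(token):
            out.append(token)
        else:
            stemmed = stem_fn(token).strip()
            if stemmed:  # drop tokens that stem to nothing, e.g. bare punctuation
                out.append(stemmed)
    return ' '.join(out)


# Stand-in for a punctuation-stripping stemmer, used only to make this runnable.
def fake_stem(token):
    return re.sub(r'[^a-z]+', ' ', token.lower())


result = stem_preserving('tak guna, http://twitter.com #topik @user', fake_stem)
print(result)
```

With the object loaded above, the call would presumably look like `stem_preserving(another_string, sastrawi.stem)`. Note that this calls the stemmer once per token, so it trades speed for keeping those tokens intact.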

Use Naive stemmer

This stemmer simply applies regex patterns to strip word beginnings and endings. It is not able to lemmatize.

def naive():
    """
    Load a stemming model that naively strips word beginnings and endings using regex patterns.

    Returns
    -------
    result : malaya.stem.NAIVE class
    """
[7]:
naive = malaya.stem.naive()
[8]:
naive.stem('saya menyerukanlah')
[8]:
'saya yerukan'
[9]:
naive.stem('menarik')
[9]:
'arik'
[10]:
naive.stem(another_string)
[10]:
'layu bodoh , dah la gay , sokong lgbt lagi , ang tak guna , http://twitter.com'
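
Outputs such as 'arik' and 'ang' are what one would expect from stripping affixes blindly, without a dictionary or any knowledge of root words. The sketch below is a minimal regex stripper written to illustrate that behaviour. The affix lists and the function names are assumptions made for this example; they are not Malaya's actual patterns, and the sketch does not handle punctuation.

```python
import re

# Hypothetical affix lists, for illustration only.
PREFIX = re.compile(r'^(meng|mem|men|me|ber|ter|di)')
SUFFIX = re.compile(r'(lah|kan|nya|an)$')


def naive_stem_word(word):
    """Strip at most one prefix and one suffix, with no root-word check."""
    stripped = PREFIX.sub('', word, count=1)
    stripped = SUFFIX.sub('', stripped, count=1)
    # Keep the original word if stripping would leave nothing.
    return stripped or word


def naive_stem(text):
    """Apply the naive stripper to every whitespace-separated word."""
    return ' '.join(naive_stem_word(w) for w in text.split())


print(naive_stem('saya menyerukanlah'))
print(naive_stem('menarik'))
```

Because nothing checks whether the remainder is a real root word, 'menarik' loses 'men' even though the root is 'tarik'. A dictionary-backed stemmer such as Sastrawi, or the deep model, is needed to recover the correct form.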