Spelling Correction

This tutorial is available as an IPython notebook at Malaya/example/spell-correction.

[1]:
%%time
import malaya
CPU times: user 5.37 s, sys: 1.03 s, total: 6.4 s
Wall time: 7.18 s
[2]:
# some text examples copied from Twitter

string1 = 'krajaan patut bagi pencen awal skt kpd warga emas supaya emosi'
string2 = 'Husein ska mkn aym dkat kampng Jawa'
string3 = 'Melayu malas ni narration dia sama je macam men are trash. True to some, false to some.'
string4 = 'Tapi tak pikir ke bahaya perpetuate myths camtu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tau pulak marah. Your kids will be victims of that too.'
string5 = 'DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as i am edging towards retirement in 4-5 years time after a career of being an Engineer, Project Manager, General Manager'
string6 = 'blh bntg dlm kls nlp sy, nnti intch'

Load probability speller

The probability speller extends Peter Norvig's spell corrector, http://norvig.com/spell-correct.html.

It is improved using algorithms from Normalization of noisy texts in Malaysian online reviews, https://www.researchgate.net/publication/287050449_Normalization_of_noisy_texts_in_Malaysian_online_reviews.

We also added custom vowel and consonant augmentation to adapt to local shortforms and typos.
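The two ingredients above can be sketched in a few lines of Python. This is an illustrative toy only: the word counts, rules, and helper names below are invented for this sketch, not Malaya's actual implementation.

```python
# Toy Norvig-style corrector plus a vowel-dropping augmentation to
# catch local shortforms such as mkn -> makan.
WORDS = {'makan': 100, 'makna': 40, 'suka': 80, 'saya': 200}
VOWELS = set('aeiou')

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def vowel_dropped(word):
    """'makan' -> 'mkn': the shortform left after dropping vowels."""
    return ''.join(c for c in word if c not in VOWELS)

def candidates(word):
    known = {w for w in edits1(word) if w in WORDS}
    # augmentation: dictionary words whose vowel-dropped form is the input
    augmented = {w for w in WORDS if vowel_dropped(w) == word}
    return (known | augmented) or {word}

def correct(word):
    # pick the candidate with the highest corpus count
    return max(candidates(word), key=lambda w: WORDS.get(w, 0))
```

With these toy counts, `correct('mkn')` returns `makan` (beating `makna` on frequency) and `correct('sya')` returns `saya` via a plain edit-distance candidate.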

def probability(sentence_piece: bool = False, **kwargs):
    """
    Train a Probability Spell Corrector.

    Parameters
    ----------
    sentence_piece: bool, optional (default=False)
        if True, reduce possible augmentation states using sentence piece.

    Returns
    -------
    result: malaya.spell.Probability class
    """
[3]:
prob_corrector = malaya.spell.probability()

To correct a word

def correct(self, word: str, **kwargs):
    """
    Most probable spelling correction for word.

    Parameters
    ----------
    word: str

    Returns
    -------
    result: str
    """
[4]:
prob_corrector.correct('sy')
[4]:
'saya'
[5]:
prob_corrector.correct('mhthir')
[5]:
'mahathir'
[6]:
prob_corrector.correct('mknn')
[6]:
'makanan'

List possible generated pool of words

def edit_candidates(self, word):
    """
    Generate candidates given a word.

    Parameters
    ----------
    word: str

    Returns
    -------
    result: List[str]
    """
[7]:
prob_corrector.edit_candidates('mhthir')
[7]:
['mahathir']
[8]:
prob_corrector.edit_candidates('smbng')
[8]:
['sembang',
 'smbg',
 'sambung',
 'simbang',
 'sembung',
 'sumbang',
 'sambong',
 'sambang',
 'sumbing',
 'sombong',
 'sembong']

As you can see, edit_candidates suggests quite a lot of candidates, and some of them, like sambang, are not actual words. To reduce that, we can use sentencepiece to check whether a candidate is a legitimate word in the Malaysian context.
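The filtering step can be approximated like this. A plain vocabulary set stands in for the sentencepiece legitimacy check here; the words below are invented for the sketch.

```python
# Toy stand-in for the sentencepiece check: Malaya asks a sentencepiece
# model whether a candidate tokenizes cleanly; a set plays that role.
LEGIT = {'sembang', 'sambung', 'sumbang', 'sombong', 'sumbing'}

def filter_candidates(candidates, vocab=LEGIT):
    """Keep only candidates that pass the legitimacy check."""
    kept = [c for c in candidates if c in vocab]
    return kept or list(candidates)  # fall back if everything was dropped
```

`filter_candidates(['sembang', 'sambang', 'simbang', 'sambung'])` drops the non-words and keeps `['sembang', 'sambung']`.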

[10]:
prob_corrector_sp = malaya.spell.probability(sentence_piece = True)
prob_corrector_sp.edit_candidates('smbng')
[10]:
['sumbing',
 'sambung',
 'smbg',
 'sembung',
 'sombong',
 'sembong',
 'sembang',
 'sumbang',
 'sambong']

So how does the model know which word to pick? The one with the highest count in the corpus!
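That picking rule is just an argmax over corpus frequencies. A minimal sketch, with invented counts:

```python
# Invented corpus counts; the rule is simply an argmax over frequency.
COUNTS = {'sembang': 5000, 'sambung': 12000, 'sumbang': 3000}

def pick(candidates, counts=COUNTS):
    """Pick the candidate the corpus saw most often."""
    return max(candidates, key=lambda w: counts.get(w, 0))
```

Here `pick(['sembang', 'sambung', 'sumbang'])` returns `sambung`, the most frequent candidate.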

To correct a sentence

def correct_text(self, text: str):
    """
    Correct all the words within a text, returning the corrected text.

    Parameters
    ----------
    text: str

    Returns
    -------
    result: str
    """
[9]:
prob_corrector.correct_text(string1)
[9]:
'kerajaan patut bagi pencen awal sakit kepada warga emas supaya emosi'
[10]:
prob_corrector.correct_text(string2)
[10]:
'Husein suka makan ayam dekat kampung Jawa'
[11]:
prob_corrector.correct_text(string3)
[11]:
'Melayu malas ini narration dia sama sahaja macam men are trash. True to some, false to some.'
[12]:
prob_corrector.correct_text(string4)
[12]:
'Tapi tak fikir ke bahaya perpetuate myths macam itu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tahu pula marah. Your kids will be victims of that too.'
[13]:
prob_corrector.correct_text(string5)
[13]:
'DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as saya am edging towards retirement in 4-5 years time after a career of being an Engineer, Project Manager, General Manager'
[14]:
prob_corrector.correct_text(string6)
[14]:
'boleh bintang dalam kelas nlp saya, nanti intch'

Load JamSpell speller

JamSpell uses the Norvig approach combined with word n-grams.
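A rough sketch of the idea: rank edit-distance candidates by how well they fit the neighbouring words under a bigram model. The counts below are invented, not JamSpell's actual model.

```python
# Invented bigram counts standing in for JamSpell's n-gram model.
BIGRAMS = {
    ('saya', 'suka'): 50,
    ('suka', 'makan'): 40,
}

def context_score(candidate, prev_word, next_word, bigrams=BIGRAMS):
    """How well a candidate fits between its neighbouring words."""
    return (bigrams.get((prev_word, candidate), 0)
            + bigrams.get((candidate, next_word), 0))

def correct_in_context(prev_word, next_word, candidates):
    # pick the candidate with the best contextual fit
    return max(candidates, key=lambda c: context_score(c, prev_word, next_word))
```

For 'saya suke makan', `correct_in_context('saya', 'makan', ['suke', 'suka'])` prefers `suka` because both surrounding bigrams support it.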

Before you are able to use this spelling correction, you need to install jamspell,

For mac,

wget http://prdownloads.sourceforge.net/swig/swig-3.0.12.tar.gz
tar -zxf swig-3.0.12.tar.gz
./swig-3.0.12/configure && make && make install
pip3 install jamspell

For debian / ubuntu,

apt install swig3
pip3 install jamspell
def jamspell(model: str = 'wiki+news', **kwargs):
    """
    Load a jamspell Spell Corrector for Malay.

    Parameters
    ----------
    model: str, optional (default='wiki+news')
        Supported models. Allowed values:

        * ``'wiki+news'`` - Wikipedia + News, 337MB.
        * ``'wiki'`` - Wikipedia, 148MB.
        * ``'news'`` - News, 215MB.

    Returns
    -------
    result: malaya.spell.JamSpell class
    """
[4]:
model = malaya.spell.jamspell(model = 'wiki')

To correct a word

def correct(self, word: str, string: str, index: int = -1):
    """
    Correct a word within a text, returning the corrected word.

    Parameters
    ----------
    word: str
    string: str
        Entire string, `word` must be a word inside `string`.
    index: int, optional(default=-1)
        index of word in the string, if -1, will try to use `string.index(word)`.

    Returns
    -------
    result: str
    """
[5]:
model.correct('suke', 'saya suke makan iyom')
[5]:
'suka'

List possible generated pool of words

def edit_candidates(self, word: str, string: str, index: int = -1):
    """
    Generate candidates given a word.

    Parameters
    ----------
    word: str
    string: str
        Entire string, `word` must be a word inside `string`.
    index: int, optional(default=-1)
        index of word in the string, if -1, will try to use `string.index(word)`.

    Returns
    -------
    result: List[str]
    """
[15]:
model.edit_candidates('ayem', 'saya suke makan ayem')
[15]:
('ayem',
 'ayam',
 'ayer',
 'aye',
 'asem',
 'yem',
 'adem',
 'alem',
 'aem',
 'ayim',
 'oyem',
 'ayew',
 'azem',
 'ajem',
 'ayiem')

To correct a sentence

def correct_text(self, text: str):
    """
    Correct all the words within a text, returning the corrected text.

    Parameters
    ----------
    text: str

    Returns
    -------
    result: str
    """
[17]:
model.correct_text('saya suke makan ayom')
[17]:
'saya suka makan ayam'

Load Spylls speller

Spylls is Hunspell ported to Python.

Before you are able to use this spelling correction, you need to install spylls,

pip3 install spylls
def spylls(model: str = 'libreoffice-pejam', **kwargs):
    """
    Load a spylls Spell Corrector for Malay.

    Parameters
    ----------
    model : str, optional (default='libreoffice-pejam')
        Model spelling correction supported. Allowed values:

        * ``'libreoffice-pejam'`` - from LibreOffice pEJAm, https://extensions.libreoffice.org/en/extensions/show/3868

    Returns
    -------
    result: malaya.spell.Spylls class
    """
[17]:
model = malaya.spell.spylls()

To correct a word

def correct(self, word: str):
    """
    Correct a word within a text, returning the corrected word.

    Parameters
    ----------
    word: str

    Returns
    -------
    result: str
    """
[18]:
model.correct('sy')
[18]:
'st'
[19]:
model.correct('mhthir')
[19]:
'Mahathir'
[20]:
model.correct('mknn')
[20]:
'knn'

List possible generated pool of words

def edit_candidates(self, word: str):
    """
    Generate candidates given a word.

    Parameters
    ----------
    word: str

    Returns
    -------
    result: List[str]
    """
    return list(self._dictionary.suggest(word))
[21]:
model.edit_candidates('mhthir')
[21]:
['Mahathir']
[22]:
model.edit_candidates('smbng')
[22]:
['sbng', 'smbang', 'jmbng', 'cmbng']

To correct a sentence

def correct_text(self, text: str):
    """
    Correct all the words within a text, returning the corrected text.

    Parameters
    ----------
    text: str

    Returns
    -------
    result: str
    """
[23]:
model.correct_text(string1)
[23]:
'kerajaan putat baji pencen awal tks dpk warga enas supaya emisi'
[24]:
model.correct_text(string2)
[24]:
'Husein sak mkn aum tkad kampang Jawa'
[25]:
model.correct_text(string3)
[25]:
'Melayu malas in paration ida asma ja macam man ara tras. True tu som, falsafah tu som.'
[26]:
model.correct_text(string4)
[26]:
'Tapi kat fikir ka bahaya terperbuat smythea catu. Nanti kalau ada giring diskriminatif desiliter our food identification becus of our reca tua pukal ramah. Your kias wila ba victoria of rhat oto.'

Load Encoder transformer speller

This spelling correction is transformer-based, an improved version of malaya.spell.probability. The problem with malaya.spell.probability is that it naively picks the word with the highest probability based on public sentences (wiki, news and social media) without understanding the actual context, for example,

string = 'krajaan patut bagi pencen awal skt kpd warga emas supaya emosi'
prob_corrector = malaya.spell.probability()
prob_corrector.correct_text(string)
-> 'kerajaan patut bagi pencen awal sakit kepada warga emas supaya emosi'

It should have replaced skt with sikit, a common word people use on social media to give a little bit of nuance to pencen. So, to fix that, we can use a Transformer model! Right now the transformer speller supports ``BERT``, ``ALBERT`` and ``ELECTRA`` only.
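The context-aware selection can be sketched like this. A lookup table of mock scores stands in for the masked language model; in Malaya, BERT/ALBERT/ELECTRA score each candidate inside the full sentence. Everything below is invented for illustration.

```python
# Mock language-model scores: the real model scores each candidate in
# the full sentence; this table is invented for the sketch.
CONTEXT_SCORES = {
    ('bagi pencen awal', 'sikit'): 0.9,  # 'a bit earlier' fits the context
    ('bagi pencen awal', 'sakit'): 0.2,  # 'sick' does not
}

def pick_with_context(context, candidates, scores=CONTEXT_SCORES):
    """Choose the candidate the (mock) language model prefers in context."""
    return max(candidates, key=lambda c: scores.get((context, c), 0.0))
```

With context taken into account, `pick_with_context('bagi pencen awal', ['sakit', 'sikit'])` returns `sikit` instead of the globally more frequent `sakit`.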

def transformer_encoder(model, sentence_piece: bool = False, **kwargs):
    """
    Load a Transformer Encoder Spell Corrector. Right now only supported BERT, ALBERT and ELECTRA.

    Parameters
    ----------
    model: object
        transformer interface object, from malaya.transformer.load.
    sentence_piece: bool, optional (default=False)
        if True, reduce possible augmentation states using sentence piece.

    Returns
    -------
    result: malaya.spell.Transformer class
    """
[3]:
model = malaya.transformer.load(model = 'electra')
transformer_corrector = malaya.spell.transformer_encoder(model, sentence_piece = True)
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:56: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/modeling.py:240: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.Dense instead.
WARNING:tensorflow:From /usr/local/lib/python3.7/site-packages/tensorflow_core/python/layers/core.py:187: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:79: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:93: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/sampling.py:26: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:115: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.random.categorical` instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:118: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:119: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:121: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:122: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:128: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:130: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/electra-model/base/electra-base/model.ckpt

To correct a sentence

def correct_text(self, text: str, batch_size: int = 20):
    """
    Correct all the words within a text, returning the corrected text.

    Parameters
    ----------
    text: str
    batch_size: int, optional(default=20)
        batch size to insert into model.

    Returns
    -------
    result: str
    """
[4]:
transformer_corrector.correct_text(string1)
[4]:
'kerajaan patut bagi pencen awal sikit kepada warga emas supaya emosi'

Perfect! But again, the transformer model is very expensive! You can compare the wall time with the probability-based speller.

[6]:
%%time
transformer_corrector.correct_text(string1)
CPU times: user 21.8 s, sys: 1.19 s, total: 23 s
Wall time: 5.15 s
[6]:
'kerajaan patut bagi pencen awal sikit kepada warga emas supaya emosi'
[7]:
%%time
prob_corrector.correct_text(string1)
CPU times: user 108 ms, sys: 3.34 ms, total: 112 ms
Wall time: 112 ms
[7]:
'kerajaan patut bagi pencen awal sakit kepada warga emas supaya emosi'

Load symspell speller

This spelling correction is an improved version of symspell, adapted to our local shortforms / typos. Before you are able to use this spelling correction, you need to install symspellpy,

pip install symspellpy
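The core SymSpell trick is to precompute delete-variants of every dictionary word, so lookup becomes hashing rather than generating edits against the whole alphabet. A simplified sketch with invented counts (real SymSpell also verifies the true edit distance of each match):

```python
from itertools import combinations

# Invented dictionary counts for the sketch.
WORDS = {'bintang': 9000, 'bentang': 4000, 'batang': 7000}

def deletes(word, max_distance=2):
    """All strings formed by deleting up to max_distance characters."""
    out = {word}
    for d in range(1, min(max_distance, len(word)) + 1):
        for keep in combinations(range(len(word)), len(word) - d):
            out.add(''.join(word[i] for i in keep))
    return out

# Index every delete-variant of every dictionary word back to its source.
INDEX = {}
for w in WORDS:
    for v in deletes(w):
        INDEX.setdefault(v, set()).add(w)

def lookup(word, max_distance=2):
    """Match deletes of the query against the precomputed delete index."""
    cands = set()
    for v in deletes(word, max_distance):
        cands |= INDEX.get(v, set())
    return max(cands, key=lambda c: WORDS[c]) if cands else word
```

`lookup('bntng')` finds `bintang` because deleting the two vowels of `bintang` yields `bntng`, which is already in the index.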
def symspell(
    max_edit_distance_dictionary: int = 2,
    prefix_length: int = 7,
    term_index: int = 0,
    count_index: int = 1,
    top_k: int = 10,
    **kwargs
):
    """
    Train a symspell Spell Corrector.

    Returns
    -------
    result: malaya.spell.Symspell class
    """
[11]:
symspell_corrector = malaya.spell.symspell()

To correct a word

def correct(self, word: str, **kwargs):
    """
    Most probable spelling correction for word.

    Parameters
    ----------
    word: str

    Returns
    -------
    result: str
    """
[12]:
symspell_corrector.correct('bntng')
[12]:
'bintang'
[13]:
symspell_corrector.correct('kerajaan')
[13]:
'kerajaan'
[14]:
symspell_corrector.correct('mknn')
[14]:
'makanan'

List possible generated words

def edit_step(self, word):
    """
    Generate candidates given a word.

    Parameters
    ----------
    word: str

    Returns
    -------
    result: List[str]
    """
[15]:
symspell_corrector.edit_step('mrh')
[15]:
{'marah': 12684.0,
 'merah': 21448.5,
 'arah': 15066.5,
 'darah': 10003.0,
 'mara': 7504.5,
 'malah': 7450.0,
 'zarah': 3753.5,
 'murah': 3575.5,
 'barah': 2707.5,
 'march': 2540.5,
 'martha': 390.0,
 'marsha': 389.0,
 'maratha': 88.5,
 'marcha': 22.5,
 'karaha': 13.5,
 'maraba': 13.5,
 'varaha': 11.5,
 'marana': 4.5,
 'marama': 4.5}

To correct a sentence

def correct_text(self, text: str):
    """
    Correct all the words within a text, returning the corrected text.

    Parameters
    ----------
    text: str

    Returns
    -------
    result: str
    """
[16]:
symspell_corrector.correct_text(string1)
[16]:
'kerajaan patut bagi pencen awal saat kepada warga emas supaya emosi'
[17]:
symspell_corrector.correct_text(string2)
[17]:
'Husein suka makan ayam dapat kampung Jawa'
[18]:
symspell_corrector.correct_text(string3)
[18]:
'Melayu malas ni narration dia sama sahaja macam men are trash. True to some, false to some.'
[19]:
symspell_corrector.correct_text(string4)
[19]:
'Tapi tak fikir ke bahaya perpetuate maathai macam itu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tahu pula marah. Your kids will be victims of that too.'
[20]:
symspell_corrector.correct_text(string5)
[20]:
'DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as saya am edging towards retirement in 4-5 aras time after a career of being an Engineer, Project Manager, General Manager'
[21]:
symspell_corrector.correct_text(string6)
[21]:
'boleh bintang dalam kelas malaya saya, nanti mintalah'

List available Transformer models

We use custom spelling augmentation,

  1. replace_similar_consonants

  • mereka -> nereka

  2. replace_similar_vowels

  • suka -> sika

  3. socialmedia_form

  • suka -> ska

  4. vowel_alternate

  • singapore -> sngpore

  • kampung -> kmpng
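The four augmentations above can be sketched as simple string transforms. These are loose approximations: the character mappings are invented, and Malaya's actual rules are richer (vowel_alternate especially).

```python
# Invented mappings for the sketch; real rules cover many more pairs.
SIMILAR_CONSONANTS = {'m': 'n'}
SIMILAR_VOWELS = {'u': 'i'}
VOWELS = set('aeiou')

def replace_similar_consonants(word):
    """mereka -> nereka: swap a similar-sounding leading consonant."""
    return SIMILAR_CONSONANTS.get(word[0], word[0]) + word[1:]

def replace_similar_vowels(word):
    """suka -> sika: swap similar-sounding vowels."""
    return ''.join(SIMILAR_VOWELS.get(c, c) for c in word)

def socialmedia_form(word):
    """suka -> ska: drop vowels except a trailing one."""
    return ''.join(c for i, c in enumerate(word)
                   if c not in VOWELS or i == len(word) - 1)

def vowel_alternate(word):
    """kampung -> kmpng: drop vowels (one possible reading of the rule)."""
    return ''.join(c for c in word if c not in VOWELS)
```

Applying these to the examples above reproduces `nereka`, `sika`, `ska` and `kmpng`.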

[4]:
malaya.spell.available_transformer()
INFO:root:tested on 10k test set.
[4]:
                Size (MB)  Quantized Size (MB)  WER       Suggested length
small-t5        355.6      195.0                0.015625  256.0
tiny-t5         208.0      103.0                0.023712  256.0
super-tiny-t5   81.8       27.1                 0.038001  256.0

Load Transformer model

def transformer(model: str = 'small-t5', quantized: bool = False, **kwargs):
    """
    Load a Transformer Spell Corrector.

    Parameters
    ----------
    model : str, optional (default='small-t5')
        Model architecture supported. Allowed values:

        * ``'small-t5'`` - T5 SMALL parameters.
        * ``'tiny-t5'`` - T5 TINY parameters.
        * ``'super-tiny-t5'`` - T5 SUPER TINY parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model is not necessarily faster; it totally depends on the machine.

    Returns
    -------
    result: malaya.model.t5.Spell class
    """
[3]:
t5 = malaya.spell.transformer(model = 'tiny-t5')

Predict using greedy decoder

def greedy_decoder(self, strings: List[str]):
    """
    spelling correction for strings.

    Parameters
    ----------
    strings: List[str]

    Returns
    -------
    result: List[str]
    """
[6]:
t5.greedy_decoder([string1])
[6]:
['kerajaan patut bagi pencen awal skt kpd warga emas supaya emosi']
[8]:
t5.greedy_decoder([string2])
[8]:
['Husein suka makan ayam dekat kampung Jawa']
[7]:
t5.greedy_decoder([string3])
[7]:
['Melayu malas ni narration dia sama je macam men are trash . True to some , false to some .']