Spelling Correction using Transformer#
This tutorial is available as an IPython notebook at Malaya/example/spelling-correction-transformer.
[1]:
import logging
logging.basicConfig(level=logging.INFO)
[2]:
import malaya
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
[3]:
# some text examples copied from Twitter
string1 = 'krajaan patut bagi pencen awal skt kpd warga emas supaya emosi'
string2 = 'Husein ska mkn aym dkat kampng Jawa'
string3 = 'Melayu malas ni narration dia sama je macam men are trash. True to some, false to some.'
string4 = 'Tapi tak pikir ke bahaya perpetuate myths camtu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tau pulak marah. Your kids will be victims of that too.'
string5 = 'DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as i am edging towards retirement in 4-5 years time after a career of being an Engineer, Project Manager, General Manager'
string6 = 'blh bntg dlm kls nlp sy, nnti intch'
string7 = 'mulakn slh org boleh ,bila geng tuh kena slhkn jgk xboleh trima .. pelik'
List available Transformer models#
We use custom spelling augmentation:

- replace_similar_consonants: mereka -> nereka
- replace_similar_vowels: suka -> sika
- socialmedia_form: suka -> ska
- vowel_alternate: singapore -> sngpore, kampung -> kmpng
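As a rough illustration, the vowel_alternate augmentation above can be sketched in plain Python. This is a hypothetical simplification that drops every interior vowel; Malaya's actual augmentation is more selective (it keeps some vowels, as in singapore -> sngpore):

```python
VOWELS = set('aeiou')

def vowel_alternate(word: str) -> str:
    # Drop interior vowels while keeping the first and last characters,
    # so the word stays recognisable: 'kampung' -> 'kmpng'.
    if len(word) <= 2:
        return word
    inner = ''.join(ch for ch in word[1:-1] if ch.lower() not in VOWELS)
    return word[0] + inner + word[-1]

print(vowel_alternate('kampung'))  # kmpng
```

Generating noisy/clean pairs with transforms like this is what produces the 10k synthetic test set the models below are evaluated on.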
[4]:
malaya.spelling_correction.transformer.available_transformer()
INFO:malaya.spelling_correction.transformer:tested on 10k generated dataset at https://github.com/huseinzol05/malaya/tree/master/session/spelling-correction/t5
[4]:
| Model | Size (MB) | Quantized Size (MB) | WER | Suggested length |
|---|---|---|---|---|
| small-t5 | 355.6 | 195.0 | 0.015625 | 256.0 |
| tiny-t5 | 208.0 | 103.0 | 0.023712 | 256.0 |
| super-tiny-t5 | 81.8 | 27.1 | 0.038001 | 256.0 |
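The WER column is the word error rate on the generated test set: the word-level edit distance between the corrected output and the reference, divided by the number of reference words. As a reminder of how that metric works, here is a minimal self-contained sketch (not Malaya's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by
    the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / len(ref)

print(wer('suka makan ayam', 'ska makan ayam'))  # one substitution in three words, ~0.33
```

A WER of 0.015625 for small-t5 therefore means roughly 1.6 words wrong per 100 reference words.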
Load Transformer model#
def transformer(model: str = 'small-t5', quantized: bool = False, **kwargs):
"""
Load a Transformer Spell Corrector.
Parameters
----------
model : str, optional (default='small-t5')
Model architecture supported. Allowed values:
* ``'small-t5'`` - T5 SMALL parameters.
* ``'tiny-t5'`` - T5 TINY parameters.
* ``'super-tiny-t5'`` - T5 SUPER TINY parameters.
quantized : bool, optional (default=False)
if True, will load 8-bit quantized model.
A quantized model is not necessarily faster; speed depends entirely on the machine.
Returns
-------
result: malaya.model.t5.Spell class
"""
[13]:
t5 = malaya.spelling_correction.transformer.transformer(model = 'tiny-t5')
Predict using greedy decoder#
def greedy_decoder(self, strings: List[str]):
"""
Spelling correction for a list of strings.
Parameters
----------
strings: List[str]
Returns
-------
result: List[str]
"""
[10]:
t5.greedy_decoder([string1])
[10]:
['kerajaan patut bagi pencen awal skt kpd warga emas supaya emosi']
[11]:
t5.greedy_decoder([string2])
[11]:
['Husein suka makan ayam dekat kampung Jawa']
[12]:
t5.greedy_decoder([string3])
[12]:
['Melayu malas ni narration dia sama je macam men are trash . True to some , false to some .']
[14]:
t5.greedy_decoder([string7])
[14]:
['mulakan slh orang boleh , bilah geng tuh kena salahkan juga xboleh terima . . pelik']