Compare LM on Spelling Correction
This tutorial is available as an IPython notebook at Malaya/example/compare-lm-spelling-correction.
Malaya provides three kinds of language model (LM):
KenLM
GPT2
Masked LM
This tutorial compares the spelling correction results obtained with each of them.
[1]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'
[2]:
import logging
logging.basicConfig(level=logging.INFO)
[3]:
import malaya
/home/husein/dev/malaya/malaya/tokenizer.py:202: FutureWarning: Possible nested set at position 3361
self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/husein/dev/malaya/malaya/tokenizer.py:202: FutureWarning: Possible nested set at position 3879
self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
[4]:
# some text examples copied from Twitter
string1 = 'krajaan patut bagi pencen awal skt kpd warga emas supaya emosi'
string2 = 'Husein ska mkn aym dkat kampng Jawa'
string3 = 'Melayu malas ni narration dia sama je macam men are trash. True to some, false to some.'
string4 = 'Tapi tak pikir ke bahaya perpetuate myths camtu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tau pulak marah. Your kids will be victims of that too.'
string5 = 'DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as i am edging towards retirement in 4-5 years time after a career of being an Engineer, Project Manager, General Manager'
string6 = 'blh bntg dlm kls nlp sy, nnti intch'
string7 = 'mulakn slh org boleh ,bila geng tuh kena slhkn jgk xboleh trima .. pelik'
Load probability model#
def load(
    language_model=None,
    sentence_piece: bool = False,
    stemmer=None,
    **kwargs,
):
    """
    Load a Probability Spell Corrector.

    Parameters
    ----------
    language_model: Callable, optional (default=None)
        If not None, must be an object with a `score` method.
    sentence_piece: bool, optional (default=False)
        If True, reduce possible augmentation states using sentence piece.
    stemmer: Callable, optional (default=None)
        A Callable object, must have a `stem_word` method.

    Returns
    -------
    result: model
        List of model classes:

        * if passed `language_model` will return `malaya.spelling_correction.probability.ProbabilityLM`.
        * else will return `malaya.spelling_correction.probability.Probability`.
    """
[5]:
kenlm = malaya.language_model.kenlm()
kenlm
[5]:
<Model from b'model.klm'>
[6]:
malaya.language_model.available_mlm()
[6]:
Model | Size (MB) |
---|---|
malay-huggingface/bert-base-bahasa-cased | 310.0 |
malay-huggingface/bert-tiny-bahasa-cased | 66.1 |
malay-huggingface/albert-base-bahasa-cased | 45.9 |
malay-huggingface/albert-tiny-bahasa-cased | 22.6 |
mesolitica/roberta-base-bahasa-cased | 443.0 |
mesolitica/roberta-tiny-bahasa-cased | 66.1 |
[10]:
bert_base = malaya.language_model.mlm(model = 'malay-huggingface/bert-base-bahasa-cased')
bert_tiny = malaya.language_model.mlm(model = 'malay-huggingface/bert-tiny-bahasa-cased')
albert_base = malaya.language_model.mlm(model = 'malay-huggingface/albert-base-bahasa-cased')
albert_tiny = malaya.language_model.mlm(model = 'malay-huggingface/albert-tiny-bahasa-cased')
roberta_base = malaya.language_model.mlm(model = 'mesolitica/roberta-base-bahasa-cased')
[7]:
malaya.language_model.available_gpt2()
[7]:
Model | Size (MB) |
---|---|
mesolitica/gpt2-117m-bahasa-cased | 454 |
[8]:
gpt2 = malaya.language_model.gpt2(model = 'mesolitica/gpt2-117m-bahasa-cased')
[20]:
model_kenlm = malaya.spelling_correction.probability.load(language_model = kenlm)
INFO:malaya_boilerplate.huggingface:downloading frozen huseinzol05/v27-preprocessing/bm_1grams.json
[21]:
model_bert_base = malaya.spelling_correction.probability.load(language_model = bert_base)
INFO:malaya_boilerplate.huggingface:downloading frozen huseinzol05/v27-preprocessing/bm_1grams.json
[22]:
model_bert_tiny = malaya.spelling_correction.probability.load(language_model = bert_tiny)
INFO:malaya_boilerplate.huggingface:downloading frozen huseinzol05/v27-preprocessing/bm_1grams.json
[23]:
model_albert_base = malaya.spelling_correction.probability.load(language_model = albert_base)
INFO:malaya_boilerplate.huggingface:downloading frozen huseinzol05/v27-preprocessing/bm_1grams.json
[24]:
model_albert_tiny = malaya.spelling_correction.probability.load(language_model = albert_tiny)
INFO:malaya_boilerplate.huggingface:downloading frozen huseinzol05/v27-preprocessing/bm_1grams.json
[25]:
model_roberta_base = malaya.spelling_correction.probability.load(language_model = roberta_base)
INFO:malaya_boilerplate.huggingface:downloading frozen huseinzol05/v27-preprocessing/bm_1grams.json
[11]:
model_gpt2 = malaya.spelling_correction.probability.load(language_model = gpt2)
INFO:malaya_boilerplate.huggingface:downloading frozen huseinzol05/v27-preprocessing/bm_1grams.json
To correct a sentence#
def correct_text(
    self,
    text: str,
    lookback: int = 3,
    lookforward: int = 3,
):
    """
    Correct all the words within a text, returning the corrected text.

    Parameters
    ----------
    text: str
    lookback: int, optional (default=3)
        N words on the left hand side.
        if put -1, will take all words on the left hand side.
        longer left hand side will take longer to compute.
    lookforward: int, optional (default=3)
        N words on the right hand side.
        if put -1, will take all words on the right hand side.
        longer right hand side will take longer to compute.

    Returns
    -------
    result: str
    """
[12]:
strings = [string1, string2, string3, string4, string5, string6, string7]
[13]:
tokenizer = malaya.tokenizer.Tokenizer()
[26]:
for s in strings:
    tokenized = tokenizer.tokenize(s)
    print('original:', s)
    print('corrected:', model_kenlm.correct_text(' '.join(tokenized)))
    print()
original: krajaan patut bagi pencen awal skt kpd warga emas supaya emosi
corrected: kerajaan patut bagi pencen awal sikit kepada warga emas supaya emosi
original: Husein ska mkn aym dkat kampng Jawa
corrected: Husin suka makan ayam dekat kampung Jawa
original: Melayu malas ni narration dia sama je macam men are trash. True to some, false to some.
corrected: Melayu malas ini narration dia sama sahaja macam men are trash . True to some , false to some .
original: Tapi tak pikir ke bahaya perpetuate myths camtu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tau pulak marah. Your kids will be victims of that too.
corrected: Tapi tak fikir ke bahaya perpetuate myths macam itu . Nanti kalau ada hiring discrimination despite your good qualifications because of your race tahu pula marah . Your kids will be victims of that too .
original: DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as i am edging towards retirement in 4-5 years time after a career of being an Engineer, Project Manager, General Manager
corrected: DrM cerita Melayu malas semenjak saya kat University ( early 1980s ) and now as i am edging towards retirement in 4 - 5 years time after a career of being an Engineer , Project Manager , General Manager
original: blh bntg dlm kls nlp sy, nnti intch
corrected: boleh bintang dalam kelas nlp saya , nanti intch
original: mulakn slh org boleh ,bila geng tuh kena slhkn jgk xboleh trima .. pelik
corrected: mulakan salah orang boleh , bila geng itu kena salahkan juga xboleh terima . . pelik
[27]:
for s in strings:
    tokenized = tokenizer.tokenize(s)
    print('original:', s)
    print('corrected:', model_bert_base.correct_text(' '.join(tokenized)))
    print()
original: krajaan patut bagi pencen awal skt kpd warga emas supaya emosi
corrected: kerajaan patut bagi pencen awal sikit kepada warga emas supaya emosi
original: Husein ska mkn aym dkat kampng Jawa
corrected: Husin suka makin ayam dekat kampung Jawa
original: Melayu malas ni narration dia sama je macam men are trash. True to some, false to some.
corrected: Melayu malas ini narration dia sama sahaja macam men are trash . True to some , false to some .
original: Tapi tak pikir ke bahaya perpetuate myths camtu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tau pulak marah. Your kids will be victims of that too.
corrected: Tapi tak fikir ke bahaya perpetuate myths macam itu . Nanti kalau ada hiring discrimination despite your good qualifications because of your race tahu pula marah . Your kids will be victims of that too .
original: DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as i am edging towards retirement in 4-5 years time after a career of being an Engineer, Project Manager, General Manager
corrected: DrM cerita Melayu malas semenjak saya kat University ( early 1980s ) and now as i am edging towards retirement in 4 - 5 years time after a career of being an Engineer , Project Manager , General Manager
original: blh bntg dlm kls nlp sy, nnti intch
corrected: boleh bentang dalam kls nlp saya , nanti intch
original: mulakn slh org boleh ,bila geng tuh kena slhkn jgk xboleh trima .. pelik
corrected: mulakan salah orang boleh , bila geng itu kena salahkan juga xboleh terima . . pelik
[28]:
for s in strings:
    tokenized = tokenizer.tokenize(s)
    print('original:', s)
    print('corrected:', model_bert_tiny.correct_text(' '.join(tokenized)))
    print()
original: krajaan patut bagi pencen awal skt kpd warga emas supaya emosi
corrected: kerajaan patut bagi pencen awal sikit kepada warga emas supaya emosi
original: Husein ska mkn aym dkat kampng Jawa
corrected: Husin suka makin ayam dekat kampung Jawa
original: Melayu malas ni narration dia sama je macam men are trash. True to some, false to some.
corrected: Melayu malas ini narration dia sama sahaja macam men are trash . True to some , false to some .
original: Tapi tak pikir ke bahaya perpetuate myths camtu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tau pulak marah. Your kids will be victims of that too.
corrected: Tapi tak fikir ke bahaya perpetuate myths macam itu . Nanti kalau ada hiring discrimination despite your good qualifications because of your race tahu pula marah . Your kids will be victims of that too .
original: DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as i am edging towards retirement in 4-5 years time after a career of being an Engineer, Project Manager, General Manager
corrected: DrM cerita Melayu malas semenjak saya kat University ( early 1980s ) and now as i am edging towards retirement in 4 - 5 years time after a career of being an Engineer , Project Manager , General Manager
original: blh bntg dlm kls nlp sy, nnti intch
corrected: boleh bentang dalam kls nlp saya , nanti intch
original: mulakn slh org boleh ,bila geng tuh kena slhkn jgk xboleh trima .. pelik
corrected: mulakan salah orang boleh , bila geng itu kena salahkan juga xboleh terima . . pelik
[29]:
for s in strings:
    tokenized = tokenizer.tokenize(s)
    print('original:', s)
    print('corrected:', model_albert_base.correct_text(' '.join(tokenized)))
    print()
original: krajaan patut bagi pencen awal skt kpd warga emas supaya emosi
corrected: kerajaan patut bagi pencen awal sikit kepada warga emas supaya emosi
original: Husein ska mkn aym dkat kampng Jawa
corrected: Husin suka mukin yama dekat kampung Jawa
original: Melayu malas ni narration dia sama je macam men are trash. True to some, false to some.
corrected: Melayu malas ini narration dia sama sahaja macam men are trash . True to some , false to some .
original: Tapi tak pikir ke bahaya perpetuate myths camtu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tau pulak marah. Your kids will be victims of that too.
corrected: Tapi tak fikir ke bahaya perpetuate myths macam itu . Nanti kalau ada hiring discrimination despite your good qualifications because of your race tahu pula marah . Your kids will be victims of that too .
original: DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as i am edging towards retirement in 4-5 years time after a career of being an Engineer, Project Manager, General Manager
corrected: DrM cerita Melayu malas semenjak saya kat University ( early 1980s ) and now as i am edging towards retirement in 4 - 5 years time after a career of being an Engineer , Project Manager , General Manager
original: blh bntg dlm kls nlp sy, nnti intch
corrected: boleh bintang dalam kelas nlp saya , nanti intch
original: mulakn slh org boleh ,bila geng tuh kena slhkn jgk xboleh trima .. pelik
corrected: mulakan salah orang boleh , bila geng itu kena salahkan juga xboleh terima . . pelik
[30]:
for s in strings:
    tokenized = tokenizer.tokenize(s)
    print('original:', s)
    print('corrected:', model_albert_tiny.correct_text(' '.join(tokenized)))
    print()
original: krajaan patut bagi pencen awal skt kpd warga emas supaya emosi
corrected: kerajaan patut bagi pencen awal sakit kepada warga emas supaya emosi
original: Husein ska mkn aym dkat kampng Jawa
corrected: Husin suka makin ayam dekat kumpang Jawa
original: Melayu malas ni narration dia sama je macam men are trash. True to some, false to some.
corrected: Melayu malas ini narration dia sama sahaja macam men are trash . True to some , false to some .
original: Tapi tak pikir ke bahaya perpetuate myths camtu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tau pulak marah. Your kids will be victims of that too.
corrected: Tapi tak fikir ke bahaya perpetuate myths macam itu . Nanti kalau ada hiring discrimination despite your good qualifications because of your race tahu pula marah . Your kids will be victims of that too .
original: DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as i am edging towards retirement in 4-5 years time after a career of being an Engineer, Project Manager, General Manager
corrected: DrM cerita Melayu malas semenjak saya kat University ( early 1980s ) and now as i am edging towards retirement in 4 - 5 years time after a career of being an Engineer , Project Manager , General Manager
original: blh bntg dlm kls nlp sy, nnti intch
corrected: boleh bintang dalam kelas nlp saya , nanti intch
original: mulakn slh org boleh ,bila geng tuh kena slhkn jgk xboleh trima .. pelik
corrected: mulakan salah orang boleh , bila geng itu kena salahkan juga xboleh terima . . pelik
[31]:
for s in strings:
    tokenized = tokenizer.tokenize(s)
    print('original:', s)
    print('corrected:', model_roberta_base.correct_text(' '.join(tokenized)))
    print()
original: krajaan patut bagi pencen awal skt kpd warga emas supaya emosi
corrected: kerajaan patut bagi pencen awal sikit kepada warga emas supaya emosi
original: Husein ska mkn aym dkat kampng Jawa
corrected: Hussein suka mkn ayam dekat kampung Jawa
original: Melayu malas ni narration dia sama je macam men are trash. True to some, false to some.
corrected: Melayu malas ini narration dia sama sahaja macam men are trash . True to some , false to some .
original: Tapi tak pikir ke bahaya perpetuate myths camtu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tau pulak marah. Your kids will be victims of that too.
corrected: Tapi tak fikir ke bahaya perpetuate myths macam itu . Nanti kalau ada hiring discrimination despite your good qualifications because of your race tahu pula marah . Your kids will be victims of that too .
original: DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as i am edging towards retirement in 4-5 years time after a career of being an Engineer, Project Manager, General Manager
corrected: DrM cerita Melayu malas semenjak saya kat University ( early 1980s ) and now as i am edging towards retirement in 4 - 5 years time after a career of being an Engineer , Project Manager , General Manager
original: blh bntg dlm kls nlp sy, nnti intch
corrected: boleh banting dalam klise nlp saye , nanti intch
original: mulakn slh org boleh ,bila geng tuh kena slhkn jgk xboleh trima .. pelik
corrected: mulakan salah orang boleh , bila geng itu kena salahkan juga xboleh terima . . pelik
[14]:
for s in strings:
    tokenized = tokenizer.tokenize(s)
    print('original:', s)
    print('corrected:', model_gpt2.correct_text(' '.join(tokenized)))
    print()
original: krajaan patut bagi pencen awal skt kpd warga emas supaya emosi
corrected: kerajaan patut bagi pencen awal sikit kepada warga emas supaya emosi
original: Husein ska mkn aym dkat kampng Jawa
corrected: Husein suka mkn ayam dekat kampung Jawa
original: Melayu malas ni narration dia sama je macam men are trash. True to some, false to some.
corrected: Melayu malas ini narration dia sama sahaja macam men are trash . True to some , false to some .
original: Tapi tak pikir ke bahaya perpetuate myths camtu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tau pulak marah. Your kids will be victims of that too.
corrected: Tapi tak fikir ke bahaya perpetuate myths macam itu . Nanti kalau ada hiring discrimination despite your good qualifications because of your race tahu pula marah . Your kids will be victims of that too .
original: DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as i am edging towards retirement in 4-5 years time after a career of being an Engineer, Project Manager, General Manager
corrected: DrM cerita Melayu malas semenjak saya kat University ( early 1980s ) and now as i am edging towards retirement in 4 - 5 years time after a career of being an Engineer , Project Manager , General Manager
original: blh bntg dlm kls nlp sy, nnti intch
corrected: boleh bintang dalam kelas nlp saya , nanti intch
original: mulakn slh org boleh ,bila geng tuh kena slhkn jgk xboleh trima .. pelik
corrected: mulakan salah orang boleh , bila geng itu kena salahkan juga xboleh terima . . pelik
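Most of the outputs above are identical across models, so the differences are easy to miss when reading them line by line. The sketch below shows one possible way to surface them: a small hypothetical helper, `token_diff`, that lists the positions where two corrected strings disagree. The three sample strings are the corrections of `string2` copied from the KenLM, GPT2, and ALBERT base runs above.

```python
def token_diff(a, b):
    """Return (position, token_a, token_b) for every position where two outputs differ."""
    toks_a, toks_b = a.split(), b.split()
    # zip stops at the shorter sequence; acceptable here because both outputs
    # come from the same tokenized input and have the same number of tokens.
    return [(i, x, y) for i, (x, y) in enumerate(zip(toks_a, toks_b)) if x != y]


# Corrections of string2, copied from the runs above.
kenlm_out = 'Husin suka makan ayam dekat kampung Jawa'
gpt2_out = 'Husein suka mkn ayam dekat kampung Jawa'
albert_base_out = 'Husin suka mukin yama dekat kampung Jawa'

print(token_diff(kenlm_out, gpt2_out))
# [(0, 'Husin', 'Husein'), (2, 'makan', 'mkn')]
print(token_diff(kenlm_out, albert_base_out))
# [(2, 'makan', 'mukin'), (3, 'ayam', 'yama')]
```

On this sentence the models disagree on only two words each: GPT2 keeps the name `Husein` but leaves `mkn` uncorrected, while ALBERT base changes `mkn` and `aym` to words other than `makan` and `ayam`.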