Masked LM#
This tutorial is available as an IPython notebook at Malaya/example/mlm.
Masked Language Model Scoring, https://arxiv.org/abs/1910.14659
We can use BERT, ALBERT, RoBERTa, and DeBERTa-v2 models from HuggingFace for text scoring.
[1]:
import malaya
/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397
self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927
self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
List available MLM models#
[2]:
malaya.language_model.available_mlm
[2]:
{'mesolitica/bert-base-standard-bahasa-cased': {'Size (MB)': 310},
'mesolitica/bert-tiny-standard-bahasa-cased': {'Size (MB)': 66.1},
'mesolitica/roberta-base-standard-bahasa-cased': {'Size (MB)': 443},
'mesolitica/roberta-tiny-standard-bahasa-cased': {'Size (MB)': 66.1},
'mesolitica/malaysian-debertav2-base': {'Size (MB)': 228}}
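Since `available_mlm` is a plain dictionary mapping model names to sizes, you can select a model programmatically. A minimal sketch, assuming the dictionary shape shown above (the dictionary is inlined here so the snippet is self-contained):

```python
# Sketch: pick the smallest available MLM model by download size.
# Assumes the {name: {'Size (MB)': ...}} shape shown above.
available_mlm = {
    'mesolitica/bert-base-standard-bahasa-cased': {'Size (MB)': 310},
    'mesolitica/bert-tiny-standard-bahasa-cased': {'Size (MB)': 66.1},
    'mesolitica/roberta-base-standard-bahasa-cased': {'Size (MB)': 443},
    'mesolitica/roberta-tiny-standard-bahasa-cased': {'Size (MB)': 66.1},
    'mesolitica/malaysian-debertav2-base': {'Size (MB)': 228},
}

# min() returns the first key with the minimal size when there are ties.
smallest = min(available_mlm, key=lambda name: available_mlm[name]['Size (MB)'])
print(smallest)  # 'mesolitica/bert-tiny-standard-bahasa-cased'
```

The selected name can then be passed straight to `malaya.language_model.mlm(model=...)`.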
Load MLM model#
def mlm(
model: str = 'mesolitica/bert-tiny-standard-bahasa-cased',
force_check: bool = True,
**kwargs
):
"""
Load Masked language model.
Parameters
----------
model: str, optional (default='mesolitica/bert-tiny-standard-bahasa-cased')
Check available models at `malaya.language_model.available_mlm`.
force_check: bool, optional (default=True)
Check that the model is one of the models provided by Malaya.
Set to False if you are loading your own HuggingFace model.
Returns
-------
result: malaya.torch_model.mask_lm.MLMScorer class
"""
[9]:
model = malaya.language_model.mlm(model = 'mesolitica/bert-tiny-standard-bahasa-cased')
[10]:
model.score('saya suke awak')
[10]:
-28.428839
[11]:
model.score('saya suka awak')
[11]:
-11.658715
[12]:
model.score('najib razak')
[12]:
-1.1320121
[13]:
model.score('najib comel')
[13]:
-19.881565
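Because `model.score` returns a sentence-level log pseudo-likelihood (per the Masked Language Model Scoring paper above), higher is more fluent, and you can use it to rank candidate sentences. A minimal sketch of that idea, with the scorer stood in by the bert-tiny outputs shown above so the snippet runs without loading the model:

```python
# Sketch: rank candidate sentences by MLM score and keep the best one.
# The numbers stand in for `model.score(...)` outputs from the cells above;
# with a loaded model you would compute them as {s: model.score(s) for s in candidates}.
scores = {
    'saya suke awak': -28.428839,  # misspelled 'suka', penalised heavily
    'saya suka awak': -11.658715,  # correct spelling, higher score
}

best = max(scores, key=scores.get)
print(best)  # 'saya suka awak'
```

This candidate-ranking pattern is a common use of MLM scorers, e.g. choosing between spelling-correction or machine-translation hypotheses.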