Masked LM

This tutorial is available as an IPython notebook at Malaya/example/mlm.

Masked Language Model Scoring, https://arxiv.org/abs/1910.14659

We can use BERT, ALBERT, RoBERTa and DeBERTa-v2 models from HuggingFace to score text.
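Under the hood, MLM scoring follows the pseudo-log-likelihood (PLL) approach from the paper above: mask each token in turn, ask the model for the log-probability of the original token at the masked position, and sum over all positions. A minimal sketch of that loop, using a hypothetical `toy_logprob` stand-in rather than a real model:

```python
import math

def pseudo_log_likelihood(tokens, masked_logprob):
    """Sum log P(token_i | sentence with position i masked) over all positions,
    following the pseudo-log-likelihood scoring of Salazar et al. (2019)."""
    total = 0.0
    for i, token in enumerate(tokens):
        masked = tokens[:i] + ['[MASK]'] + tokens[i + 1:]
        total += masked_logprob(masked, i, token)
    return total

# Hypothetical stand-in for a real MLM: assigns higher probability
# to the correct spelling 'suka' than to the typo 'suke'.
def toy_logprob(masked_tokens, position, token):
    vocab_prob = {'suka': 0.6, 'suke': 0.01}
    return math.log(vocab_prob.get(token, 0.1))

# The grammatical sentence gets a higher (less negative) score.
good = pseudo_log_likelihood(['saya', 'suka', 'awak'], toy_logprob)
bad = pseudo_log_likelihood(['saya', 'suke', 'awak'], toy_logprob)
```

A real scorer runs one forward pass per masked position, so scoring a sentence of *n* tokens costs *n* inferences; the toy probabilities here only illustrate the accounting.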

[1]:
import malaya
/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))

List available MLM models

[2]:
malaya.language_model.available_mlm
[2]:
{'mesolitica/bert-base-standard-bahasa-cased': {'Size (MB)': 310},
 'mesolitica/bert-tiny-standard-bahasa-cased': {'Size (MB)': 66.1},
 'mesolitica/roberta-base-standard-bahasa-cased': {'Size (MB)': 443},
 'mesolitica/roberta-tiny-standard-bahasa-cased': {'Size (MB)': 66.1},
 'mesolitica/malaysian-debertav2-base': {'Size (MB)': 228}}

Load MLM model

def mlm(
    model: str = 'mesolitica/bert-tiny-standard-bahasa-cased',
    force_check: bool = True,
    **kwargs
):
    """
    Load Masked language model.

    Parameters
    ----------
    model: str, optional (default='mesolitica/bert-tiny-standard-bahasa-cased')
        Check available models at `malaya.language_model.available_mlm`.
    force_check: bool, optional (default=True)
        Force check that the model is one of the Malaya models.
        Set to False if you have your own HuggingFace model.

    Returns
    -------
    result: malaya.torch_model.mask_lm.MLMScorer class
    """
[9]:
model = malaya.language_model.mlm(model = 'mesolitica/bert-tiny-standard-bahasa-cased')
[10]:
model.score('saya suke awak')
[10]:
-28.428839
[11]:
model.score('saya suka awak')
[11]:
-11.658715
[12]:
model.score('najib razak')
[12]:
-1.1320121
[13]:
model.score('najib comel')
[13]:
-19.881565
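Since a higher (less negative) score indicates more fluent text, the scorer can rank candidate sentences, for example to pick the correct spelling among variants. A small helper sketch (`rank_by_score` is hypothetical, not part of Malaya; `model.score` is the method used above):

```python
def rank_by_score(scorer, sentences):
    """Order candidate sentences from highest (most fluent) to lowest score.

    `scorer` is any callable returning a log-likelihood-style score,
    such as `model.score` from the loaded MLM model."""
    return sorted(sentences, key=scorer, reverse=True)

# With the scores shown above (-11.66 vs -28.43), the correct
# spelling 'saya suka awak' would rank first:
# rank_by_score(model.score, ['saya suke awak', 'saya suka awak'])
```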