Masked LM#

This tutorial is available as an IPython notebook at Malaya/example/mlm.

Masked Language Model Scoring.

We can use BERT, ALBERT, RoBERTa and DeBERTa-v2 models from HuggingFace to score text.
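Under the hood, masked-LM scoring typically computes a pseudo-log-likelihood: each token is masked in turn, and the log-probability the model assigns to the original token at the masked position is summed. Below is a minimal sketch of that idea; `toy_mask_prob` is a hypothetical stand-in for a real BERT forward pass and is not Malaya's API:

```python
import math

def toy_mask_prob(tokens, i):
    # Hypothetical stand-in for a masked-LM forward pass: the probability
    # the "model" assigns to tokens[i] when that position is '[MASK]'.
    masked = tokens[:i] + ['[MASK]'] + tokens[i + 1:]
    freq = {'saya': 0.30, 'suka': 0.25, 'suke': 0.02, 'awak': 0.25}
    return freq.get(tokens[i], 0.001)  # the context `masked` is ignored in this toy

def pseudo_log_likelihood(sentence):
    # Mask each position once and accumulate log P(original token | rest).
    tokens = sentence.split()
    return sum(math.log(toy_mask_prob(tokens, i)) for i in range(len(tokens)))

# The grammatical sentence gets a higher (less negative) score than the typo.
assert pseudo_log_likelihood('saya suka awak') > pseudo_log_likelihood('saya suke awak')
```

A real scorer runs one masked forward pass per token (or batches them), but the accumulation of per-token log-probabilities is the same.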

import malaya
/home/husein/dev/malaya/malaya/ FutureWarning: Possible nested set at position 3397
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/husein/dev/malaya/malaya/ FutureWarning: Possible nested set at position 3927
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))

List available MLM models#

malaya.language_model.available_mlm
{'mesolitica/bert-base-standard-bahasa-cased': {'Size (MB)': 310},
 'mesolitica/bert-tiny-standard-bahasa-cased': {'Size (MB)': 66.1},
 'mesolitica/roberta-base-standard-bahasa-cased': {'Size (MB)': 443},
 'mesolitica/roberta-tiny-standard-bahasa-cased': {'Size (MB)': 66.1},
 'mesolitica/malaysian-debertav2-base': {'Size (MB)': 228}}

Load MLM model#

def mlm(
    model: str = 'mesolitica/bert-tiny-standard-bahasa-cased',
    force_check: bool = True,
    **kwargs,
):
    """
    Load Masked language model.

    Parameters
    ----------
    model: str, optional (default='mesolitica/bert-tiny-standard-bahasa-cased')
        Check available models at `malaya.language_model.available_mlm`.
    force_check: bool, optional (default=True)
        Check that the model is one of Malaya's models.
        Set to False if you have your own huggingface model.

    Returns
    -------
    result: malaya.torch_model.mask_lm.MLMScorer class
    """
model = malaya.language_model.mlm(model = 'mesolitica/bert-tiny-standard-bahasa-cased')
Score a misspelled sentence against its correct form ('saya suka awak', "I like you"), and a common name ('najib razak') against a less likely phrase ('najib comel', "najib is cute"); a higher score indicates the model finds the text more natural.

model.score('saya suke awak')
model.score('saya suka awak')
model.score('najib razak')
model.score('najib comel')
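A common use of these scores is reranking: given several candidate spellings or phrasings, keep the one the language model finds most natural. A sketch of the pattern, assuming a `score` callable shaped like `model.score` above; `toy_score` is a hypothetical frequency-based stand-in so the example runs without model weights:

```python
import math

def toy_score(sentence):
    # Hypothetical stand-in for model.score: sum of log unigram
    # frequencies, so common Malay spellings score higher.
    freq = {'saya': 0.30, 'suka': 0.25, 'suke': 0.02, 'awak': 0.25}
    return sum(math.log(freq.get(tok, 0.001)) for tok in sentence.split())

def best_candidate(candidates, score=toy_score):
    # Rerank: keep the candidate with the highest (least negative) score.
    return max(candidates, key=score)

print(best_candidate(['saya suke awak', 'saya suka awak']))  # saya suka awak
```

With a loaded Malaya model, pass `score=model.score` in place of the toy scorer.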