KenLM#

This tutorial is available as an IPython notebook at Malaya/example/kenlm.

A very fast and accurate language model that is not neural-network based: https://github.com/kpu/kenlm

[1]:
import malaya
/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))

Dependency#

Make sure you have already installed it,

pip3 install pypi-kenlm==0.1.20210121

A simple Python wrapper around the original https://github.com/kpu/kenlm.
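A minimal sanity check that the wrapper is importable:

import kenlm

# if the wrapper installed correctly, this prints the Model class instead of raising ImportError
print(kenlm.Model)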

List available KenLM models#

[2]:
malaya.language_model.available_kenlm
[2]:
{'bahasa-wiki': {'Size (MB)': 70.5,
  'LM order': 3,
  'Description': 'MS wikipedia.',
  'Command': ['./lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1',
   './build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm']},
 'bahasa-news': {'Size (MB)': 107,
  'LM order': 3,
  'Description': 'local news.',
  'Command': ['./lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1',
   './build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm']},
 'bahasa-wiki-news': {'Size (MB)': 165,
  'LM order': 3,
  'Description': 'MS wikipedia + local news.',
  'Command': ['./lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1',
   './build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm']},
 'bahasa-wiki-news-iium-stt': {'Size (MB)': 416,
  'LM order': 3,
  'Description': 'MS wikipedia + local news + IIUM + STT',
  'Command': ['./lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1',
   './build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm']},
 'dump-combined': {'Size (MB)': 310,
  'LM order': 3,
  'Description': 'Academia + News + IIUM + Parliament + Watpadd + Wikipedia + Common Crawl',
  'Command': ['./lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1',
   './build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm']},
 'redape-community': {'Size (MB)': 887.1,
  'LM order': 4,
  'Description': 'Mirror for https://github.com/redapesolutions/suara-kami-community',
  'Command': ['./lmplz --text text.txt --arpa out.arpa -o 4 --prune 0 1 1 1',
   './build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm']}}

Load KenLM model#

def kenlm(model: str = 'dump-combined', **kwargs):
    """
    Load KenLM language model.

    Parameters
    ----------
    model: str, optional (default='dump-combined')
        Check available models at `malaya.language_model.available_kenlm`.
    Returns
    -------
    result: kenlm.Model class
    """
[3]:
model = malaya.language_model.kenlm()
[4]:
model.score('saya suke awak')
[4]:
-11.912322044372559
[5]:
model.score('saya suka awak')
[5]:
-6.80517053604126
[6]:
model.score('najib razak')
[6]:
-5.256608009338379
[7]:
model.score('najib comel')
[7]:
-10.580080032348633
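The scores are base-10 log probabilities, so a less negative value means a more likely sentence. A minimal sketch of ranking candidates with the `model` loaded above:

# pick the candidate with the highest (least negative) log10 probability
candidates = ['saya suke awak', 'saya suka awak']
best = max(candidates, key=model.score)
print(best)  # 'saya suka awak', consistent with the scores above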

Build custom Language Model#

  1. Build KenLM from source,

wget -O - https://kheafield.com/code/kenlm.tar.gz |tar xz
mkdir kenlm/build
cd kenlm/build
cmake ..
make -j2
  2. Prepare a newline-delimited text file. Feel free to use some from https://github.com/mesolitica/malaysian-dataset/tree/master/dumping,

kenlm/build/bin/lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1
kenlm/build/bin/build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm
  3. Once you have out.trie.klm, you can load it into the scorer interface,

import kenlm
model = kenlm.Model('out.trie.klm')
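From there the interface is the same as the bundled models. Continuing from the snippet above, per-token details can be inspected with `full_scores` (a sketch, assuming out.trie.klm was built as described and the sentence is reasonable for its training data):

sentence = 'saya suka awak'
print(model.score(sentence))  # total log10 probability of the sentence

# per-token breakdown: (log10 probability, matched n-gram length, out-of-vocabulary flag)
tokens = sentence.split() + ['</s>']
for (logprob, ngram_length, oov), token in zip(model.full_scores(sentence), tokens):
    print(token, logprob, ngram_length, oov)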