This tutorial is available as an IPython notebook at Malaya/example/kenlm.

KenLM is a very fast, accurate, non-neural-network language model.

import malaya

Make sure the KenLM Python bindings are already installed:

pip3 install pypi-kenlm==0.1.20210121

This package is a simple Python wrapper for the original KenLM.

List available KenLM models

Model             Size (MB)  LM order  Description                                        Command
bahasa-news       107        3         local news.                                        [./lmplz --text text.txt --arpa -o 3 ...
bahasa-wiki       70.5       3         MS wikipedia.                                      [./lmplz --text text.txt --arpa -o 3 ...
bahasa-wiki-news  29         3         MS wikipedia + local news.                         [./lmplz --text text.txt --arpa -o 3 ...
redape-community  887.1      4         Mirror for                                         [./lmplz --text text.txt --arpa -o 4 ...
dump-combined     310        3         Academia + News + IIUM + Parliament + Watpadd ...  [./lmplz --text text.txt --arpa -o 3 ...

Load KenLM model

def kenlm(model: str = 'dump-combined', **kwargs):
    """Load KenLM language model.

    Parameters
    ----------
    model: str, optional (default='dump-combined')
        Check available models at `malaya.language_model.available_models()`.

    Returns
    -------
    result: kenlm.Model class
    """
model = malaya.language_model.kenlm()
model.score('saya suke awak')
model.score('saya suka awak')
model.score('najib razak')
model.score('najib comel')
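
In the kenlm Python module, `score` returns the log10 probability of the whole sentence, so a less negative value means the model finds the sentence more plausible. A minimal sketch of using that to choose between spelling candidates is shown here. `rank_candidates` and `StubModel` are hypothetical names introduced for illustration, and the stub's scores are invented so that the snippet runs without downloading a model; with a real model you would pass the object returned by `malaya.language_model.kenlm()` instead.

```python
from typing import List, Protocol, Tuple


class Scorer(Protocol):
    """Anything exposing a `score(sentence) -> float` method, like kenlm.Model."""

    def score(self, sentence: str) -> float: ...


def rank_candidates(model: Scorer, candidates: List[str]) -> List[Tuple[str, float]]:
    """Return (sentence, log10 probability) pairs, most plausible first."""
    scored = [(sentence, model.score(sentence)) for sentence in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


class StubModel:
    """Hypothetical stand-in for a loaded model; the scores below are invented."""

    _scores = {'saya suka awak': -10.0, 'saya suke awak': -15.0}

    def score(self, sentence: str) -> float:
        # Unknown sentences get an arbitrary very low score in this stub.
        return self._scores.get(sentence, -99.0)


# Swap StubModel() for a real model to rank actual candidates.
ranked = rank_candidates(StubModel(), ['saya suke awak', 'saya suka awak'])
for sentence, log10_prob in ranked:
    print(f'{log10_prob:8.2f}  {sentence}')
```

Raw scores favour shorter sentences because every extra word adds a negative term, so this comparison is most meaningful between candidates of the same length.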

Build custom Language Model

  1. Build KenLM from source,

wget -O - |tar xz
mkdir kenlm/build
cd kenlm/build
cmake ..
make -j2
  2. Prepare a newline-delimited text file (one sentence per line), then train the ARPA model and convert it to a binary trie,

kenlm/build/bin/lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1
kenlm/build/bin/build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm
  3. Once you have out.trie.klm, you can load it through the kenlm scorer interface,

import kenlm
model = kenlm.Model('out.trie.klm')
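
After loading a custom model it is often useful to check which words it has never seen. The sketch below relies on `kenlm.Model.full_scores`, which yields one `(log10 probability, n-gram length, is_oov)` tuple per word plus a final one for the end-of-sentence marker. `explain_sentence`, `perplexity_from_score`, and `StubModel` are hypothetical helpers introduced here, and the stub's numbers are invented so the snippet runs without a trained model; pass the `model` loaded above to inspect real output.

```python
from typing import List, Tuple


def explain_sentence(model, sentence: str) -> List[Tuple[str, float, int, bool]]:
    """Pair each token (plus the end-of-sentence marker) with its full_scores entry."""
    tokens = sentence.split() + ['</s>']
    return [
        (token, log10_prob, ngram_length, oov)
        for token, (log10_prob, ngram_length, oov) in zip(tokens, model.full_scores(sentence))
    ]


def perplexity_from_score(log10_prob: float, n_words: int) -> float:
    """Convert a sentence log10 probability to perplexity, counting </s> as one extra token."""
    return 10.0 ** (-log10_prob / (n_words + 1))


class StubModel:
    """Hypothetical stand-in for kenlm.Model; vocabulary and scores are invented."""

    _vocab = {'saya', 'suka', 'awak'}

    def full_scores(self, sentence, bos=True, eos=True):
        for word in sentence.split():
            known = word in self._vocab
            yield (-1.0 if known else -5.0, 2 if known else 1, not known)
        yield (-0.5, 2, False)  # entry for the end-of-sentence marker


rows = explain_sentence(StubModel(), 'saya suke awak')
for token, log10_prob, ngram_length, oov in rows:
    flag = 'OOV' if oov else ''
    print(f'{token:8s} {log10_prob:6.2f} {ngram_length}-gram {flag}')
```

A high share of out-of-vocabulary tokens on in-domain text usually suggests the training corpus or its normalization does not match the text being scored.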