KenLM#
This tutorial is available as an IPython notebook at Malaya/example/kenlm.
A very fast and accurate language model that is not neural-network based, https://github.com/kpu/kenlm
[1]:
import malaya
/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397
self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927
self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
Dependency#
Make sure you have already installed,
pip3 install pypi-kenlm==0.1.20210121
A simple Python wrapper around the original https://github.com/kpu/kenlm.
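The wrapper exposes a kenlm.Model class that loads a binary or ARPA file and scores sentences. A minimal sketch, assuming you already have a trained model file (the path here is hypothetical),
import kenlm

# load a binary KenLM model and score a sentence (log10 probability)
lm = kenlm.Model('out.trie.klm')
print(lm.score('saya suka awak', bos=True, eos=True))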
List available KenLM models#
[2]:
malaya.language_model.available_kenlm
[2]:
{'bahasa-wiki': {'Size (MB)': 70.5,
'LM order': 3,
'Description': 'MS wikipedia.',
'Command': ['./lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1',
'./build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm']},
'bahasa-news': {'Size (MB)': 107,
'LM order': 3,
'Description': 'local news.',
'Command': ['./lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1',
'./build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm']},
'bahasa-wiki-news': {'Size (MB)': 165,
'LM order': 3,
'Description': 'MS wikipedia + local news.',
'Command': ['./lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1',
'./build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm']},
'bahasa-wiki-news-iium-stt': {'Size (MB)': 416,
'LM order': 3,
'Description': 'MS wikipedia + local news + IIUM + STT',
'Command': ['./lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1',
'./build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm']},
'dump-combined': {'Size (MB)': 310,
'LM order': 3,
'Description': 'Academia + News + IIUM + Parliament + Watpadd + Wikipedia + Common Crawl',
'Command': ['./lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1',
'./build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm']},
'redape-community': {'Size (MB)': 887.1,
'LM order': 4,
'Description': 'Mirror for https://github.com/redapesolutions/suara-kami-community',
'Command': ['./lmplz --text text.txt --arpa out.arpa -o 4 --prune 0 1 1 1',
'./build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm']}}
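available_kenlm is a plain dict, so you can inspect it programmatically, for example to compare download sizes and n-gram orders,
for name, meta in malaya.language_model.available_kenlm.items():
    # each entry carries the download size, n-gram order and a short description
    print(name, meta['Size (MB)'], meta['LM order'], meta['Description'])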
Load KenLM model#
def kenlm(model: str = 'dump-combined', **kwargs):
"""
Load KenLM language model.
Parameters
----------
model: str, optional (default='dump-combined')
Check available models at `malaya.language_model.available_kenlm`.
Returns
-------
result: kenlm.Model class
"""
[3]:
model = malaya.language_model.kenlm()
[4]:
model.score('saya suke awak')
[4]:
-11.912322044372559
[5]:
model.score('saya suka awak')
[5]:
-6.80517053604126
[6]:
model.score('najib razak')
[6]:
-5.256608009338379
[7]:
model.score('najib comel')
[7]:
-10.580080032348633
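score returns the log10 probability of the whole sentence, so higher (less negative) is better. That makes it easy to rerank candidate sentences, for example the spelling variants above, a small sketch using the loaded model,
candidates = ['saya suke awak', 'saya suka awak']
# pick the candidate the language model considers most probable
best = max(candidates, key=model.score)
print(best)  # 'saya suka awak', since -6.81 > -11.91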
Build custom Language Model#
Build KenLM from source,
wget -O - https://kheafield.com/code/kenlm.tar.gz |tar xz
mkdir kenlm/build
cd kenlm/build
cmake ..
make -j2
Prepare a newline-delimited text file, one sentence per line (a sketch of writing one follows the commands below). Feel free to use data from https://github.com/mesolitica/malaysian-dataset/tree/master/dumping,
kenlm/build/bin/lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1
kenlm/build/bin/build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm
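For reference, a minimal sketch of producing the newline-delimited text.txt expected by lmplz, using placeholder sentences (substitute your own corpus),
sentences = ['saya suka awak', 'najib razak']  # placeholder corpus, one sentence per line
with open('text.txt', 'w') as fopen:
    fopen.write('\n'.join(sentences))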
Once you have out.trie.klm, you can load it into the scorer interface,
import kenlm
model = kenlm.Model('out.trie.klm')
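The custom model exposes the same scorer interface as the bundled models, so a quick sanity check (sketch),
print(model.score('saya suka awak'))
print(model.perplexity('saya suka awak'))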