{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# KenLM" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/kenlm](https://github.com/huseinzol05/Malaya/tree/master/example/kenlm).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A very fast language model, accurate and non neural-network, https://github.com/kpu/kenlm" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n", "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n" ] } ], "source": [ "import malaya" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dependency\n", "\n", "Make sure you already installed,\n", "\n", "```bash\n", "pip3 install pypi-kenlm==0.1.20210121\n", "```\n", "\n", "A simple python wrapper for original https://github.com/kpu/kenlm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available KenLM models" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'bahasa-wiki': {'Size (MB)': 70.5,\n", " 'LM order': 3,\n", " 'Description': 'MS wikipedia.',\n", " 'Command': ['./lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1',\n", " './build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm']},\n", " 'bahasa-news': {'Size (MB)': 107,\n", " 'LM order': 3,\n", " 'Description': 'local news.',\n", " 'Command': ['./lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1',\n", " './build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm']},\n", " 'bahasa-wiki-news': {'Size (MB)': 165,\n", " 'LM order': 3,\n", " 'Description': 'MS wikipedia + local news.',\n", " 'Command': ['./lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1',\n", " './build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm']},\n", " 'bahasa-wiki-news-iium-stt': {'Size (MB)': 416,\n", " 'LM order': 3,\n", " 'Description': 'MS wikipedia + local news + IIUM + STT',\n", " 'Command': ['./lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1',\n", " './build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm']},\n", " 'dump-combined': {'Size (MB)': 310,\n", " 'LM order': 3,\n", " 'Description': 'Academia + News + IIUM + Parliament + Watpadd + Wikipedia + Common Crawl',\n", " 'Command': ['./lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1',\n", " './build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm']},\n", " 'redape-community': {'Size (MB)': 887.1,\n", " 'LM order': 4,\n", " 'Description': 'Mirror for https://github.com/redapesolutions/suara-kami-community',\n", " 'Command': ['./lmplz --text text.txt --arpa out.arpa -o 4 --prune 0 1 1 1',\n", " './build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm']}}" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.language_model.available_kenlm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load KenLM model\n", "\n", "```python\n", "def kenlm(model: str = 'dump-combined', **kwargs):\n", " \"\"\"\n", " Load KenLM language model.\n", "\n", " Parameters\n", " ----------\n", " model: str, optional (default='dump-combined')\n", " Check available models at `malaya.language_model.available_kenlm`.\n", " Returns\n", " -------\n", " result: kenlm.Model class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "model = malaya.language_model.kenlm()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-11.912322044372559" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.score('saya suke awak')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-6.80517053604126" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.score('saya suka awak')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-5.256608009338379" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.score('najib razak')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-10.580080032348633" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.score('najib comel')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Build custom Language Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Build KenLM from source,\n", "\n", "```bash\n", "wget -O - https://kheafield.com/code/kenlm.tar.gz |tar xz\n", "mkdir kenlm/build\n", "cd kenlm/build\n", "cmake ..\n", "make -j2\n", "```\n", "\n", "2. Prepare newlines text file. Feel free to use some from https://github.com/mesolitica/malaysian-dataset/tree/master/dumping,\n", "\n", "```bash\n", "kenlm/build/bin/lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1\n", "kenlm/build/bin/build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm\n", "```\n", "\n", "3. Once you have out.trie.klm, you can load to scorer interface,\n", "\n", "```python\n", "import kenlm\n", "model = kenlm.Model('out.trie.klm')\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }