{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# KenLM" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/kenlm](https://github.com/huseinzol05/Malaya/tree/master/example/kenlm).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A very fast language model, accurate and non neural-network, https://github.com/kpu/kenlm" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/.local/lib/python3.8/site-packages/tensorflow_addons/utils/ensure_tf_install.py:53: UserWarning: Tensorflow Addons supports using Python ops for all Tensorflow versions above or equal to 2.3.0 and strictly below 2.5.0 (nightly versions are not supported). \n", " The versions of TensorFlow you are currently using is 2.6.0 and is not supported. \n", "Some things might work, some things might not.\n", "If you were to encounter a bug, do not file an issue.\n", "If you want to make sure you're using a tested and supported configuration, either change the TensorFlow version or the TensorFlow Addons's version. \n", "You can find the compatibility matrix in TensorFlow Addon's readme:\n", "https://github.com/tensorflow/addons\n", " warnings.warn(\n", "/home/husein/.local/lib/python3.8/site-packages/tensorflow_addons/utils/resource_loader.py:72: UserWarning: You are currently using TensorFlow 2.6.0 and trying to load a custom op (custom_ops/seq2seq/_beam_search_ops.so).\n", "TensorFlow Addons has compiled its custom ops against TensorFlow 2.4.0, and there are no compatibility guarantees between the two versions. \n", "This means that you might get segfaults when loading the custom op, or other kind of low-level errors.\n", " If you do, do not file an issue on Github. This is a known limitation.\n", "\n", "It might help you to fallback to pure Python ops with TF_ADDONS_PY_OPS . To do that, see https://github.com/tensorflow/addons#gpucpu-custom-ops \n", "\n", "You can also change the TensorFlow version installed on your system. 
You would need a TensorFlow version equal to or above 2.4.0 and strictly below 2.5.0.\n", " Note that nightly versions of TensorFlow, as well as non-pip TensorFlow like `conda install tensorflow` or compiled from source are not supported.\n", "\n", "The last solution is to find the TensorFlow Addons version that has custom ops compatible with the TensorFlow installed on your system. To do that, refer to the readme: https://github.com/tensorflow/addons\n", " warnings.warn(\n", "/home/husein/dev/malaya-boilerplate/malaya_boilerplate/frozen_graph.py:35: UserWarning: Cannot import beam_search_ops from Tensorflow Addons, ['malaya.jawi_rumi.deep_model', 'malaya.phoneme.deep_model', 'malaya.rumi_jawi.deep_model', 'malaya.stem.deep_model'] will not available to use, make sure Tensorflow Addons version >= 0.12.0\n", " warnings.warn(\n", "/home/husein/dev/malaya-boilerplate/malaya_boilerplate/frozen_graph.py:38: UserWarning: check compatible Tensorflow version with Tensorflow Addons at https://github.com/tensorflow/addons/releases\n", " warnings.warn(\n", "/home/husein/dev/malaya/malaya/tokenizer.py:202: FutureWarning: Possible nested set at position 3361\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n", "/home/husein/dev/malaya/malaya/tokenizer.py:202: FutureWarning: Possible nested set at position 3879\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n" ] } ], "source": [ "import malaya" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dependency\n", "\n", "Make sure you already installed,\n", "\n", "```bash\n", "pip3 install pypi-kenlm==0.1.20210121\n", "```\n", "\n", "A simple python wrapper for original https://github.com/kpu/kenlm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available KenLM models" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "  <thead>\n", "    <tr>\n", "      <th></th>\n", "      <th>Size (MB)</th>\n", "      <th>LM order</th>\n", "      <th>Description</th>\n", "      <th>Command</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr>\n", "      <th>bahasa-news</th>\n", "      <td>107</td>\n", "      <td>3</td>\n", "      <td>local news.</td>\n", "      <td>[./lmplz --text text.txt --arpa out.arpa -o 3 ...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>bahasa-wiki</th>\n", "      <td>70.5</td>\n", "      <td>3</td>\n", "      <td>MS wikipedia.</td>\n", "      <td>[./lmplz --text text.txt --arpa out.arpa -o 3 ...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>bahasa-wiki-news</th>\n", "      <td>29</td>\n", "      <td>3</td>\n", "      <td>MS wikipedia + local news.</td>\n", "      <td>[./lmplz --text text.txt --arpa out.arpa -o 3 ...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>redape-community</th>\n", "      <td>887.1</td>\n", "      <td>4</td>\n", "      <td>Mirror for https://github.com/redapesolutions/...</td>\n", "      <td>[./lmplz --text text.txt --arpa out.arpa -o 4 ...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>dump-combined</th>\n", "      <td>310</td>\n", "      <td>3</td>\n", "      <td>Academia + News + IIUM + Parliament + Watpadd ...</td>\n", "      <td>[./lmplz --text text.txt --arpa out.arpa -o 3 ...</td>\n", "    </tr>\n",
"  </tbody>\n", "</table>\n", "
" ], "text/plain": [ " Size (MB) LM order \\\n", "bahasa-news 107 3 \n", "bahasa-wiki 70.5 3 \n", "bahasa-wiki-news 29 3 \n", "redape-community 887.1 4 \n", "dump-combined 310 3 \n", "\n", " Description \\\n", "bahasa-news local news. \n", "bahasa-wiki MS wikipedia. \n", "bahasa-wiki-news MS wikipedia + local news. \n", "redape-community Mirror for https://github.com/redapesolutions/... \n", "dump-combined Academia + News + IIUM + Parliament + Watpadd ... \n", "\n", " Command \n", "bahasa-news [./lmplz --text text.txt --arpa out.arpa -o 3 ... \n", "bahasa-wiki [./lmplz --text text.txt --arpa out.arpa -o 3 ... \n", "bahasa-wiki-news [./lmplz --text text.txt --arpa out.arpa -o 3 ... \n", "redape-community [./lmplz --text text.txt --arpa out.arpa -o 4 ... \n", "dump-combined [./lmplz --text text.txt --arpa out.arpa -o 3 ... " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.language_model.available_kenlm()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load KenLM model\n", "\n", "```python\n", "def kenlm(model: str = 'dump-combined', **kwargs):\n", " \"\"\"\n", " Load KenLM language model.\n", "\n", " Parameters\n", " ----------\n", " model: str, optional (default='dump-combined')\n", " Check available models at `malaya.language_model.available_models()`.\n", " Returns\n", " -------\n", " result: kenlm.Model class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "model = malaya.language_model.kenlm()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-11.912322044372559" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.score('saya suke awak')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-6.80517053604126" ] }, "execution_count": 5, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "model.score('saya suka awak')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-5.256608009338379" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.score('najib razak')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-10.580080032348633" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.score('najib comel')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Build custom Language Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Build KenLM from source,\n", "\n", "```bash\n", "wget -O - https://kheafield.com/code/kenlm.tar.gz |tar xz\n", "mkdir kenlm/build\n", "cd kenlm/build\n", "cmake ..\n", "make -j2\n", "```\n", "\n", "2. Prepare newlines text file. Feel free to use some from https://github.com/huseinzol05/Malay-Dataset/tree/master/dumping,\n", "\n", "```bash\n", "kenlm/build/bin/lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1\n", "kenlm/build/bin/build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm\n", "```\n", "\n", "3. 
Once you have out.trie.klm, you can load to scorer interface,\n", "\n", "```python\n", "import kenlm\n", "model = kenlm.Model('out.trie.klm')\n", "```" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }