{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Language Detection word level using rules based" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/language-detection-words](https://github.com/huseinzol05/Malaya/tree/master/example/language-detection-words).\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.87 s, sys: 3.85 s, total: 6.73 s\n", "Wall time: 1.97 s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n", "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n" ] } ], "source": [ "%%time\n", "import malaya" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install pyenchant\n", "\n", "pyenchant is an optional, full installation steps at https://pyenchant.github.io/pyenchant/install.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load model\n", "\n", "```python\n", "def substring_rules(model, **kwargs):\n", " \"\"\"\n", " detect EN, MS and OTHER languages in a string.\n", "\n", " EN words detection are using `pyenchant` from https://pyenchant.github.io/pyenchant/ and\n", " user language detection model.\n", "\n", " MS words detection are using `malaya.dictionary.is_malay` and\n", " user language detection model.\n", "\n", " OTHER words detection are using any language detection classification model, such as,\n", " `malaya.language_detection.fasttext`.\n", "\n", " Parameters\n", " ----------\n", " model : Callable\n", " Callable model, must have `predict` method.\n", "\n", " Returns\n", " -------\n", " result : malaya.model.rules.LanguageDict class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.\n" ] } ], "source": [ "fasttext = malaya.language_detection.fasttext()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "model = malaya.language_detection.substring_rules(model = fasttext)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Predict\n", "\n", "```python\n", "def predict(\n", " self,\n", " words: List[str],\n", " acceptable_ms_label: List[str] = ['malay', 'ind'],\n", " acceptable_en_label: List[str] = ['eng', 'manglish'],\n", " use_is_malay: bool = True,\n", "):\n", " \"\"\"\n", " Predict [EN, MS, OTHERS, CAPITAL, NOT_LANG] on word level. 
\n", " This method assumed the string already tokenized.\n", "\n", " Parameters\n", " ----------\n", " words: List[str]\n", " acceptable_ms_label: List[str], optional (default = ['malay', 'ind'])\n", " accept labels from language detection model to assume a word is `MS`.\n", " acceptable_en_label: List[str], optional (default = ['eng', 'manglish'])\n", " accept labels from language detection model to assume a word is `EN`.\n", " use_is_malay: bool, optional (default=True)\n", " if True`, will predict MS word using `malaya.dictionary.is_malay`, \n", " else use language detection model.\n", "\n", " Returns\n", " -------\n", " result: List[str]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('saya', 'MS'),\n", " ('suka', 'MS'),\n", " ('chicken', 'EN'),\n", " ('and', 'EN'),\n", " ('fish', 'EN'),\n", " ('pda', 'MS'),\n", " ('hari', 'MS'),\n", " ('isnin', 'MS')]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string = 'saya suka chicken and fish pda hari isnin'\n", "splitted = string.split()\n", "list(zip(splitted, model.predict(splitted)))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('saya', 'MS'),\n", " ('suka', 'MS'),\n", " ('chicken', 'EN'),\n", " ('and', 'EN'),\n", " ('fish', 'EN'),\n", " ('pda', 'MS'),\n", " ('hari', 'MS'),\n", " ('isnin', 'MS'),\n", " (',', 'NOT_LANG'),\n", " ('tarikh', 'MS'),\n", " ('22', 'NOT_LANG'),\n", " ('mei', 'MS')]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string = 'saya suka chicken and fish pda hari isnin , tarikh 22 mei'\n", "splitted = string.split()\n", "list(zip(splitted, model.predict(splitted)))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('saya', 'MS'),\n", " ('suka', 'MS'),\n", " ('chicken', 'EN'),\n", " ('🐔', 'NOT_LANG'),\n", " ('and', 'EN'),\n", " ('fish', 'EN'),\n", " ('pda', 'MS'),\n", " ('hari', 'MS'),\n", " ('isnin', 'MS'),\n", " (',', 'NOT_LANG'),\n", " ('tarikh', 'MS'),\n", " ('22', 'NOT_LANG'),\n", " ('mei', 'MS')]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string = 'saya suka chicken 🐔 and fish pda hari isnin , tarikh 22 mei'\n", "splitted = string.split()\n", "list(zip(splitted, model.predict(splitted)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use malaya.preprocessing.Tokenizer\n", "\n", "To get better word tokens!" 
, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use malaya.preprocessing.Tokenizer\n", "\n", "Use it to get better word tokens." ] },
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "string = 'Terminal 1 KKIA dilengkapi kemudahan 64 kaunter daftar masuk, 12 aero bridge selain mampu menampung 3,200 penumpang dalam satu masa.'" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Terminal',\n", " '1',\n", " 'KKIA',\n", " 'dilengkapi',\n", " 'kemudahan',\n", " '64',\n", " 'kaunter',\n", " 'daftar',\n", " 'masuk',\n", " ',',\n", " '12',\n", " 'aero',\n", " 'bridge',\n", " 'selain',\n", " 'mampu',\n", " 'menampung',\n", " '3,200',\n", " 'penumpang',\n", " 'dalam',\n", " 'satu',\n", " 'masa',\n", " '.']" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenizer = malaya.preprocessing.Tokenizer()\n", "tokenized = tokenizer.tokenize(string)\n", "tokenized" ] },
{ "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[('Terminal', 'MS'),\n", " ('1', 'NOT_LANG'),\n", " ('KKIA', 'CAPITAL'),\n", " ('dilengkapi', 'MS'),\n", " ('kemudahan', 'MS'),\n", " ('64', 'NOT_LANG'),\n", " ('kaunter', 'MS'),\n", " ('daftar', 'MS'),\n", " ('masuk', 'MS'),\n", " (',', 'NOT_LANG'),\n", " ('12', 'NOT_LANG'),\n", " ('aero', 'OTHERS'),\n", " ('bridge', 'EN'),\n", " ('selain', 'MS'),\n", " ('mampu', 'MS'),\n", " ('menampung', 'MS'),\n", " ('3,200', 'NOT_LANG'),\n", " ('penumpang', 'MS'),\n", " ('dalam', 'MS'),\n", " ('satu', 'MS'),\n", " ('masa', 'MS'),\n", " ('.', 'NOT_LANG')]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(zip(tokenized, model.predict(tokenized)))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "If the string is not properly tokenized, some words fall back to `OTHERS`:" ] },
{ "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[('Terminal', 'MS'),\n", " ('1', 'NOT_LANG'),\n", " ('KKIA', 'CAPITAL'),\n", " ('dilengkapi', 'MS'),\n", " ('kemudahan', 'MS'),\n", " ('64', 'NOT_LANG'),\n", " ('kaunter', 'MS'),\n", " ('daftar', 'MS'),\n", " ('masuk,', 'OTHERS'),\n", " ('12', 'NOT_LANG'),\n", " ('aero', 'OTHERS'),\n", " ('bridge', 'EN'),\n", " ('selain', 'MS'),\n", " ('mampu', 'MS'),\n", " ('menampung', 'MS'),\n", " ('3,200', 'NOT_LANG'),\n", " ('penumpang', 'MS'),\n", " ('dalam', 'MS'),\n", " ('satu', 'MS'),\n", " ('masa.', 'OTHERS')]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "splitted = string.split()\n", "list(zip(splitted, model.predict(splitted)))" ] }
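, { "cell_type": "markdown", "metadata": {}, "source": [ "Word-level labels are often easier to consume as contiguous spans. Below is a minimal sketch of that post-processing, reusing `tokenized` from above; the `group_language_spans` helper is hypothetical and not part of Malaya, it simply merges consecutive tokens that share the same predicted label." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from itertools import groupby\n", "\n", "def group_language_spans(tokens, labels):\n", "    # hypothetical helper, not part of Malaya: merge consecutive\n", "    # tokens that share the same predicted label into one span\n", "    spans = []\n", "    for label, group in groupby(zip(tokens, labels), key = lambda pair: pair[1]):\n", "        spans.append((label, ' '.join(token for token, _ in group)))\n", "    return spans\n", "\n", "group_language_spans(tokenized, model.predict(tokenized))" ] }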
smart decision 👍\"" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('just', 'EN'),\n", " ('attended', 'EN'),\n", " ('my', 'EN'),\n", " (\"cousin's\", 'EN'),\n", " ('wedding', 'EN'),\n", " ('.', 'NOT_LANG'),\n", " ('pelik', 'MS'),\n", " ('jugak', 'MS'),\n", " ('dia', 'MS'),\n", " ('buat', 'MS'),\n", " ('majlis', 'MS'),\n", " ('biasa2', 'OTHERS'),\n", " ('je', 'MS'),\n", " ('sebab', 'MS'),\n", " ('her', 'EN'),\n", " ('lifestyle', 'EN'),\n", " ('looks', 'EN'),\n", " ('lavish', 'EN'),\n", " ('.', 'NOT_LANG'),\n", " ('then', 'EN'),\n", " ('i', 'MS'),\n", " ('found', 'EN'),\n", " ('out', 'EN'),\n", " (\"they'\", 'OTHERS'),\n", " ('re', 'EN'),\n", " ('going', 'EN'),\n", " ('on', 'EN'),\n", " ('a', 'EN'),\n", " ('3', 'NOT_LANG'),\n", " ('weeks', 'EN'),\n", " ('honeymoon', 'EN'),\n", " ('.', 'NOT_LANG'),\n", " ('smart', 'EN'),\n", " ('decision', 'EN'),\n", " ('👍', 'NOT_LANG')]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenized = tokenizer.tokenize(s)\n", "list(zip(tokenized, model.predict(tokenized)))" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('Hello', 'EN'),\n", " ('gais', 'MS'),\n", " (',', 'NOT_LANG'),\n", " ('boleh', 'MS'),\n", " ('tolong', 'MS'),\n", " ('recommend', 'EN'),\n", " ('bengkel', 'MS'),\n", " ('ketuk', 'MS'),\n", " ('yang', 'MS'),\n", " ('okay', 'EN'),\n", " ('near', 'EN'),\n", " ('Wangsa', 'MS'),\n", " ('Maju', 'MS'),\n", " ('/', 'NOT_LANG'),\n", " ('nearby', 'EN'),\n", " ('?', 'NOT_LANG'),\n", " ('Kereta', 'MS'),\n", " ('bf', 'MS'),\n", " ('i', 'MS'),\n", " ('pulak', 'MS'),\n", " ('kepek', 'MS'),\n", " ('langgar', 'MS'),\n", " ('dinding', 'MS'),\n", " ('hahahha', 'NOT_LANG')]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = 'Hello gais, boleh tolong recommend bengkel ketuk yang okay near Wangsa Maju / nearby? 
Kereta bf i pulak kepek langgar dinding hahahha'\n", "tokenized = tokenizer.tokenize(s)\n", "list(zip(tokenized, model.predict(tokenized)))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('Me', 'EN'),\n", " ('after', 'EN'),\n", " ('seeing', 'EN'),\n", " ('this', 'EN'),\n", " ('video', 'MS'),\n", " (':', 'NOT_LANG'),\n", " ('mm', 'EN'),\n", " ('dapnya', 'MS'),\n", " ('burger', 'MS'),\n", " ('benjo', 'OTHERS'),\n", " ('extra', 'EN'),\n", " ('mayo', 'EN')]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = 'Me after seeing this video: mm dapnya burger benjo extra mayo'\n", "tokenized = tokenizer.tokenize(s)\n", "list(zip(tokenized, model.predict(tokenized)))" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('Hi', 'EN'),\n", " ('guys', 'EN'),\n", " ('!', 'NOT_LANG'),\n", " ('I', 'CAPITAL'),\n", " ('noticed', 'EN'),\n", " ('semalam', 'MS'),\n", " ('&', 'NOT_LANG'),\n", " ('harini', 'MS'),\n", " ('dah', 'MS'),\n", " ('ramai', 'MS'),\n", " ('yang', 'MS'),\n", " ('dapat', 'MS'),\n", " ('cookies', 'EN'),\n", " ('ni', 'MS'),\n", " ('kan', 'MS'),\n", " ('.', 'NOT_LANG'),\n", " ('So', 'MS'),\n", " ('harini', 'MS'),\n", " ('i', 'MS'),\n", " ('nak', 'MS'),\n", " ('share', 'EN'),\n", " ('some', 'EN'),\n", " ('post', 'MS'),\n", " ('mortem', 'MS'),\n", " ('of', 'EN'),\n", " ('our', 'EN'),\n", " ('first', 'EN'),\n", " ('batch', 'EN'),\n", " (':', 'NOT_LANG')]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = 'Hi guys! I noticed semalam & harini dah ramai yang dapat cookies ni kan. So harini i nak share some post mortem of our first batch:'\n", "tokenized = tokenizer.tokenize(s)\n", "list(zip(tokenized, model.predict(tokenized)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }