{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Language Detection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/language-detection](https://github.com/huseinzol05/Malaya/tree/master/example/language-detection).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module trained on both standard and local (included social media) language structures, so it is save to use for both.\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 5.72 s, sys: 1.14 s, total: 6.87 s\n", "Wall time: 8.29 s\n" ] } ], "source": [ "%%time\n", "import malaya\n", "import fasttext" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Models accuracy\n", "\n", "We use `sklearn.metrics.classification_report` for accuracy reporting, check at https://malaya.readthedocs.io/en/latest/models-accuracy.html#language-detection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### labels supported\n", "\n", "Default labels for language detection module." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['eng', 'ind', 'malay', 'manglish', 'other', 'rojak']" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.language_detection.label" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "chinese_text = '今天是6月18号,也是Muiriel的生日!'\n", "english_text = 'i totally love it man'\n", "indon_text = 'menjabat saleh perombakan menjabat periode komisi energi fraksi partai pengurus partai periode periode partai terpilih periode menjabat komisi perdagangan investasi persatuan periode'\n", "malay_text = 'beliau berkata program Inisitif Peduli Rakyat (IPR) yang diperkenalkan oleh kerajaan negeri Selangor lebih besar sumbangannya'\n", "socialmedia_malay_text = 'nti aku tengok dulu tiket dari kl pukul berapa ada nahh'\n", "socialmedia_indon_text = 'saking kangen papanya pas vc anakku nangis'\n", "rojak_text = 'jadi aku tadi bikin ini gengs dan dijual haha salad only k dan haha drinks only k'\n", "manglish_text = 'power lah even shopback come to edmw riao'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Fast-text model\n", "\n", "Make sure fast-text already installed, if not, simply,\n", "\n", "```bash\n", "pip install fasttext\n", "```\n", "\n", "```python\n", "def fasttext(quantized: bool = True, **kwargs):\n", "\n", " \"\"\"\n", " Load Fasttext language detection model.\n", " Original size is 353MB, Quantized size 31.1MB.\n", " \n", " Parameters\n", " ----------\n", " quantized: bool, optional (default=True)\n", " if True, load quantized fasttext model. Else, load original fasttext model.\n", "\n", " Returns\n", " -------\n", " result : malaya.model.ml.LanguageDetection class\n", " \"\"\"\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, I am going to compare with pretrained fasttext from Facebook. https://fasttext.cc/docs/en/language-identification.html\n", "\n", "Simply download pretrained model,\n", "\n", "```bash\n", "wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz\n", "```" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "\n" ] } ], "source": [ "model = fasttext.load_model('lid.176.ftz')\n", "fast_text = malaya.language_detection.fasttext()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "([['__label__km']], array([[0.99841499]]))" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.predict(['តើប្រព័ន្ធប្រតិបត្តិការណាដែលត្រូវគ្នាជាមួយកម្មវិធីធនាគារអេប៊ីអេ។'])" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['other']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fast_text.predict(['តើប្រព័ន្ធប្រតិបត្តិការណាដែលត្រូវគ្នាជាមួយកម្មវិធីធនាគារអេប៊ីអេ។'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Language detection in Malaya is not trying to tackle possible languages in this world, just towards to hyperlocal language.**" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "([['__label__id']], array([[0.6334154]]))" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.predict(['suka makan ayam dan daging'])" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'eng': 0.0,\n", " 'ind': 0.0,\n", " 'malay': 0.8817721009254456,\n", " 'manglish': 0.0,\n", " 'other': 0.0,\n", " 'rojak': 0.0}]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fast_text.predict_proba(['suka makan ayam dan daging'])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(('__label__ms',), array([0.57101035]))" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.predict(malay_text)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'eng': 0.0,\n", " 'ind': 0.0,\n", " 'malay': 0.9999504089355469,\n", " 'manglish': 0.0,\n", " 'other': 0.0,\n", " 'rojak': 0.0}]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fast_text.predict_proba([malay_text])" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(('__label__id',), array([0.7870034]))" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.predict(socialmedia_malay_text)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'eng': 0.0,\n", " 'ind': 0.0,\n", " 'malay': 0.9996305704116821,\n", " 'manglish': 0.0,\n", " 'other': 0.0,\n", " 'rojak': 0.0}]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fast_text.predict_proba([socialmedia_malay_text])" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(('__label__fr',), array([0.2912012]))" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.predict(socialmedia_indon_text)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'eng': 0.0,\n", " 'ind': 1.0000293254852295,\n", " 'malay': 0.0,\n", " 'manglish': 0.0,\n", " 'other': 0.0,\n", " 'rojak': 0.0}]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fast_text.predict_proba([socialmedia_indon_text])" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(('__label__id',), array([0.87948251]))" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.predict(rojak_text)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'eng': 0.0,\n", " 'ind': 0.0,\n", " 'malay': 0.0,\n", " 'manglish': 0.0,\n", " 'other': 0.0,\n", " 'rojak': 0.9994134306907654}]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fast_text.predict_proba([rojak_text])" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(('__label__en',), array([0.89707506]))" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.predict(manglish_text)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'eng': 0.0,\n", " 'ind': 0.0,\n", " 'malay': 0.0,\n", " 'manglish': 1.00004243850708,\n", " 'other': 0.0,\n", " 'rojak': 0.0}]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fast_text.predict_proba([manglish_text])" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(('__label__zh',), array([0.97311586]))" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.predict(chinese_text)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "[{'eng': 0.0,\n", " 'ind': 0.0,\n", " 'malay': 0.0,\n", " 'manglish': 0.0,\n", " 'other': 0.9921814203262329,\n", " 'rojak': 0.0}]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fast_text.predict_proba([chinese_text])" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'eng': 0.0,\n", " 'ind': 1.0000287294387817,\n", " 'malay': 0.0,\n", " 'manglish': 0.0,\n", " 'other': 0.0,\n", " 'rojak': 0.0},\n", " {'eng': 0.0,\n", " 'ind': 0.0,\n", " 'malay': 0.9999504089355469,\n", " 'manglish': 0.0,\n", " 'other': 0.0,\n", " 'rojak': 0.0}]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fast_text.predict_proba([indon_text,malay_text])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Deep learning model\n", "\n", "```python\n", "def deep_model(quantized: bool = False, **kwargs):\n", " \"\"\"\n", " Load deep learning language detection model.\n", " Original size is 51.2MB, Quantized size 12.8MB.\n", "\n", " quantized : bool, optional (default=False)\n", " if True, will load 8-bit quantized model. \n", " Quantized model not necessary faster, totally depends on the machine.\n", "\n", " Returns\n", " -------\n", " result : malaya.model.tf.DeepLang class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "deep = malaya.language_detection.deep_model()\n", "quantized_deep = malaya.language_detection.deep_model(quantized = True)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'eng': 3.6145184e-06,\n", " 'ind': 0.9998913,\n", " 'malay': 5.4685424e-05,\n", " 'manglish': 5.768742e-09,\n", " 'other': 5.8103424e-06,\n", " 'rojak': 4.4987162e-05}]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "deep.predict_proba([indon_text])" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'eng': 3.6145184e-06,\n", " 'ind': 0.9998913,\n", " 'malay': 5.4685424e-05,\n", " 'manglish': 5.768742e-09,\n", " 'other': 5.8103424e-06,\n", " 'rojak': 4.4987162e-05}]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "quantized_deep.predict_proba([indon_text])" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'eng': 9.500837e-11,\n", " 'ind': 0.0004703698,\n", " 'malay': 0.9991295,\n", " 'manglish': 1.602048e-13,\n", " 'other': 1.9133091e-07,\n", " 'rojak': 0.0004000054}]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "deep.predict_proba([malay_text])" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'eng': 9.500829e-11,\n", " 'ind': 0.00047036994,\n", " 'malay': 0.99912965,\n", " 'manglish': 1.6020499e-13,\n", " 'other': 1.9133095e-07,\n", " 'rojak': 0.00040000546}]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "quantized_deep.predict_proba([malay_text])" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'eng': 3.6145207e-06,\n", " 'ind': 0.9998909,\n", " 'malay': 5.468535e-05,\n", " 'manglish': 5.7687397e-09,\n", " 'other': 5.8103406e-06,\n", " 'rojak': 4.4987148e-05},\n", " {'eng': 9.500837e-11,\n", " 'ind': 0.0004703698,\n", " 'malay': 0.9991295,\n", " 'manglish': 1.602048e-13,\n", " 'other': 1.9133091e-07,\n", " 'rojak': 0.0004000056}]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "deep.predict_proba([indon_text,malay_text])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'eng': 3.614522e-06,\n", " 'ind': 0.9998913,\n", " 'malay': 5.4685373e-05,\n", " 'manglish': 5.768742e-09,\n", " 'other': 5.8103424e-06,\n", " 'rojak': 4.4987162e-05},\n", " {'eng': 9.500829e-11,\n", " 'ind': 0.00047036994,\n", " 'malay': 0.99912965,\n", " 'manglish': 1.6020499e-13,\n", " 'other': 1.9133095e-07,\n", " 'rojak': 0.0004000057}]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "quantized_deep.predict_proba([indon_text,malay_text])" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'eng': 1.4520887e-09,\n", " 'ind': 0.0064318455,\n", " 'malay': 0.9824693,\n", " 'manglish': 2.1923141e-13,\n", " 'other': 1.06363805e-05,\n", " 'rojak': 0.0110881105}]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "deep.predict_proba([socialmedia_malay_text])" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'eng': 1.4520903e-09,\n", " 'ind': 0.006431847,\n", " 'malay': 0.98246956,\n", " 'manglish': 2.1923168e-13,\n", " 'other': 1.0636383e-05,\n", " 'rojak': 0.011088113}]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "quantized_deep.predict_proba([socialmedia_malay_text])" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'eng': 4.0632068e-07,\n", " 'ind': 0.9999995,\n", " 'malay': 6.871639e-10,\n", " 'manglish': 7.4285925e-11,\n", " 'other': 1.5928721e-07,\n", " 'rojak': 4.892652e-10}]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "deep.predict_proba([socialmedia_indon_text])" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'eng': 0.0040922514,\n", " 'ind': 0.02200061,\n", " 'malay': 0.0027574676,\n", " 'manglish': 9.336553e-06,\n", " 'other': 0.00023811469,\n", " 'rojak': 0.97090226},\n", " {'eng': 9.500837e-11,\n", " 'ind': 0.0004703698,\n", " 'malay': 0.9991295,\n", " 'manglish': 1.602048e-13,\n", " 'other': 1.9133091e-07,\n", " 'rojak': 0.0004000056}]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "deep.predict_proba([rojak_text, malay_text])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 2 }