{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Relevancy Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/relevancy](https://github.com/huseinzol05/Malaya/tree/master/example/relevancy).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is trained only on standard language structure, so it is not safe to use it for local language structure.\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import logging\n", "\n", "logging.basicConfig(level=logging.INFO)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:numexpr.utils:NumExpr defaulting to 8 threads.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 5.8 s, sys: 1.1 s, total: 6.9 s\n", "Wall time: 7.91 s\n" ] } ], "source": [ "%%time\n", "import malaya" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### labels supported\n", "\n", "Default labels for relevancy module." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['not relevant', 'relevant']" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.relevancy.label" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Explanation\n", "\n", "Positive relevancy: The article or piece of text is relevant, tendency is high to become not a fake news. Can be a positive or negative sentiment.\n", "\n", "Negative relevancy: The article or piece of text is not relevant, tendency is high to become a fake news. Can be a positive or negative sentiment.\n", "\n", "**Right now relevancy module only support deep learning model**." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "negative_text = 'Roti Massimo Mengandungi DNA Babi. Roti produk Massimo keluaran Syarikat The Italian Baker mengandungi DNA babi. Para pengguna dinasihatkan supaya tidak memakan produk massimo. Terdapat pelbagai produk roti keluaran syarikat lain yang boleh dimakan dan halal. Mari kita sebarkan berita ini supaya semua rakyat Malaysia sedar dengan apa yang mereka makna setiap hari. 
Roti tidak halal ada DNA babi jangan makan ok.'\n", "positive_text = 'Jabatan Kemajuan Islam Malaysia memperjelaskan dakwaan sebuah mesej yang dikitar semula, yang mendakwa kononnya kod E dikaitkan dengan kandungan lemak babi sepertimana yang tular di media sosial. . Tular: November 2017 . Tular: Mei 2014 JAKIM ingin memaklumkan kepada masyarakat berhubung maklumat yang telah disebarkan secara meluas khasnya melalui media sosial berhubung kod E yang dikaitkan mempunyai lemak babi. Untuk makluman, KOD E ialah kod untuk bahan tambah (aditif) dan ianya selalu digunakan pada label makanan di negara Kesatuan Eropah. Menurut JAKIM, tidak semua nombor E yang digunakan untuk membuat sesuatu produk makanan berasaskan dari sumber yang haram. Sehubungan itu, sekiranya sesuatu produk merupakan produk tempatan dan mendapat sijil Pengesahan Halal Malaysia, maka ia boleh digunakan tanpa was-was sekalipun mempunyai kod E-kod. Tetapi sekiranya produk tersebut bukan produk tempatan serta tidak mendapat sijil pengesahan halal Malaysia walaupun menggunakan e-kod yang sama, pengguna dinasihatkan agar berhati-hati dalam memilih produk tersebut.'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available Transformer models" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.relevancy:trained on 90% dataset, tested on another 10% test set, dataset at https://github.com/huseinzol05/malaya/blob/master/session/relevancy/download-data.ipynb\n" ] }, { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>Size (MB)</th><th>Quantized Size (MB)</th><th>macro precision</th><th>macro recall</th><th>macro f1-score</th><th>max length</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>bert</th><td>425.6</td><td>111.00</td><td>0.89320</td><td>0.89195</td><td>0.89256</td><td>512.0</td></tr>\n",
"<tr><th>tiny-bert</th><td>57.4</td><td>15.40</td><td>0.87179</td><td>0.86324</td><td>0.86695</td><td>512.0</td></tr>\n",
"<tr><th>albert</th><td>48.6</td><td>12.80</td><td>0.89798</td><td>0.86008</td><td>0.87209</td><td>512.0</td></tr>\n",
"<tr><th>tiny-albert</th><td>22.4</td><td>5.98</td><td>0.82157</td><td>0.83410</td><td>0.82416</td><td>512.0</td></tr>\n",
"<tr><th>xlnet</th><td>446.6</td><td>118.00</td><td>0.92707</td><td>0.92103</td><td>0.92381</td><td>512.0</td></tr>\n",
"<tr><th>alxlnet</th><td>46.8</td><td>13.30</td><td>0.91135</td><td>0.90446</td><td>0.90758</td><td>512.0</td></tr>\n",
"<tr><th>bigbird</th><td>458.0</td><td>116.00</td><td>0.88093</td><td>0.86832</td><td>0.87352</td><td>1024.0</td></tr>\n",
"<tr><th>tiny-bigbird</th><td>65.0</td><td>16.90</td><td>0.86558</td><td>0.85871</td><td>0.86176</td><td>1024.0</td></tr>\n",
"<tr><th>fastformer</th><td>458.0</td><td>116.00</td><td>0.92387</td><td>0.91064</td><td>0.91616</td><td>2048.0</td></tr>\n",
"<tr><th>tiny-fastformer</th><td>77.3</td><td>19.70</td><td>0.85655</td><td>0.86337</td><td>0.85925</td><td>2048.0</td></tr>\n",
"</tbody>\n",
"</table>\n",
"
" ], "text/plain": [ " Size (MB) Quantized Size (MB) macro precision \\\n", "bert 425.6 111.00 0.89320 \n", "tiny-bert 57.4 15.40 0.87179 \n", "albert 48.6 12.80 0.89798 \n", "tiny-albert 22.4 5.98 0.82157 \n", "xlnet 446.6 118.00 0.92707 \n", "alxlnet 46.8 13.30 0.91135 \n", "bigbird 458.0 116.00 0.88093 \n", "tiny-bigbird 65.0 16.90 0.86558 \n", "fastformer 458.0 116.00 0.92387 \n", "tiny-fastformer 77.3 19.70 0.85655 \n", "\n", " macro recall macro f1-score max length \n", "bert 0.89195 0.89256 512.0 \n", "tiny-bert 0.86324 0.86695 512.0 \n", "albert 0.86008 0.87209 512.0 \n", "tiny-albert 0.83410 0.82416 512.0 \n", "xlnet 0.92103 0.92381 512.0 \n", "alxlnet 0.90446 0.90758 512.0 \n", "bigbird 0.86832 0.87352 1024.0 \n", "tiny-bigbird 0.85871 0.86176 1024.0 \n", "fastformer 0.91064 0.91616 2048.0 \n", "tiny-fastformer 0.86337 0.85925 2048.0 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.relevancy.available_transformer()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Transformer model\n", "\n", "```python\n", "def transformer(model: str = 'xlnet', quantized: bool = False, **kwargs):\n", " \"\"\"\n", " Load Transformer relevancy model.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='bert')\n", " Model architecture supported. 
Allowed values:\n", "\n", " * ``'bert'`` - Google BERT BASE parameters.\n", " * ``'tiny-bert'`` - Google BERT TINY parameters.\n", " * ``'albert'`` - Google ALBERT BASE parameters.\n", " * ``'tiny-albert'`` - Google ALBERT TINY parameters.\n", " * ``'xlnet'`` - Google XLNET BASE parameters.\n", " * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.\n", " * ``'bigbird'`` - Google BigBird BASE parameters.\n", " * ``'tiny-bigbird'`` - Malaya BigBird BASE parameters.\n", " * ``'fastformer'`` - FastFormer BASE parameters.\n", " * ``'tiny-fastformer'`` - FastFormer TINY parameters.\n", "\n", " quantized : bool, optional (default=False)\n", " If True, will load the 8-bit quantized model.\n", " A quantized model is not necessarily faster; it depends on the machine.\n", "\n", " Returns\n", " -------\n", " result: model\n", " List of model classes:\n", "\n", " * if `bert` in model, will return `malaya.model.bert.MulticlassBERT`.\n", " * if `xlnet` in model, will return `malaya.model.xlnet.MulticlassXLNET`.\n", " * if `bigbird` in model, will return `malaya.model.xlnet.MulticlassBigBird`.\n", " * if `fastformer` in model, will return `malaya.model.fastformer.MulticlassFastFormer`.\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:112: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:112: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:114: The name tf.GraphDef is deprecated. 
Please use tf.compat.v1.GraphDef instead.\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:114: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:107: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:107: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.\n", "\n" ] } ], "source": [ "model = malaya.relevancy.transformer(model = 'tiny-bigbird')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Quantized model\n", "\n", "To load the 8-bit quantized model, pass `quantized = True`; the default is `False`.\n", "\n", "Expect a slight accuracy drop from the quantized model. It is also not necessarily faster than the normal 32-bit float model; that depends on the machine."
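, "\n", "\n", "Because speed depends on the machine, it is worth timing both variants on your own hardware. The block below is a minimal sketch and not part of Malaya: `time_predict` and `dummy_predict` are hypothetical helper names introduced here, and the stand-in predictor exists only so the sketch runs without Malaya installed.

```python
import time
from typing import Callable, List


def time_predict(predict_fn: Callable[[List[str]], list],
                 strings: List[str],
                 repeats: int = 3) -> float:
    '''Return the best wall-clock time in seconds over `repeats` calls.'''
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        predict_fn(strings)  # the prediction itself is discarded
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical stand-in predictor, used only to make the sketch runnable.
def dummy_predict(strings: List[str]) -> List[str]:
    return ['relevant' for _ in strings]


print('best of 3: %.6f s' % time_predict(dummy_predict, ['a', 'b']))
```

To time a real model, pass a bound method such as `model.predict` in place of `dummy_predict`. For a fair comparison, load the same architecture twice, once with `quantized = False` and once with `quantized = True`; `model` and `quantized_model` in this notebook are different architectures (`tiny-bigbird` and `alxlnet`)."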
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "quantized_model = malaya.relevancy.transformer(model = 'alxlnet', quantized = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Predict batch of strings\n", "\n", "```python\n", "def predict(self, strings: List[str]):\n", " \"\"\"\n", " classify list of strings.\n", "\n", " Parameters\n", " ----------\n", " strings: List[str]\n", "\n", " Returns\n", " -------\n", " result: List[str]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.04 s, sys: 520 ms, total: 2.56 s\n", "Wall time: 1.23 s\n" ] }, { "data": { "text/plain": [ "['not relevant', 'relevant']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "model.predict([negative_text, positive_text])" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 5.08 s, sys: 823 ms, total: 5.91 s\n", "Wall time: 2.96 s\n" ] }, { "data": { "text/plain": [ "['not relevant', 'relevant']" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "quantized_model.predict([negative_text, positive_text])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Predict batch of strings with probability\n", "\n", "```python\n", "def predict_proba(self, strings: List[str]):\n", " \"\"\"\n", " classify list of strings and return probability.\n", "\n", " Parameters\n", " ----------\n", " strings : List[str]\n", "\n", " Returns\n", " -------\n", " result: List[dict[str, float]]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.46 s, sys: 403 ms, total: 1.86 
s\n", "Wall time: 319 ms\n" ] }, { "data": { "text/plain": [ "[{'not relevant': 0.9896912, 'relevant': 0.010308762},\n", " {'not relevant': 0.007830339, 'relevant': 0.9921697}]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "model.predict_proba([negative_text, positive_text])" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.98 s, sys: 386 ms, total: 3.37 s\n", "Wall time: 583 ms\n" ] }, { "data": { "text/plain": [ "[{'not relevant': 0.9999988, 'relevant': 1.2511766e-06},\n", " {'not relevant': 9.157779e-06, 'relevant': 0.9999908}]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "quantized_model.predict_proba([negative_text, positive_text])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Open relevancy visualization dashboard\n", "\n", "By default, calling `predict_words` opens a browser with the visualization dashboard; pass `visualization=False` to disable it.\n", "\n", "```python\n", "def predict_words(\n", " self,\n", " string: str,\n", " method: str = 'last',\n", " bins_size: float = 0.05,\n", " visualization: bool = True,\n", "):\n", " \"\"\"\n", " classify words.\n", "\n", " Parameters\n", " ----------\n", " string : str\n", " method : str, optional (default='last')\n", " Attention layer supported. 
Allowed values:\n", "\n", " * ``'last'`` - attention from last layer.\n", " * ``'first'`` - attention from first layer.\n", " * ``'mean'`` - average attentions from all layers.\n", " bins_size: float, optional (default=0.05)\n", " default bins size for word distribution histogram.\n", " visualization: bool, optional (default=True)\n", " If True, it will open the visualization dashboard.\n", "\n", " Returns\n", " -------\n", " dictionary: results\n", " \"\"\"\n", "```\n", "\n", "**This method is not available for BigBird models**." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "quantized_model.predict_words(negative_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Vectorize\n", "\n", "Let say you want to visualize sentence / word level in lower dimension, you can use `model.vectorize`,\n", "\n", "```python\n", "def vectorize(self, strings: List[str], method: str = 'first'):\n", " \"\"\"\n", " vectorize list of strings.\n", "\n", " Parameters\n", " ----------\n", " strings: List[str]\n", " method : str, optional (default='first')\n", " Vectorization layer supported. Allowed values:\n", "\n", " * ``'last'`` - vector from last sequence.\n", " * ``'first'`` - vector from first sequence.\n", " * ``'mean'`` - average vectors from all sequences.\n", " * ``'word'`` - average vectors based on tokens.\n", "\n", " Returns\n", " -------\n", " result: np.array\n", " \"\"\"\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Sentence level" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "texts = [negative_text, positive_text]\n", "r = model.vectorize(texts, method = 'first')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2, 2)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.manifold import TSNE\n", "import matplotlib.pyplot as plt\n", "\n", "tsne = TSNE().fit_transform(r)\n", "tsne.shape" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize = (7, 7))\n", "plt.scatter(tsne[:, 0], tsne[:, 1])\n", "labels = texts\n", "for label, x, y in zip(\n", " labels, tsne[:, 0], tsne[:, 1]\n", "):\n", " label = (\n", " '%s, %.3f' % (label[0], label[1])\n", " if isinstance(label, list)\n", " else label\n", " )\n", " plt.annotate(\n", " label,\n", " xy = (x, y),\n", " xytext = (0, 0),\n", " textcoords = 'offset points',\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Word level" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "r = quantized_model.vectorize(texts, method = 'word')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "x, y = [], []\n", "for row in r:\n", " x.extend([i[0] for i in row])\n", " y.extend([i[1] for i in row])" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "(211, 2)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tsne = TSNE().fit_transform(y)\n", "tsne.shape" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize = (7, 7))\n", "plt.scatter(tsne[:, 0], tsne[:, 1])\n", "labels = x\n", "for label, x, y in zip(\n", " labels, tsne[:, 0], tsne[:, 1]\n", "):\n", " label = (\n", " '%s, %.3f' % (label[0], label[1])\n", " if isinstance(label, list)\n", " else label\n", " )\n", " plt.annotate(\n", " label,\n", " xy = (x, y),\n", " xytext = (0, 0),\n", " textcoords = 'offset points',\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pretty good, the model able to know cluster bottom left as positive relevancy." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Stacking models\n", "\n", "More information, you can read at [https://malaya.readthedocs.io/en/latest/Stack.html](https://malaya.readthedocs.io/en/latest/Stack.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/albert/tokenization.py:240: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/albert/tokenization.py:240: The name tf.logging.info is deprecated. 
Please use tf.compat.v1.logging.info instead.\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:loading sentence piece model\n" ] } ], "source": [ "albert = malaya.relevancy.transformer(model = 'albert')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'not relevant': 3.1056952e-06, 'relevant': 0.9999934},\n", " {'not relevant': 0.99982065, 'relevant': 3.868528e-05}]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.stack.predict_stack([albert, model], [positive_text, negative_text])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }