{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Subjectivity Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/subjectivity](https://github.com/huseinzol05/Malaya/tree/master/example/subjectivity).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import logging\n", "\n", "logging.basicConfig(level=logging.INFO)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:numexpr.utils:NumExpr defaulting to 8 threads.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 5.2 s, sys: 752 ms, total: 5.95 s\n", "Wall time: 5.58 s\n" ] } ], "source": [ "%%time\n", "import malaya" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### labels supported\n", "\n", "Default labels for subjectivity module." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['negative', 'positive']" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.subjectivity.label" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Explanation\n", "\n", "Positive subjectivity: based on or influenced by personal feelings, tastes, or opinions. Can be a positive or negative sentiment.\n", "\n", "Negative subjectivity: based on a report or a fact. Can be a positive or negative sentiment." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "negative_text = 'Kerajaan negeri Kelantan mempersoalkan motif kenyataan Menteri Kewangan Lim Guan Eng yang hanya menyebut Kelantan penerima terbesar bantuan kewangan dari Kerajaan Persekutuan. Sedangkan menurut Timbalan Menteri Besarnya, Datuk Mohd Amar Nik Abdullah, negeri lain yang lebih maju dari Kelantan turut mendapat pembiayaan dan pinjaman.'\n", "positive_text = 'kerajaan sebenarnya sangat bencikan rakyatnya, minyak naik dan segalanya'\n", "\n", "string1 = 'Sis, students from overseas were brought back because they are not in their countries which is if something happens to them, its not the other countries’ responsibility. 
Student dalam malaysia ni dah dlm tggjawab kerajaan. Mana part yg tak faham?'\n", "string2 = 'Harap kerajaan tak bukak serentak. Slowly release week by week. Focus on economy related industries dulu'\n", "string3 = 'Idk if aku salah baca ke apa. Bayaran rm350 utk golongan umur 21 ke bawah shj ? Anyone? If 21 ke atas ok lah. If umur 21 ke bawah? Are you serious? Siapa yg lebih byk komitmen? Aku hrp aku salah baca. Aku tk jumpa artikel tu'\n", "string4 = 'Jabatan Penjara Malaysia diperuntukkan RM20 juta laksana program pembangunan Insan kepada banduan. Majikan yang menggaji bekas banduan, bekas penagih dadah diberi potongan cukai tambahan sehingga 2025.'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load multinomial model\n", "\n", "```python\n", "def multinomial(**kwargs):\n", " \"\"\"\n", " Load multinomial subjectivity model.\n", "\n", " Returns\n", " -------\n", " result : malaya.model.ml.Bayes class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "model = malaya.subjectivity.multinomial()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Predict batch of strings\n", "\n", "```python\n", "def predict(self, strings: List[str], add_neutral: bool = True):\n", " \"\"\"\n", " classify list of strings.\n", "\n", " Parameters\n", " ----------\n", " strings: List[str]\n", " add_neutral: bool, optional (default=True)\n", " if True, it will add neutral probability.\n", "\n", " Returns\n", " -------\n", " result: List[str]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['neutral', 'negative']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.predict([positive_text, negative_text])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To disable the `neutral` probability, pass `add_neutral = False`:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, 
"outputs": [ { "data": { "text/plain": [ "['positive', 'negative']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.predict([positive_text,negative_text], add_neutral = False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Predict batch of strings with probability\n", "\n", "```python\n", "def predict_proba(self, strings: List[str], add_neutral: bool = True):\n", " \"\"\"\n", " classify list of strings and return probability.\n", "\n", " Parameters\n", " ----------\n", " strings: List[str]\n", " add_neutral: bool, optional (default=True)\n", " if True, it will add neutral probability.\n", "\n", " Returns\n", " -------\n", " result: List[dict[str, float]]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'negative': 0.420659316666446, 'positive': 0.5793406833335559},\n", " {'negative': 0.7906212884104161, 'positive': 0.2093787115895868}]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.predict_proba([positive_text,negative_text], add_neutral = False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available Transformer models" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.subjectivity:trained on 80% dataset, tested on another 20% test set, dataset at https://github.com/huseinzol05/Malay-Dataset/tree/master/corpus/subjectivity\n" ] }, { "data": { "text/html": [ "
\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>Size (MB)</th><th>Quantized Size (MB)</th><th>macro precision</th><th>macro recall</th><th>macro f1-score</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>bert</th><td>425.6</td><td>111.00</td><td>0.92004</td><td>0.91748</td><td>0.91663</td></tr>\n",
"    <tr><th>tiny-bert</th><td>57.4</td><td>15.40</td><td>0.91023</td><td>0.90228</td><td>0.90301</td></tr>\n",
"    <tr><th>albert</th><td>48.6</td><td>12.80</td><td>0.90544</td><td>0.90299</td><td>0.90300</td></tr>\n",
"    <tr><th>tiny-albert</th><td>22.4</td><td>5.98</td><td>0.89457</td><td>0.89469</td><td>0.89461</td></tr>\n",
"    <tr><th>xlnet</th><td>446.6</td><td>118.00</td><td>0.91916</td><td>0.91753</td><td>0.91761</td></tr>\n",
"    <tr><th>alxlnet</th><td>46.8</td><td>13.30</td><td>0.90862</td><td>0.90835</td><td>0.90817</td></tr>\n",
"    <tr><th>fastformer</th><td>458.0</td><td>116.00</td><td>0.80785</td><td>0.81973</td><td>0.80758</td></tr>\n",
"    <tr><th>tiny-fastformer</th><td>77.3</td><td>19.70</td><td>0.87147</td><td>0.87147</td><td>0.87105</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"
" ], "text/plain": [ " Size (MB) Quantized Size (MB) macro precision \\\n", "bert 425.6 111.00 0.92004 \n", "tiny-bert 57.4 15.40 0.91023 \n", "albert 48.6 12.80 0.90544 \n", "tiny-albert 22.4 5.98 0.89457 \n", "xlnet 446.6 118.00 0.91916 \n", "alxlnet 46.8 13.30 0.90862 \n", "fastformer 458.0 116.00 0.80785 \n", "tiny-fastformer 77.3 19.70 0.87147 \n", "\n", " macro recall macro f1-score \n", "bert 0.91748 0.91663 \n", "tiny-bert 0.90228 0.90301 \n", "albert 0.90299 0.90300 \n", "tiny-albert 0.89469 0.89461 \n", "xlnet 0.91753 0.91761 \n", "alxlnet 0.90835 0.90817 \n", "fastformer 0.81973 0.80758 \n", "tiny-fastformer 0.87147 0.87105 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.subjectivity.available_transformer()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Transformer model\n", "\n", "All model interface will follow sklearn interface started v3.4,\n", "\n", "```python\n", "def transformer(model: str = 'bert', quantized: bool = False, **kwargs):\n", " \"\"\"\n", " Load Transformer subjectivity model.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='bert')\n", " Model architecture supported. 
Allowed values:\n", "\n", " * ``'bert'`` - Google BERT BASE parameters.\n", " * ``'tiny-bert'`` - Google BERT TINY parameters.\n", " * ``'albert'`` - Google ALBERT BASE parameters.\n", " * ``'tiny-albert'`` - Google ALBERT TINY parameters.\n", " * ``'xlnet'`` - Google XLNET BASE parameters.\n", " * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.\n", " * ``'fastformer'`` - FastFormer BASE parameters.\n", " * ``'tiny-fastformer'`` - FastFormer TINY parameters.\n", "\n", " quantized : bool, optional (default=False)\n", " if True, will load 8-bit quantized model.\n", " A quantized model is not necessarily faster; it depends entirely on the machine.\n", "\n", " Returns\n", " -------\n", " result: model\n", " List of model classes:\n", "\n", " * if `bert` in model, will return `malaya.model.bert.BinaryBERT`.\n", " * if `xlnet` in model, will return `malaya.model.xlnet.BinaryXLNET`.\n", " * if `fastformer` in model, will return `malaya.model.fastformer.BinaryFastFormer`.\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "101%|██████████| 49.0/48.6 [00:35<00:00, 1.39MB/s]\n", "184%|██████████| 1.00/0.54 [00:02<-1:59:59, 2.83s/MB]\n", "135%|██████████| 1.00/0.74 [00:03<00:00, 3.73s/MB]\n" ] } ], "source": [ "model = malaya.subjectivity.transformer(model = 'albert')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Quantized model\n", "\n", "To load an 8-bit quantized model, pass `quantized = True` (the default is `False`).\n", "\n", "Expect a slight accuracy drop from a quantized model. It is also not necessarily faster than the normal 32-bit float model; that depends entirely on the machine." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:root:Load quantized model will cause accuracy drop.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:loading sentence piece model\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:tensorflow:loading sentence piece model\n" ] } ], "source": [ "quantized_model = malaya.subjectivity.transformer(model = 'albert', quantized = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Predict batch of strings\n", "\n", "```python\n", "def predict(self, strings: List[str], add_neutral: bool = True):\n", " \"\"\"\n", " classify list of strings.\n", "\n", " Parameters\n", " ----------\n", " strings: List[str]\n", " add_neutral: bool, optional (default=True)\n", " if True, it will add neutral probability.\n", "\n", " Returns\n", " -------\n", " result: List[str]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['negative', 'negative']" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.predict([negative_text, positive_text])" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['negative', 'negative']" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "quantized_model.predict([negative_text, positive_text])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Predict batch of strings with probability\n", "\n", "```python\n", "def predict_proba(self, strings: List[str], add_neutral: bool = True):\n", " \"\"\"\n", " classify list of strings and return probability.\n", "\n", " Parameters\n", " ----------\n", " strings: List[str]\n", " add_neutral: bool, optional (default=True)\n", " if True, it will add neutral probability.\n", 
"\n", " Returns\n", " -------\n", " result: List[dict[str, float]]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'negative': 0.9956738, 'positive': 4.326162e-05, 'neutral': 0.0042829514},\n", " {'negative': 0.9615872, 'positive': 0.00038412912, 'neutral': 0.038028657}]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.predict_proba([negative_text, positive_text])" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'negative': 0.9954784, 'positive': 4.521673e-05, 'neutral': 0.0044763684},\n", " {'negative': 0.9612684, 'positive': 0.00038731584, 'neutral': 0.038344264}]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "quantized_model.predict_proba([negative_text, positive_text])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Open subjectivity visualization dashboard\n", "\n", "By default, calling `predict_words` opens a browser with the visualization dashboard; you can disable this with `visualization=False`.\n", "\n", "```python\n", "def predict_words(\n", " self,\n", " string: str,\n", " method: str = 'last',\n", " bins_size: float = 0.05,\n", " visualization: bool = True,\n", "):\n", " \"\"\"\n", " classify words.\n", "\n", " Parameters\n", " ----------\n", " string : str\n", " method : str, optional (default='last')\n", " Attention layer supported. 
Allowed values:\n", "\n", " * ``'last'`` - attention from last layer.\n", " * ``'first'`` - attention from first layer.\n", " * ``'mean'`` - average attentions from all layers.\n", " bins_size: float, optional (default=0.05)\n", " default bins size for word distribution histogram.\n", " visualization: bool, optional (default=True)\n", " If True, it will open the visualization dashboard.\n", "\n", " Returns\n", " -------\n", " dictionary: results\n", " \"\"\"\n", "```\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "model.predict_words(negative_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Vectorize\n", "\n", "To visualize sentence-level or word-level vectors in a lower dimension, use `model.vectorize`:\n", "\n", "```python\n", "def vectorize(self, strings: List[str], method: str = 'first'):\n", " \"\"\"\n", " vectorize list of strings.\n", "\n", " Parameters\n", " ----------\n", " strings: List[str]\n", " method : str, optional (default='first')\n", " Vectorization layer supported. Allowed values:\n", "\n", " * ``'last'`` - vector from last sequence.\n", " * ``'first'`` - vector from first sequence.\n", " * ``'mean'`` - average vectors from all sequences.\n", " * ``'word'`` - average vectors based on tokens.\n", "\n", " Returns\n", " -------\n", " result: np.array\n", " \"\"\"\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Sentence level" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "texts = [negative_text, positive_text, string1, string2]\n", "r = quantized_model.vectorize(texts, method = 'first')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(4, 2)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.manifold import TSNE\n", "import matplotlib.pyplot as plt\n", "\n", "tsne = TSNE().fit_transform(r)\n", "tsne.shape" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize = (7, 7))\n", "plt.scatter(tsne[:, 0], tsne[:, 1])\n", "labels = texts\n", "for label, x, y in zip(\n", " labels, tsne[:, 0], tsne[:, 1]\n", "):\n", " label = (\n", " '%s, %.3f' % (label[0], label[1])\n", " if isinstance(label, list)\n", " else label\n", " )\n", " plt.annotate(\n", " label,\n", " xy = (x, y),\n", " xytext = (0, 0),\n", " textcoords = 'offset points',\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Word level" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "r = quantized_model.vectorize(texts, method = 'word')" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "x, y = [], []\n", "for row in r:\n", " x.extend([i[0] for i in row])\n", " y.extend([i[1] for i in row])" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(109, 2)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tsne = TSNE().fit_transform(y)\n", "tsne.shape" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize = (7, 7))\n", "plt.scatter(tsne[:, 0], tsne[:, 1])\n", "labels = x\n", "for label, x, y in zip(\n", " labels, tsne[:, 0], tsne[:, 1]\n", "):\n", " label = (\n", " '%s, %.3f' % (label[0], label[1])\n", " if isinstance(label, list)\n", " else label\n", " )\n", " plt.annotate(\n", " label,\n", " xy = (x, y),\n", " xytext = (0, 0),\n", " textcoords = 'offset points',\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pretty good: the model clusters positive subjectivity at the top and negative subjectivity at the bottom." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Stacking models\n", "\n", "For more information, see [https://malaya.readthedocs.io/en/latest/Stack.html](https://malaya.readthedocs.io/en/latest/Stack.html)." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "multinomial = malaya.subjectivity.multinomial()\n", "alxlnet = malaya.subjectivity.transformer(model = 'alxlnet')" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'negative': 0.19735892950073536,\n", " 'positive': 0.003119166818228667,\n", " 'neutral': 0.1160071232668102}]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.stack.predict_stack([multinomial, model, alxlnet], [positive_text])" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'negative': 0.7424157666636825, 'positive': 0.04498033797670938}]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.stack.predict_stack([multinomial, model, alxlnet], [positive_text], add_neutral = False)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, 
"language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }