{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/zeroshot-classification](https://github.com/huseinzol05/Malaya/tree/master/example/zeroshot-classification).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "os.environ['CUDA_VISIBLE_DEVICES'] = ''\n", "os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.\n", " warn(\"The installed version of bitsandbytes was compiled without GPU support. \"\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32\n", "CPU times: user 3.03 s, sys: 2.5 s, total: 5.53 s\n", "Wall time: 2.74 s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n", "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n" ] } ], "source": [ "%%time\n", "import malaya" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### what is zero-shot classification\n", "\n", "Commonly we supervised a machine learning on specific labels, negative / positive for sentiment, anger / happy / sadness for emotion and etc. The model cannot give an output if we want to know how much percentage of 'jealous' in emotion analysis model because supported labels are only {anger, happy, sadness}. Imagine, for example, trying to identify a text without ever having seen one 'jealous' label before, impossible. 
**So, zero-shot learning tries to solve this problem.**\n", "\n", "Zero-shot learning refers to the process by which a machine learns how to recognize objects (images, text, any features) without any labeled training data to help with the classification.\n", "\n", "[Yin et al. (2019)](https://arxiv.org/abs/1909.00161) state in their paper that any pretrained language model fine-tuned on text similarity can act as an out-of-the-box zero-shot text classifier." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, we are going to use transformer models from `malaya.similarity.semantic.huggingface` with a few tweaks." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available HuggingFace models" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'mesolitica/finetune-mnli-t5-super-tiny-standard-bahasa-cased': {'Size (MB)': 50.7,\n", " 'macro precision': 0.74562,\n", " 'macro recall': 0.74574,\n", " 'macro f1-score': 0.74501},\n", " 'mesolitica/finetune-mnli-t5-tiny-standard-bahasa-cased': {'Size (MB)': 139,\n", " 'macro precision': 0.76584,\n", " 'macro recall': 0.76565,\n", " 'macro f1-score': 0.76542},\n", " 'mesolitica/finetune-mnli-t5-small-standard-bahasa-cased': {'Size (MB)': 242,\n", " 'macro precision': 0.78067,\n", " 'macro recall': 0.78063,\n", " 'macro f1-score': 0.7801},\n", " 'mesolitica/finetune-mnli-t5-base-standard-bahasa-cased': {'Size (MB)': 892,\n", " 'macro precision': 0.78903,\n", " 'macro recall': 0.79064,\n", " 'macro f1-score': 0.78918}}" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.zero_shot.classification.available_huggingface" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load HuggingFace model\n", "\n", "```python\n", "def huggingface(\n", " model: str = 'mesolitica/finetune-mnli-t5-small-standard-bahasa-cased',\n", " force_check: bool = True,\n", " **kwargs,\n", "):\n", " \"\"\"\n", "
" Load HuggingFace model to zeroshot text classification.\n", "\n", " Parameters\n", " ----------\n", " model: str, optional (default='mesolitica/finetune-mnli-t5-small-standard-bahasa-cased')\n", " Check available models at `malaya.zero_shot.classification.available_huggingface()`.\n", " force_check: bool, optional (default=True)\n", " Force check model one of malaya model.\n", " Set to False if you have your own huggingface model.\n", "\n", " Returns\n", " -------\n", " result: malaya.torch_model.huggingface.ZeroShotClassification\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`, it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again. You will see the new `added_tokens_decoder` attribute that will store the relevant information.\n", "You are using the default legacy behaviour of the . If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565\n", "Some weights of the model checkpoint at mesolitica/finetune-mnli-t5-small-standard-bahasa-cased were not used when initializing T5ForSequenceClassification: ['classification_head.dense.weight', 'classification_head.out_proj.weight', 'classification_head.dense.bias', 'classification_head.out_proj.bias']\n", "- This IS expected if you are initializing T5ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. 
initializing a BertForSequenceClassification model from a BertForPreTraining model).\n", "- This IS NOT expected if you are initializing T5ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n", "Some weights of T5ForSequenceClassification were not initialized from the model checkpoint at mesolitica/finetune-mnli-t5-small-standard-bahasa-cased and are newly initialized: ['classifier.weight', 'classifier.bias']\n", "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n" ] } ], "source": [ "model = malaya.zero_shot.classification.huggingface()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### predict batch\n", "\n", "```python\n", "def predict_proba(\n", " self,\n", " strings: List[str],\n", " labels: List[str],\n", " prefix: str = 'ayat ini berkaitan tentang',\n", " multilabel: bool = True,\n", "):\n", " \"\"\"\n", " classify list of strings and return probability.\n", "\n", " Parameters\n", " ----------\n", " strings: List[str]\n", " labels: List[str]\n", " prefix: str, optional (default='ayat ini berkaitan tentang')\n", " prefix of labels to zero shot. Playing around with prefix can get better results.\n", " multilabel: bool, optional (default=True)\n", " probability of labels can be more than 1.0\n", " \"\"\"\n", "```\n", "\n", "Because this is zero-shot classification, we need to give the model the candidate labels." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# copied from Twitter\n", "\n", "string = 'gov macam bengong, kami nk pilihan raya, gov backdoor, sakai'" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "You're using a T5TokenizerFast tokenizer. 
Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.\n" ] }, { "data": { "text/plain": [ "[{'najib razak': 0.47769544,\n", " 'mahathir': 0.49602416,\n", " 'kerajaan': 0.49770266,\n", " 'PRU': 0.5020965,\n", " 'anarki': 0.47935393}]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.predict_proba([string], labels = ['najib razak', 'mahathir', 'kerajaan', 'PRU', 'anarki'])" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "string = 'tolong order foodpanda jab, lapar'" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'makan': 0.49923217,\n", " 'makanan': 0.50025105,\n", " 'novel': 0.50996864,\n", " 'buku': 0.5179709,\n", " 'kerajaan': 0.52829444,\n", " 'food delivery': 0.5014325}]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.predict_proba([string], labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model is expected to relate `order foodpanda` closely to `makan`, `makanan` and `food delivery`; in the output above, however, all scores stay close to 0.5." 
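, "

Because `predict_proba` returns one dictionary of label to score per input string, a quick way to read a result is to rank the labels and look at how far apart the scores are. The following is a minimal sketch using the scores printed above for the foodpanda example; `rank_labels` is a hypothetical helper written for this illustration and is not part of Malaya.

```python
# Scores copied from the foodpanda example output above.
scores = {
    'makan': 0.49923217,
    'makanan': 0.50025105,
    'novel': 0.50996864,
    'buku': 0.5179709,
    'kerajaan': 0.52829444,
    'food delivery': 0.5014325,
}


def rank_labels(label_scores, top_k=3):
    # Hypothetical helper, not part of Malaya: sort (label, score) pairs
    # from the highest score to the lowest and keep the first top_k.
    ranked = sorted(label_scores.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


top = rank_labels(scores)
for label, score in top:
    print(f'{label}: {score:.4f}')

# Gap between the best and the worst label; a small gap means the
# ranking is weak evidence for any single label.
spread = max(scores.values()) - min(scores.values())
print(f'spread: {spread:.4f}')
```

For these scores the three top labels are `kerajaan`, `buku` and `novel`, and the spread is about 0.03, so the ranking should be read with caution.
"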
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "string = 'kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan'" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[{'makan': 0.4649984,\n", " 'makanan': 0.4640362,\n", " 'novel': 0.513372,\n", " 'buku': 0.50357056,\n", " 'kerajaan': 0.52359533,\n", " 'food delivery': 0.49170837,\n", " 'kerajaan jahat': 0.51742524,\n", " 'kerajaan prihatin': 0.5301894,\n", " 'bantuan rakyat': 0.5329738}]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.predict_proba([string], labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery',\n", " 'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### able to infer for mixed MS and EN" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "string = 'Hi guys! I noticed semalam & harini dah ramai yang dapat cookies ni kan. 
So harini i nak share some post mortem of our first batch:'" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'makan': 0.446774,\n", " 'makanan': 0.44988176,\n", " 'novel': 0.46197286,\n", " 'buku': 0.46333978,\n", " 'kerajaan': 0.47140214,\n", " 'food delivery': 0.4580575,\n", " 'kerajaan jahat': 0.45878497,\n", " 'kerajaan prihatin': 0.46423724,\n", " 'bantuan rakyat': 0.44994086,\n", " 'biskut': 0.4561887,\n", " 'very helpful': 0.4278154,\n", " 'sharing experiences': 0.458337,\n", " 'sharing session': 0.4593555}]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.predict_proba([string], labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery',\n", " 'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat',\n", " 'biskut', 'very helpful', 'sharing experiences',\n", " 'sharing session'])" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'makan': 0.44845203,\n", " 'makanan': 0.45382816,\n", " 'novel': 0.46462435,\n", " 'buku': 0.4645531,\n", " 'kerajaan': 0.47166649,\n", " 'food delivery': 0.460771,\n", " 'kerajaan jahat': 0.46060804,\n", " 'kerajaan prihatin': 0.46640876,\n", " 'bantuan rakyat': 0.4555358,\n", " 'biskut': 0.45959476,\n", " 'very helpful': 0.4333261,\n", " 'sharing experiences': 0.4627727,\n", " 'sharing session': 0.4624747}]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.predict_proba([string], labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery',\n", " 'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat',\n", " 'biskut', 'very helpful', 'sharing experiences',\n", " 'sharing session'],\n", " prefix = 'teks ini berkaitan tentang')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multiclass but not multilabel\n", "\n", "To make the probabilities across all labels sum to 1.0, 
set `multilabel=False`." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'makan': 0.07241088,\n", " 'makanan': 0.07458564,\n", " 'novel': 0.07798509,\n", " 'buku': 0.07774425,\n", " 'kerajaan': 0.07856386,\n", " 'food delivery': 0.07746716,\n", " 'kerajaan jahat': 0.076677,\n", " 'kerajaan prihatin': 0.07793205,\n", " 'bantuan rakyat': 0.08176315,\n", " 'biskut': 0.07559921,\n", " 'very helpful': 0.07805643,\n", " 'sharing experiences': 0.07687527,\n", " 'sharing session': 0.07434013}]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string = 'kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan'\n", "\n", "model.predict_proba([string], labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery',\n", " 'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat',\n", " 'biskut', 'very helpful', 'sharing experiences',\n", " 'sharing session'], multilabel = False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Stacking models\n", "\n", "For more information, see https://malaya.readthedocs.io/en/latest/Stack.html\n", "\n", "If you want to stack zero-shot classification models, you need to pass `labels` as a keyword argument:\n", "\n", "```python\n", "malaya.stack.predict_stack([model1, model2], List[str], labels = List[str])\n", "```\n", "\n", "`labels` is then passed through as `**kwargs`." 
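, "

As a rough illustration of what stacking does, the sketch below combines the per-label scores of several models with a geometric mean. That the aggregation is a geometric mean is an assumption made here, so check the Stack documentation linked above; `stack_scores` and the toy scores are hypothetical and are not Malaya APIs or real model outputs.

```python
import math


def stack_scores(per_model_scores):
    # Hypothetical helper, not a Malaya API. Each item is one model's
    # {label: probability} dict; all dicts must share the same labels and
    # every probability must be greater than 0 for the logarithm to exist.
    count = len(per_model_scores)
    return {
        label: math.exp(sum(math.log(scores[label]) for scores in per_model_scores) / count)
        for label in per_model_scores[0]
    }


# Toy scores for two imaginary models, chosen only for illustration.
model_a = {'kerajaan prihatin': 0.53, 'makan': 0.46}
model_b = {'kerajaan prihatin': 0.60, 'makan': 0.40}

stacked = stack_scores([model_a, model_b])
for label, score in stacked.items():
    print(f'{label}: {score:.4f}')
```

Under this assumption, stacking the same model several times leaves each score equal to that model's own score, since the geometric mean of identical values is the value itself.
"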
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'makan': 0.46499833,\n", " 'makanan': 0.4640362,\n", " 'novel': 0.513372,\n", " 'buku': 0.50357056,\n", " 'kerajaan': 0.52359533,\n", " 'food delivery': 0.49170837,\n", " 'kerajaan jahat': 0.51742524,\n", " 'kerajaan prihatin': 0.53018934,\n", " 'bantuan rakyat': 0.53297377,\n", " 'comel': 0.49191207,\n", " 'kerajaan syg sgt kepada rakyat': 0.5212374}]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string = 'kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan'\n", "labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery', \n", " 'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat', 'comel', 'kerajaan syg sgt kepada rakyat']\n", "malaya.stack.predict_stack([model, model, model], [string], \n", " labels = labels)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }