{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Stacking" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/stacking](https://github.com/huseinzol05/Malaya/tree/master/example/stacking).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why Stacking?\n", "\n", "Sometime a single model is not good enough. So, you need to use multiple models to get a better result! It called stacking." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.91 s, sys: 2.59 s, total: 5.49 s\n", "Wall time: 2.45 s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n", "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n" ] } ], "source": [ "%%time\n", "import malaya" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'mesolitica/sentiment-analysis-nanot5-tiny-malaysian-cased': {'Size (MB)': 93,\n", " 'macro precision': 0.67768,\n", " 'macro recall': 0.68266,\n", " 'macro f1-score': 0.67997},\n", " 'mesolitica/sentiment-analysis-nanot5-small-malaysian-cased': {'Size (MB)': 167,\n", " 'macro precision': 0.67602,\n", " 'macro recall': 0.6712,\n", " 'macro f1-score': 0.67339}}" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.sentiment.available_huggingface" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "tiny = malaya.sentiment.huggingface(model = 'mesolitica/sentiment-analysis-nanot5-tiny-malaysian-cased')\n", "small = malaya.sentiment.huggingface(model = 'mesolitica/sentiment-analysis-nanot5-small-malaysian-cased')\n", "multinomial = malaya.sentiment.multinomial()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Stack multiple sentiment models" ] }, { "cell_type": "markdown", "metadata": {}, "source": 
[ "`malaya.stack.predict_stack` provide an easy stacking solution for Malaya models. Well, not just for sentiment models, any classification models can use `malaya.stack.predict_stack`.\n", "\n", "```python\n", "def predict_stack(\n", " models, strings: List[str], aggregate: Callable = gmean, **kwargs\n", "):\n", " \"\"\"\n", " Stacking for predictive models.\n", "\n", " Parameters\n", " ----------\n", " models: List[Callable]\n", " list of models.\n", " strings: List[str]\n", " aggregate : Callable, optional (default=scipy.stats.mstats.gmean)\n", " Aggregate function.\n", "\n", " Returns\n", " -------\n", " result: dict\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/dev/malaya/malaya/model/stem.py:28: FutureWarning: Possible nested set at position 3\n", " or re.findall(_expressions['ic'], word.lower())\n" ] }, { "data": { "text/plain": [ "[{'negative': 0.6198190231869043,\n", " 'neutral': 0.20203009823814663,\n", " 'positive': 0.0405395204541597}]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.stack.predict_stack([tiny, small, multinomial],\n", " ['harga minyak tak menentu'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Stack tagging models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For tagging models, we use majority voting stacking. So you need to need have more than 2 models to make it perfect, or else, it will pick randomly from 2 models. `malaya.stack.voting_stack` provides easy interface for this kind of stacking. 
**But it can only be used for Entity Recognition, POS tagging, and Dependency Parsing models.**\n", "\n", "```python\n", "def voting_stack(models, text):\n", " \"\"\"\n", " Stacking for POS and Entities Recognition models.\n", "\n", " Parameters\n", " ----------\n", " models: list\n", " list of models\n", " text: str\n", " string to predict\n", "\n", " Returns\n", " -------\n", " result: list\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n", "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n", "Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.\n", "Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. 
Default to no truncation.\n" ] }, { "data": { "text/plain": [ "[('KUALA', 'PROPN'),\n", " ('LUMPUR:', 'PROPN'),\n", " ('Sempena', 'PROPN'),\n", " ('sambutan', 'NOUN'),\n", " ('Aidilfitri', 'PROPN'),\n", " ('minggu', 'NOUN'),\n", " ('depan,', 'ADJ'),\n", " ('Perdana', 'PROPN'),\n", " ('Menteri', 'PROPN'),\n", " ('Tun', 'PROPN'),\n", " ('Dr', 'PROPN'),\n", " ('Mahathir', 'PROPN'),\n", " ('Mohamad', 'PROPN'),\n", " ('dan', 'CCONJ'),\n", " ('Menteri', 'PROPN'),\n", " ('Pengangkutan', 'PROPN'),\n", " ('Anthony', 'PROPN'),\n", " ('Loke', 'PROPN'),\n", " ('Siew', 'PROPN'),\n", " ('Fook', 'PROPN'),\n", " ('menitipkan', 'VERB'),\n", " ('pesanan', 'NOUN'),\n", " ('khas', 'ADJ'),\n", " ('kepada', 'ADP'),\n", " ('orang', 'NOUN'),\n", " ('ramai', 'NOUN'),\n", " ('yang', 'PRON'),\n", " ('mahu', 'ADV'),\n", " ('pulang', 'VERB'),\n", " ('ke', 'ADP'),\n", " ('kampung', 'NOUN'),\n", " ('halaman', 'NOUN'),\n", " ('masing-masing.', 'DET'),\n", " ('Dalam', 'ADP'),\n", " ('video', 'NOUN'),\n", " ('pendek', 'ADJ'),\n", " ('terbitan', 'NOUN'),\n", " ('Jabatan', 'PROPN'),\n", " ('Keselamatan', 'PROPN'),\n", " ('Jalan', 'PROPN'),\n", " ('Raya', 'PROPN'),\n", " ('(JKJR)', 'PUNCT'),\n", " ('itu,', 'DET'),\n", " ('Dr', 'PROPN'),\n", " ('Mahathir', 'PROPN'),\n", " ('menasihati', 'VERB'),\n", " ('mereka', 'PRON'),\n", " ('supaya', 'NOUN'),\n", " ('berhenti', 'VERB'),\n", " ('berehat', 'VERB'),\n", " ('dan', 'CCONJ'),\n", " ('tidur', 'VERB'),\n", " ('sebentar', 'ADV'),\n", " ('sekiranya', 'ADV'),\n", " ('mengantuk', 'VERB'),\n", " ('ketika', 'SCONJ'),\n", " ('memandu.', 'VERB')]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string = 'KUALA LUMPUR: Sempena sambutan Aidilfitri minggu depan, Perdana Menteri Tun Dr Mahathir Mohamad dan Menteri Pengangkutan Anthony Loke Siew Fook menitipkan pesanan khas kepada orang ramai yang mahu pulang ke kampung halaman masing-masing. 
Dalam video pendek terbitan Jabatan Keselamatan Jalan Raya (JKJR) itu, Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'\n", "\n", "tiny = malaya.pos.huggingface('mesolitica/pos-t5-tiny-standard-bahasa-cased')\n", "small = malaya.pos.huggingface('mesolitica/pos-t5-small-standard-bahasa-cased')\n", "malaya.stack.voting_stack([tiny, small, small], string)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "string = 'KUALA LUMPUR: Sempena sambutan Aidilfitri minggu depan, Perdana Menteri Tun Dr Mahathir Mohamad dan Menteri Pengangkutan Anthony Loke Siew Fook menitipkan pesanan khas kepada orang ramai yang mahu pulang ke kampung halaman masing-masing. Dalam video pendek terbitan Jabatan Keselamatan Jalan Raya (JKJR) itu, Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'\n", "\n", "tiny = malaya.dependency.huggingface(model = 'mesolitica/finetune-dependency-t5-tiny-standard-bahasa-cased')\n", "small = malaya.dependency.huggingface(model = 'mesolitica/finetune-dependency-t5-small-standard-bahasa-cased')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.\n", "You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.\n", "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. 
Disabling parallelism to avoid deadlocks...\n", "To disable this warning, you can either:\n", "\t- Avoid using `tokenizers` before the fork if possible\n", "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "G\n", "\n", "\n", "\n", "0\n", "0 (None)\n", "\n", "\n", "\n", "21\n", "21 (menitipkan)\n", "\n", "\n", "\n", "0->21\n", "\n", "\n", "root\n", "\n", "\n", "\n", "1\n", "1 (KUALA)\n", "\n", "\n", "\n", "21->1\n", "\n", "\n", "nsubj\n", "\n", "\n", "\n", "22\n", "22 (pesanan)\n", "\n", "\n", "\n", "21->22\n", "\n", "\n", "obj\n", "\n", "\n", "\n", "25\n", "25 (orang)\n", "\n", "\n", "\n", "21->25\n", "\n", "\n", "obl\n", "\n", "\n", "\n", "33\n", "33 (masing-masing.)\n", "\n", "\n", "\n", "21->33\n", "\n", "\n", "punct\n", "\n", "\n", "\n", "46\n", "46 (menasihati)\n", "\n", "\n", "\n", "21->46\n", "\n", "\n", "parataxis\n", "\n", "\n", "\n", "2\n", "2 (LUMPUR:)\n", "\n", "\n", "\n", "1->2\n", "\n", "\n", "flat\n", "\n", "\n", "\n", "4\n", "4 (sambutan)\n", "\n", "\n", "\n", "1->4\n", "\n", "\n", "compound\n", "\n", "\n", "\n", "3\n", "3 (Sempena)\n", "\n", "\n", "\n", "2->3\n", "\n", "\n", "flat\n", "\n", "\n", "\n", "5\n", "5 (Aidilfitri)\n", "\n", "\n", "\n", "4->5\n", "\n", "\n", "flat\n", "\n", "\n", "\n", "7\n", "7 (depan,)\n", "\n", "\n", "\n", "4->7\n", "\n", "\n", "punct\n", "\n", "\n", "\n", "8\n", "8 (Perdana)\n", "\n", "\n", "\n", "4->8\n", "\n", "\n", "appos\n", "\n", "\n", "\n", "6\n", "6 (minggu)\n", "\n", "\n", "\n", "5->6\n", "\n", "\n", "flat\n", "\n", "\n", "\n", "9\n", "9 (Menteri)\n", "\n", "\n", "\n", "8->9\n", "\n", "\n", "flat\n", "\n", "\n", "\n", "15\n", "15 (Menteri)\n", "\n", "\n", "\n", "8->15\n", "\n", "\n", "conj\n", "\n", "\n", "\n", "10\n", "10 (Tun)\n", "\n", "\n", "\n", "9->10\n", "\n", "\n", "flat\n", "\n", "\n", "\n", "14\n", "14 (dan)\n", "\n", "\n", "\n", "15->14\n", "\n", "\n", "cc\n", "\n", "\n", "\n", "16\n", "16 
(Pengangkutan)\n", "\n", "\n", "\n", "15->16\n", "\n", "\n", "flat\n", "\n", "\n", "\n", "11\n", "11 (Dr)\n", "\n", "\n", "\n", "10->11\n", "\n", "\n", "flat\n", "\n", "\n", "\n", "12\n", "12 (Mahathir)\n", "\n", "\n", "\n", "11->12\n", "\n", "\n", "flat\n", "\n", "\n", "\n", "13\n", "13 (Mohamad)\n", "\n", "\n", "\n", "12->13\n", "\n", "\n", "flat\n", "\n", "\n", "\n", "17\n", "17 (Anthony)\n", "\n", "\n", "\n", "16->17\n", "\n", "\n", "flat\n", "\n", "\n", "\n", "18\n", "18 (Loke)\n", "\n", "\n", "\n", "17->18\n", "\n", "\n", "flat\n", "\n", "\n", "\n", "19\n", "19 (Siew)\n", "\n", "\n", "\n", "18->19\n", "\n", "\n", "flat\n", "\n", "\n", "\n", "20\n", "20 (Fook)\n", "\n", "\n", "\n", "19->20\n", "\n", "\n", "flat\n", "\n", "\n", "\n", "23\n", "23 (khas)\n", "\n", "\n", "\n", "22->23\n", "\n", "\n", "amod\n", "\n", "\n", "\n", "24\n", "24 (kepada)\n", "\n", "\n", "\n", "25->24\n", "\n", "\n", "case\n", "\n", "\n", "\n", "26\n", "26 (ramai)\n", "\n", "\n", "\n", "25->26\n", "\n", "\n", "amod\n", "\n", "\n", "\n", "29\n", "29 (pulang)\n", "\n", "\n", "\n", "25->29\n", "\n", "\n", "acl\n", "\n", "\n", "\n", "35\n", "35 (video)\n", "\n", "\n", "\n", "46->35\n", "\n", "\n", "obl\n", "\n", "\n", "\n", "44\n", "44 (Dr)\n", "\n", "\n", "\n", "46->44\n", "\n", "\n", "nsubj\n", "\n", "\n", "\n", "47\n", "47 (mereka)\n", "\n", "\n", "\n", "46->47\n", "\n", "\n", "obj\n", "\n", "\n", "\n", "49\n", "49 (berhenti)\n", "\n", "\n", "\n", "46->49\n", "\n", "\n", "conj\n", "\n", "\n", "\n", "27\n", "27 (yang)\n", "\n", "\n", "\n", "29->27\n", "\n", "\n", "nsubj\n", "\n", "\n", "\n", "28\n", "28 (mahu)\n", "\n", "\n", "\n", "29->28\n", "\n", "\n", "advmod\n", "\n", "\n", "\n", "31\n", "31 (kampung)\n", "\n", "\n", "\n", "29->31\n", "\n", "\n", "obl\n", "\n", "\n", "\n", "30\n", "30 (ke)\n", "\n", "\n", "\n", "31->30\n", "\n", "\n", "case\n", "\n", "\n", "\n", "32\n", "32 (halaman)\n", "\n", "\n", "\n", "31->32\n", "\n", "\n", "compound\n", "\n", "\n", "\n", "34\n", "34 (Dalam)\n", 
"\n", "\n", "\n", "35->34\n", "\n", "\n", "case\n", "\n", "\n", "\n", "36\n", "36 (pendek)\n", "\n", "\n", "\n", "35->36\n", "\n", "\n", "amod\n", "\n", "\n", "\n", "37\n", "37 (terbitan)\n", "\n", "\n", "\n", "35->37\n", "\n", "\n", "compound\n", "\n", "\n", "\n", "38\n", "38 (Jabatan)\n", "\n", "\n", "\n", "37->38\n", "\n", "\n", "flat\n", "\n", "\n", "\n", "43\n", "43 (itu,)\n", "\n", "\n", "\n", "37->43\n", "\n", "\n", "det\n", "\n", "\n", "\n", "39\n", "39 (Keselamatan)\n", "\n", "\n", "\n", "38->39\n", "\n", "\n", "flat\n", "\n", "\n", "\n", "40\n", "40 (Jalan)\n", "\n", "\n", "\n", "39->40\n", "\n", "\n", "flat\n", "\n", "\n", "\n", "41\n", "41 (Raya)\n", "\n", "\n", "\n", "40->41\n", "\n", "\n", "flat\n", "\n", "\n", "\n", "42\n", "42 ((JKJR))\n", "\n", "\n", "\n", "41->42\n", "\n", "\n", "flat\n", "\n", "\n", "\n", "45\n", "45 (Mahathir)\n", "\n", "\n", "\n", "44->45\n", "\n", "\n", "flat\n", "\n", "\n", "\n", "48\n", "48 (supaya)\n", "\n", "\n", "\n", "49->48\n", "\n", "\n", "cc\n", "\n", "\n", "\n", "50\n", "50 (berehat)\n", "\n", "\n", "\n", "49->50\n", "\n", "\n", "xcomp\n", "\n", "\n", "\n", "52\n", "52 (tidur)\n", "\n", "\n", "\n", "50->52\n", "\n", "\n", "conj\n", "\n", "\n", "\n", "51\n", "51 (dan)\n", "\n", "\n", "\n", "52->51\n", "\n", "\n", "cc\n", "\n", "\n", "\n", "55\n", "55 (mengantuk)\n", "\n", "\n", "\n", "52->55\n", "\n", "\n", "xcomp\n", "\n", "\n", "\n", "53\n", "53 (sebentar)\n", "\n", "\n", "\n", "55->53\n", "\n", "\n", "case\n", "\n", "\n", "\n", "54\n", "54 (sekiranya)\n", "\n", "\n", "\n", "55->54\n", "\n", "\n", "advmod\n", "\n", "\n", "\n", "57\n", "57 (memandu.)\n", "\n", "\n", "\n", "55->57\n", "\n", "\n", "advcl\n", "\n", "\n", "\n", "56\n", "56 (ketika)\n", "\n", "\n", "\n", "57->56\n", "\n", "\n", "mark\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tagging, indexing = malaya.stack.voting_stack([tiny, small, small], string)\n", 
"malaya.dependency.dependency_graph(tagging, indexing).to_graphvis()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 2 }