{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Extractive" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/keyword-extractive](https://github.com/huseinzol05/Malaya/tree/master/example/keyword-extractive).\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "os.environ['CUDA_VISIBLE_DEVICES'] = ''" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# !pip3.8 install scikit-learn -U" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.\n", " warn(\"The installed version of bitsandbytes was compiled without GPU support. \"\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n", "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n" ] } ], "source": [ "import malaya" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# https://www.bharian.com.my/berita/nasional/2020/06/698386/isu-bersatu-tun-m-6-yang-lain-saman-muhyiddin\n", "\n", "string = \"\"\"\n", "Dalam saman itu, plaintif memohon perisytiharan, antaranya mereka adalah ahli BERSATU yang sah, masih lagi memegang jawatan dalam parti (bagi pemegang jawatan) dan layak untuk bertanding pada pemilihan parti.\n", "\n", "Mereka memohon perisytiharan bahawa semua surat pemberhentian yang ditandatangani Muhammad Suhaimi bertarikh 28 Mei lalu dan pengesahan melalui mesyuarat Majlis Pimpinan Tertinggi (MPT) parti bertarikh 4 Jun lalu adalah tidak sah dan terbatal.\n", "\n", "Plaintif juga memohon perisytiharan bahawa keahlian Muhyiddin, Hamzah dan Muhammad Suhaimi di dalam BERSATU adalah terlucut, berkuat kuasa pada 28 Februari 2020 dan/atau 29 Februari 2020, menurut Fasal 10.2.3 perlembagaan parti.\n", "\n", "Yang turut dipohon, perisytiharan bahawa Seksyen 18C Akta Pertubuhan 1966 adalah tidak terpakai untuk menghalang pelupusan pertikaian berkenaan oleh mahkamah.\n", "\n", "Perisytiharan lain ialah Fasal 10.2.6 Perlembagaan BERSATU tidak terpakai di atas hal melucutkan/ memberhentikan keahlian semua plaintif.\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "# minimum cleaning, just simply to remove newlines.\n", "def cleaning(string):\n", " string = string.replace('\\n', ' ')\n", " string = re.sub('[^A-Za-z\\-() ]+', ' ', string).strip()\n", " string = re.sub(r'[ ]+', ' ', string).strip()\n", " return string\n", "\n", "string = cleaning(string)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use RAKE algorithm\n", "\n", "Original implementation from [https://github.com/aneesha/RAKE](https://github.com/aneesha/RAKE). Malaya added attention mechanism into RAKE algorithm.\n", "\n", "```python\n", "def rake(\n", " string: str,\n", " model = None,\n", " vectorizer = None,\n", " top_k: int = 5,\n", " atleast: int = 1,\n", " stopwords = get_stopwords,\n", " **kwargs\n", "):\n", " \"\"\"\n", " Extract keywords using Rake algorithm.\n", "\n", " Parameters\n", " ----------\n", " string: str\n", " model: Object, optional (default=None)\n", " Transformer model or any model has `attention` method.\n", " vectorizer: Object, optional (default=None)\n", " Prefer `sklearn.feature_extraction.text.CountVectorizer` or,\n", " `malaya.text.vectorizer.SkipGramCountVectorizer`.\n", " If None, will generate ngram automatically based on `stopwords`.\n", " top_k: int, optional (default=5)\n", " return top-k results.\n", " ngram: tuple, optional (default=(1,1))\n", " n-grams size.\n", " atleast: int, optional (default=1)\n", " at least count appeared in the string to accept as candidate.\n", " stopwords: List[str], (default=malaya.texts.function.get_stopwords)\n", " A callable that returned a List[str], or a List[str], or a Tuple[str]\n", " For automatic Ngram generator.\n", "\n", " Returns\n", " -------\n", " result: Tuple[float, str]\n", " \"\"\"\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### auto-ngram\n", "\n", "This will auto generated N-size ngram for keyword candidates." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0.11666666666666665, 'ditandatangani Muhammad Suhaimi bertarikh Mei'),\n", " (0.08888888888888888, 'mesyuarat Majlis Pimpinan Tertinggi'),\n", " (0.08888888888888888, 'Seksyen C Akta Pertubuhan'),\n", " (0.05138888888888888, 'parti bertarikh Jun'),\n", " (0.04999999999999999, 'keahlian Muhyiddin Hamzah')]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.keyword.extractive.rake(string)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### auto-gram with Attention" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "electra = malaya.transformer.huggingface(model = 'mesolitica/electra-base-generator-bahasa-cased')" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "You're using a ElectraTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.\n" ] }, { "data": { "text/plain": [ "[(0.17997009989539167, 'mesyuarat Majlis Pimpinan Tertinggi'),\n", " (0.14834545777331348, 'Seksyen C Akta Pertubuhan'),\n", " (0.12264519202953227, 'ditandatangani Muhammad Suhaimi bertarikh Mei'),\n", " (0.06489439121974774, 'terlucut berkuat kuasa'),\n", " (0.057367315322155055, 'menghalang pelupusan pertikaian')]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.keyword.extractive.rake(string, model = electra)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### using vectorizer" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "from malaya.text.vectorizer import SkipGramCountVectorizer\n", "\n", "stopwords = malaya.text.function.get_stopwords()\n", "vectorizer = SkipGramCountVectorizer(\n", " token_pattern = r'[\\S]+',\n", " ngram_range = (1, 3),\n", " stop_words = stopwords,\n", " lowercase = False,\n", " skip = 2\n", ")" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0.0017052987393271276, 'parti memohon perisytiharan'),\n", " (0.0017036368782590756, 'memohon perisytiharan BERSATU'),\n", " (0.0017012023597074357, 'memohon perisytiharan sah'),\n", " (0.0017012023597074357, 'sah memohon perisytiharan'),\n", " (0.0016992809994779549, 'perisytiharan BERSATU sah')]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.keyword.extractive.rake(string, vectorizer = vectorizer)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### fixed-ngram with Attention" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0.0033636377889573383, 'Majlis Pimpinan Tertinggi'),\n", " (0.0033245625223539293, 'Majlis Pimpinan (MPT)'),\n", " (0.0032415590393544006, 'mesyuarat Majlis Pimpinan'),\n", " (0.003145062492212815, 'pengesahan Majlis Pimpinan'),\n", " (0.003103919348118483, 'Mei Majlis Pimpinan')]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.keyword.extractive.rake(string, model = electra, vectorizer = vectorizer)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use Textrank algorithm\n", "\n", "Malaya simply use textrank algorithm.\n", "\n", "```python\n", "def textrank(\n", " string: str,\n", " model = None,\n", " vectorizer = None,\n", " top_k: int = 5,\n", " atleast: int = 1,\n", " stopwords = get_stopwords,\n", " **kwargs\n", "):\n", " \"\"\"\n", " Extract keywords using Textrank algorithm.\n", "\n", " Parameters\n", " ----------\n", " string: str\n", " model: Object, optional (default='None')\n", " model has `fit_transform` or `vectorize` method.\n", " vectorizer: Object, optional (default=None)\n", " Prefer `sklearn.feature_extraction.text.CountVectorizer` or, \n", " `malaya.text.vectorizer.SkipGramCountVectorizer`.\n", " If None, will generate ngram automatically based on `stopwords`.\n", " top_k: int, optional (default=5)\n", " return top-k results.\n", " atleast: int, optional (default=1)\n", " at least count appeared in the string to accept as candidate.\n", " stopwords: List[str], (default=malaya.texts.function.get_stopwords)\n", " A callable that returned a List[str], or a List[str], or a Tuple[str]\n", "\n", " Returns\n", " -------\n", " result: Tuple[float, str]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "tfidf = TfidfVectorizer()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### auto-ngram with TFIDF\n", "\n", "This will auto generated N-size ngram for keyword candidates." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0.00015733542794311347, 'plaintif memohon perisytiharan'),\n", " (0.00012558967338659324, 'Fasal perlembagaan parti'),\n", " (0.00011514136972371928, 'Fasal Perlembagaan BERSATU'),\n", " (0.00011505529351381784, 'parti'),\n", " (0.00010763518993348075, 'memohon perisytiharan')]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.keyword.extractive.textrank(string, model = tfidf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### auto-ngram with Attention\n", "\n", "This will auto generated N-size ngram for keyword candidates." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'mesolitica/roberta-base-bahasa-cased': {'Size (MB)': 443},\n", " 'mesolitica/roberta-tiny-bahasa-cased': {'Size (MB)': 66.1},\n", " 'mesolitica/bert-base-standard-bahasa-cased': {'Size (MB)': 443},\n", " 'mesolitica/bert-tiny-standard-bahasa-cased': {'Size (MB)': 66.1},\n", " 'mesolitica/roberta-base-standard-bahasa-cased': {'Size (MB)': 443},\n", " 'mesolitica/roberta-tiny-standard-bahasa-cased': {'Size (MB)': 66.1},\n", " 'mesolitica/electra-base-generator-bahasa-cased': {'Size (MB)': 140},\n", " 'mesolitica/electra-small-generator-bahasa-cased': {'Size (MB)': 19.3}}" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.transformer.available_huggingface" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "scrolled": true }, "outputs": [], "source": [ "electra = malaya.transformer.huggingface(model = 'mesolitica/electra-small-generator-bahasa-cased')\n", "roberta = malaya.transformer.huggingface(model = 'mesolitica/roberta-tiny-bahasa-cased')" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "You're using a ElectraTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.\n" ] }, { "data": { "text/plain": [ "[(6.318265638774778e-05, 'dipohon perisytiharan'),\n", " (6.316746347908478e-05, 'pemegang jawatan'),\n", " (6.316118494856124e-05, 'parti bertarikh Jun'),\n", " (6.316104081180683e-05, 'Februari'),\n", " (6.315819096453449e-05, 'plaintif')]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.keyword.extractive.textrank(string, model = electra)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.\n" ] }, { "data": { "text/plain": [ "[(6.592447573341062e-05, 'parti'),\n", " (6.584500366507439e-05, 'keahlian Muhyiddin Hamzah'),\n", " (6.48854995406703e-05, 'dipohon perisytiharan'),\n", " (6.4585235386134e-05, 'surat pemberhentian'),\n", " (6.436106705986197e-05, 'parti bertarikh Jun')]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.keyword.extractive.textrank(string, model = roberta)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### fixed-ngram with Attention" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "stopwords = malaya.text.function.get_stopwords()\n", "vectorizer = SkipGramCountVectorizer(\n", " token_pattern = r'[\\S]+',\n", " ngram_range = (1, 3),\n", " stop_words = stopwords,\n", " lowercase = False,\n", " skip = 2\n", ")" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(5.6521309567750785e-09, 'plaintif perisytiharan'),\n", " (5.6520371534856045e-09, 'perisytiharan ahli sah'),\n", " (5.6519577794113005e-09, 'Plaintif perisytiharan keahlian'),\n", " (5.651893084611097e-09, 'Perisytiharan'),\n", " (5.651664613022757e-09, 'plaintif memohon perisytiharan')]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.keyword.extractive.textrank(string, model = electra, vectorizer = vectorizer)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(5.923583228770409e-09, 'keahlian Muhyiddin Muhammad'),\n", " (5.9165313140370156e-09, 'parti bertarikh'),\n", " (5.913081883473326e-09, 'kuasa Fasal'),\n", " (5.902597482992337e-09, 'C Akta menghalang'),\n", " (5.90093003514351e-09, 'keahlian Muhyiddin')]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.keyword.extractive.textrank(string, model = roberta, vectorizer = vectorizer)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use Attention mechanism\n", "\n", "Use attention mechanism from transformer model to get important keywords.\n", "\n", "```python\n", "def attention(\n", " string: str,\n", " model,\n", " vectorizer = None,\n", " top_k: int = 5,\n", " atleast: int = 1,\n", " stopwords = get_stopwords,\n", " **kwargs\n", "):\n", " \"\"\"\n", " Extract keywords using Attention mechanism.\n", "\n", " Parameters\n", " ----------\n", " string: str\n", " model: Object\n", " Transformer model or any model has `attention` method.\n", " vectorizer: Object, optional (default=None)\n", " Prefer `sklearn.feature_extraction.text.CountVectorizer` or, \n", " `malaya.text.vectorizer.SkipGramCountVectorizer`.\n", " If None, will generate ngram automatically based on `stopwords`.\n", " top_k: int, optional (default=5)\n", " return top-k results.\n", " atleast: int, optional (default=1)\n", " at least count appeared in the string to accept as candidate.\n", " stopwords: List[str], (default=malaya.texts.function.get_stopwords)\n", " A callable that returned a List[str], or a List[str], or a Tuple[str]\n", "\n", " Returns\n", " -------\n", " result: Tuple[float, str]\n", " \"\"\"\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### auto-ngram\n", "\n", "This will auto generated N-size ngram for keyword candidates." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0.7273892781585295, 'menghalang pelupusan pertikaian'),\n", " (0.037768079524707135, 'plaintif memohon perisytiharan'),\n", " (0.03168171774504689, 'dipohon perisytiharan'),\n", " (0.03101700203360666, 'memohon perisytiharan'),\n", " (0.02176717370447907, 'ditandatangani Muhammad Suhaimi bertarikh Mei')]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.keyword.extractive.attention(string, model = electra)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0.07387573112018977, 'plaintif memohon perisytiharan'),\n", " (0.06143066429484969, 'Fasal perlembagaan parti'),\n", " (0.05755474756860026, 'ditandatangani Muhammad Suhaimi bertarikh Mei'),\n", " (0.05666392261960079, 'Fasal Perlembagaan BERSATU'),\n", " (0.05564947309701604, 'memohon perisytiharan')]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.keyword.extractive.attention(string, model = roberta)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### fixed-ngram" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0.029306968080037014, 'pertikaian Perisytiharan Fasal'),\n", " (0.029205818109817004, 'pertikaian mahkamah Fasal'),\n", " (0.02919627468587057, 'pertikaian Fasal Perlembagaan'),\n", " (0.02918733579124283, 'pelupusan pertikaian Fasal'),\n", " (0.029172178345266454, 'pertikaian Fasal')]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.keyword.extractive.attention(string, model = electra, vectorizer = vectorizer)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0.003424167402512117, 'parti memohon perisytiharan'),\n", " (0.0032962148932236144, 'memohon perisytiharan BERSATU'),\n", " (0.0031886482418839175, 'plaintif perisytiharan BERSATU'),\n", " (0.003181526276520311, 'BERSATU sah parti'),\n", " (0.0031634494027450396, 'perisytiharan BERSATU sah')]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.keyword.extractive.attention(string, model = roberta, vectorizer = vectorizer)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use similarity mechanism\n", "\n", "```python\n", "def similarity(\n", " string: str,\n", " model,\n", " vectorizer = None,\n", " top_k: int = 5,\n", " atleast: int = 1,\n", " stopwords = get_stopwords,\n", " **kwargs,\n", "):\n", " \"\"\"\n", " Extract keywords using Sentence embedding VS keyword embedding similarity.\n", "\n", " Parameters\n", " ----------\n", " string: str\n", " model: Object\n", " Transformer model or any model has `vectorize` method.\n", " vectorizer: Object, optional (default=None)\n", " Prefer `sklearn.feature_extraction.text.CountVectorizer` or, \n", " `malaya.text.vectorizer.SkipGramCountVectorizer`.\n", " If None, will generate ngram automatically based on `stopwords`.\n", " top_k: int, optional (default=5)\n", " return top-k results.\n", " atleast: int, optional (default=1)\n", " at least count appeared in the string to accept as candidate.\n", " stopwords: List[str], (default=malaya.texts.function.get_stopwords)\n", " A callable that returned a List[str], or a List[str], or a Tuple[str]\n", "\n", " Returns\n", " -------\n", " result: Tuple[float, str]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0.8739699, 'plaintif memohon perisytiharan'),\n", " (0.8719046, 'keahlian Muhyiddin Hamzah'),\n", " (0.8637232, 'mahkamah Perisytiharan'),\n", " (0.86043775, 'dipohon perisytiharan'),\n", " (0.8529164, 'memohon perisytiharan')]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.keyword.extractive.similarity(string, model = roberta)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0.9982966, 'keahlian Muhyiddin Hamzah'),\n", " (0.99825895, 'mesyuarat Majlis Pimpinan Tertinggi'),\n", " (0.9981606, 'Fasal perlembagaan parti'),\n", " (0.9981444, 'Fasal Perlembagaan BERSATU'),\n", " (0.9979403, 'ditandatangani Muhammad Suhaimi bertarikh Mei')]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.keyword.extractive.similarity(string, model = electra)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 2 }