{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Semantic Similarity HuggingFace" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/semantic-similarity-huggingface](https://github.com/huseinzol05/Malaya/tree/master/example/semantic-similarity-huggingface).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "os.environ['CUDA_VISIBLE_DEVICES'] = ''\n", "os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import logging\n", "\n", "logging.basicConfig(level=logging.INFO)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.11 s, sys: 3.56 s, total: 6.67 s\n", "Wall time: 2.22 s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/dev/malaya/malaya/tokenizer.py:208: FutureWarning: Possible nested set at position 3372\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n", "/home/husein/dev/malaya/malaya/tokenizer.py:208: FutureWarning: Possible nested set at position 3890\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n" ] } ], "source": [ "%%time\n", "import malaya" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "string1 = 'Pemuda mogok lapar desak kerajaan prihatin isu iklim'\n", "string2 = 'Perbincangan isu pembalakan perlu babit kerajaan negeri'\n", "string3 = 'kerajaan perlu kisah isu iklim, pemuda mogok lapar'\n", "string4 = 'Kerajaan dicadang tubuh jawatankuasa khas tangani isu alam sekitar'" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "news1 = 'Tun Dr Mahathir Mohamad mengakui pembubaran Parlimen bagi membolehkan pilihan raya diadakan tidak sesuai dilaksanakan pada masa ini berikutan isu COVID-19'\n", "tweet1 = 'DrM sembang pilihan raya tak boleh buat sebab COVID 19'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available HuggingFace models" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ 
"INFO:malaya.similarity.semantic:tested on matched dev set translated MNLI, https://huggingface.co/datasets/mesolitica/translated-MNLI\n" ] }, { "data": { "text/html": [ "
<div>\n", "<table>\n", "  <thead>\n", "    <tr><th></th><th>Size (MB)</th><th>macro precision</th><th>macro recall</th><th>macro f1-score</th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th>mesolitica/finetune-mnli-t5-super-tiny-standard-bahasa-cased</th><td>50.7</td><td>0.88756</td><td>0.887</td><td>0.88727</td></tr>\n",
"    <tr><th>mesolitica/finetune-mnli-t5-tiny-standard-bahasa-cased</th><td>139.0</td><td>0.88756</td><td>0.887</td><td>0.88727</td></tr>\n",
"    <tr><th>mesolitica/finetune-mnli-t5-small-standard-bahasa-cased</th><td>242.0</td><td>0.88756</td><td>0.887</td><td>0.88727</td></tr>\n",
"    <tr><th>mesolitica/finetune-mnli-t5-base-standard-bahasa-cased</th><td>892.0</td><td>0.88756</td><td>0.887</td><td>0.88727</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Size (MB) \\\n", "mesolitica/finetune-mnli-t5-super-tiny-standard... 50.7 \n", "mesolitica/finetune-mnli-t5-tiny-standard-bahas... 139.0 \n", "mesolitica/finetune-mnli-t5-small-standard-baha... 242.0 \n", "mesolitica/finetune-mnli-t5-base-standard-bahas... 892.0 \n", "\n", " macro precision \\\n", "mesolitica/finetune-mnli-t5-super-tiny-standard... 0.88756 \n", "mesolitica/finetune-mnli-t5-tiny-standard-bahas... 0.88756 \n", "mesolitica/finetune-mnli-t5-small-standard-baha... 0.88756 \n", "mesolitica/finetune-mnli-t5-base-standard-bahas... 0.88756 \n", "\n", " macro recall \\\n", "mesolitica/finetune-mnli-t5-super-tiny-standard... 0.887 \n", "mesolitica/finetune-mnli-t5-tiny-standard-bahas... 0.887 \n", "mesolitica/finetune-mnli-t5-small-standard-baha... 0.887 \n", "mesolitica/finetune-mnli-t5-base-standard-bahas... 0.887 \n", "\n", " macro f1-score \n", "mesolitica/finetune-mnli-t5-super-tiny-standard... 0.88727 \n", "mesolitica/finetune-mnli-t5-tiny-standard-bahas... 0.88727 \n", "mesolitica/finetune-mnli-t5-small-standard-baha... 0.88727 \n", "mesolitica/finetune-mnli-t5-base-standard-bahas... 
0.88727 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.similarity.semantic.available_huggingface()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load HuggingFace model\n", "\n", "```python\n", "def huggingface(model: str = 'mesolitica/finetune-mnli-t5-small-standard-bahasa-cased', **kwargs):\n", "    \"\"\"\n", "    Load HuggingFace model to calculate semantic similarity between 2 sentences.\n", "\n", "    Parameters\n", "    ----------\n", "    model: str, optional (default='mesolitica/finetune-mnli-t5-small-standard-bahasa-cased')\n", "        Check available models at `malaya.similarity.semantic.available_huggingface()`.\n", "\n", "    Returns\n", "    -------\n", "    result: malaya.torch_model.huggingface.Similarity\n", "    \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "model = malaya.similarity.semantic.huggingface()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Predict a batch of strings with probability\n", "\n", "```python\n", "def predict_proba(self, strings_left: List[str], strings_right: List[str]):\n", "    \"\"\"\n", "    Calculate similarity for two different batches of texts.\n", "\n", "    Parameters\n", "    ----------\n", "    strings_left : List[str]\n", "    strings_right : List[str]\n", "\n", "    Returns\n", "    -------\n", "    list: List[float]\n", "    \"\"\"\n", "```\n", "\n", "You need to pass a list of left strings and a list of right strings of the same length.\n", "\n", "The first left string is compared with the first right string, the second with the second, and so on.\n", "\n", "The similarity model only supports `predict_proba`."
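, "\n", "\n", "
Because `predict_proba` compares the two lists position by position, the returned scores line up with the input pairs. The sketch below is illustrative only: it reuses the scores printed by the `predict_proba` call in this notebook, uses placeholder names instead of the actual strings, and applies a 0.5 cut-off that is an assumption rather than a value recommended by Malaya. `label_pairs` is a hypothetical helper, not part of the library.

```python
import numpy as np

# Scores copied from the predict_proba output in this notebook.
scores = np.array([0.9973929, 0.00111997, 0.5448353, 0.0183536], dtype=np.float32)

# Placeholder names standing in for the strings passed to predict_proba.
left = ['string1', 'string2', 'news1', 'news1']
right = ['string3', 'string4', 'tweet1', 'string1']

# Assumed cut-off for illustration; tune it on your own data.
THRESHOLD = 0.5


def label_pairs(left, right, scores, threshold=THRESHOLD):
    # Hypothetical helper: attach each score to the pair that produced it
    # and flag the pair as similar when the score reaches the threshold.
    if not (len(left) == len(right) == len(scores)):
        raise ValueError('left, right and scores must have the same length')
    return [
        (a, b, float(s), bool(s >= threshold))
        for a, b, s in zip(left, right, scores)
    ]


results = label_pairs(left, right, scores)
for a, b, score, similar in results:
    print(f'{a} vs {b}: {score:.4f} similar={similar}')
```

In practice `scores` would come straight from `model.predict_proba(left, right)`, so the same helper can be applied to any batch of aligned pairs."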
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.9973929 , 0.00111997, 0.5448353 , 0.0183536 ], dtype=float32)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.predict_proba([string1, string2, news1, news1], [string3, string4, tweet1, string1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Able to infer on mixed Malay (MS) and English (EN)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "en = 'Youth on hunger strike urge the government to be concerned about the climate issue'\n", "en2 = 'the end of the wrld, global warming!'" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.01690625, 0.9966125 ], dtype=float32)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.predict_proba([string1, string1], [en2, en])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }