{ "cells": [ { "cell_type": "markdown", "id": "9c3f9467", "metadata": {}, "source": [ "# Embedding" ] }, { "cell_type": "markdown", "id": "35f9065d", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/embedding](https://github.com/huseinzol05/Malaya/tree/master/example/embedding).\n", " \n", "
" ] }, { "cell_type": "markdown", "id": "be978366", "metadata": {}, "source": [ "
\n", "\n", "This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "id": "3bffd739", "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "os.environ['CUDA_VISIBLE_DEVICES'] = ''\n", "os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'" ] }, { "cell_type": "code", "execution_count": 2, "id": "d997083d", "metadata": {}, "outputs": [], "source": [ "import logging\n", "\n", "logging.basicConfig(level=logging.INFO)" ] }, { "cell_type": "code", "execution_count": 3, "id": "3fc1e6c6", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.\n", " warn(\"The installed version of bitsandbytes was compiled without GPU support. \"\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32\n", "CPU times: user 3.27 s, sys: 2.86 s, total: 6.13 s\n", "Wall time: 2.83 s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n", "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n" ] } ], "source": [ "%%time\n", "import malaya" ] }, { "cell_type": "code", "execution_count": 4, "id": "15ea6b17", "metadata": {}, "outputs": [], "source": [ "string1 = 'Pemuda mogok lapar desak kerajaan prihatin isu iklim'\n", "string2 = 'Perbincangan isu pembalakan perlu babit kerajaan negeri'\n", "string3 = 'kerajaan perlu kisah isu iklim, pemuda mogok lapar'\n", "string4 = 'Kerajaan dicadang tubuh jawatankuasa khas tangani isu alam sekitar'" ] }, { "cell_type": "code", "execution_count": 5, "id": "7dc663d6", "metadata": {}, "outputs": [], "source": [ "news1 = 'Tun Dr Mahathir Mohamad mengakui pembubaran Parlimen bagi membolehkan pilihan raya diadakan tidak sesuai dilaksanakan pada masa ini berikutan isu COVID-19'\n", "tweet1 = 'DrM sembang pilihan raya tak boleh buat sebab COVID 19'" ] }, { "cell_type": "markdown", "id": "f9d05c4c", "metadata": {}, "source": [ "### List available HuggingFace models" ] }, { "cell_type": "code", "execution_count": 6, "id": "742c6446", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'mesolitica/mistral-embedding-191m-8k-contrastive': {'Size (MB)': 334,\n", " 'embedding size': 768,\n", " 'Suggested length': 8192},\n", " 'mesolitica/mistral-embedding-349m-8k-contrastive': {'Size (MB)': 633,\n", " 'embedding size': 768,\n", " 'Suggested length': 8192},\n", " 'mesolitica/embedding-malaysian-mistral-64M-32k': {'Size (MB)': 96.5,\n", " 'embedding size': 768,\n", " 'Suggested length': 20480}}" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.embedding.available_huggingface" ] }, { "cell_type": "code", "execution_count": 7, "id": "50885ed2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Entire Malaysian embedding benchmark at https://huggingface.co/spaces/mesolitica/malaysian-embedding-leaderboard\n", "\n" ] } ], "source": [ "print(malaya.embedding.info)" ] }, { "cell_type": "markdown", "id": "419d30dc", 
"metadata": {}, "source": [ "### Load HuggingFace model\n", "\n", "```python\n", "def huggingface(\n", " model: str = 'mesolitica/embedding-malaysian-mistral-64M-32k',\n", " force_check: bool = True,\n", " **kwargs,\n", "):\n", " \"\"\"\n", " Load HuggingFace model for embedding task.\n", "\n", " Parameters\n", " ----------\n", " model: str, optional (default='mesolitica/embedding-malaysian-mistral-64M-32k')\n", " Check available models at `malaya.embedding.available_huggingface`.\n", " force_check: bool, optional (default=True)\n", " Force check model one of malaya model.\n", " Set to False if you have your own huggingface model.\n", "\n", " Returns\n", " -------\n", " result: malaya.torch_model.huggingface.Embedding\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 8, "id": "5fdd641e", "metadata": {}, "outputs": [], "source": [ "model = malaya.embedding.huggingface()" ] }, { "cell_type": "markdown", "id": "96ab9368", "metadata": {}, "source": [ "#### Encode batch of strings\n", "\n", "```python\n", "def encode(self, strings: List[str]):\n", " \"\"\"\n", " Encode strings into embedding.\n", "\n", " Parameters\n", " ----------\n", " strings: List[str]\n", "\n", " Returns\n", " -------\n", " result: np.array\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 10, "id": "8650cf14", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(6, 768)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "v = model.encode([string1, string2, string3, string4, tweet1, news1])\n", "v.shape" ] }, { "cell_type": "code", "execution_count": 11, "id": "3fa1f404", "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics.pairwise import cosine_similarity" ] }, { "cell_type": "code", "execution_count": 12, "id": "b6ddcfbd", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[1.0000004 , 0.570497 , 0.90419084, 0.6907457 , 0.5040159 ,\n", " 0.35596827],\n", " [0.570497 , 0.99999976, 0.52848774, 0.75748587, 0.22503856,\n", " 0.20589375],\n", " [0.90419084, 0.52848774, 1. , 0.69484305, 0.5023028 ,\n", " 0.4378497 ],\n", " [0.6907457 , 0.75748587, 0.69484305, 1. , 0.3340272 ,\n", " 0.28617752],\n", " [0.5040159 , 0.22503856, 0.5023028 , 0.3340272 , 1. ,\n", " 0.6531957 ],\n", " [0.35596827, 0.20589375, 0.4378497 , 0.28617752, 0.6531957 ,\n", " 1. ]], dtype=float32)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cosine_similarity(v)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }