{ "cells": [ { "cell_type": "markdown", "id": "initial-avatar", "metadata": {}, "source": [ "# Doc2Vec" ] }, { "cell_type": "markdown", "id": "current-accident", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/similarity-doc2vec](https://github.com/huseinzol05/Malaya/tree/master/example/similarity-doc2vec).\n", " \n", "
" ] }, { "cell_type": "markdown", "id": "significant-sarah", "metadata": {}, "source": [ "
\n", "\n", "This module trained on both standard and local (included social media) language structures, so it is save to use for both.\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "os.environ['CUDA_VISIBLE_DEVICES'] = ''" ] }, { "cell_type": "code", "execution_count": 2, "id": "inside-queensland", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.\n", " warn(\"The installed version of bitsandbytes was compiled without GPU support. \"\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32\n", "CPU times: user 2.74 s, sys: 3.75 s, total: 6.49 s\n", "Wall time: 1.97 s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n", "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n" ] } ], "source": [ "%%time\n", "import malaya" ] }, { "cell_type": "code", "execution_count": 3, "id": "polished-broadcast", "metadata": {}, "outputs": [], "source": [ "string1 = 'Pemuda mogok lapar desak kerajaan prihatin isu iklim'\n", "string2 = 'Perbincangan isu pembalakan perlu babit kerajaan negeri'\n", "string3 = 'kerajaan perlu kisah isu iklim, pemuda mogok lapar'\n", "string4 = 'Kerajaan dicadang tubuh jawatankuasa khas tangani isu alam sekitar'" ] }, { "cell_type": "code", "execution_count": 4, "id": "increasing-picnic", "metadata": {}, "outputs": [], "source": [ "news1 = 'Tun Dr Mahathir Mohamad mengakui pembubaran Parlimen bagi membolehkan pilihan raya diadakan tidak sesuai dilaksanakan pada masa ini berikutan isu COVID-19'\n", "tweet1 = 'DrM sembang pilihan raya tak boleh buat sebab COVID 19'" ] }, { "cell_type": "markdown", "id": "collect-housing", "metadata": {}, "source": [ "### Vectorizer Model\n", "\n", "We can use any Vectorizer models provided by Malaya to use encoder similarity interface, example, BERT, XLNET. Again, these encoder models not trained to do similarity classification, it just encode the strings into vector representation.\n", "\n", "```python\n", "def vectorizer(v):\n", " \"\"\"\n", " Doc2vec interface for text similarity using Encoder model.\n", "\n", " Parameters\n", " ----------\n", " v: object\n", " encoder interface object, BERT, XLNET.\n", " should have `vectorize` method.\n", "\n", " Returns\n", " -------\n", " result: malaya.similarity.doc2vec.VectorizerSimilarity\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`, it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again. 
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`, it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again. You will see the new `added_tokens_decoder` attribute that will store the relevant information.\n" ] } ], "source": [ "model = malaya.transformer.huggingface()" ] }, { "cell_type": "code", "execution_count": 7, "id": "smaller-kruger", "metadata": {}, "outputs": [], "source": [ "doc2vec_vectorizer = malaya.similarity.doc2vec.vectorizer(model)" ] }, { "cell_type": "markdown", "id": "marked-setting", "metadata": {}, "source": [ "#### predict for 2 strings with probability\n", "\n", "```python\n", "def predict_proba(\n", " self,\n", " left_strings: List[str],\n", " right_strings: List[str],\n", " similarity: str = 'cosine',\n", "):\n", " \"\"\"\n", " calculate similarity for two different batches of texts.\n", "\n", " Parameters\n", " ----------\n", " left_strings : list of str\n", " right_strings : list of str\n", " similarity : str, optional (default='cosine')\n", " similarity supported. Allowed values:\n", "\n", " * ``'cosine'`` - cosine similarity.\n", " * ``'euclidean'`` - euclidean similarity.\n", " * ``'manhattan'`` - manhattan similarity.\n", "\n", " Returns\n", " -------\n", " result: List[float]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 8, "id": "tight-colors", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "You're using a ElectraTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 314 ms, sys: 0 ns, total: 314 ms\n", "Wall time: 111 ms\n" ] }, { "data": { "text/plain": [ "array([0.93616056], dtype=float32)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "doc2vec_vectorizer.predict_proba([string1], [string2])" ] }, { "cell_type": "code", "execution_count": 9, "id": "upset-resource", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 317 ms, sys: 0 ns, total: 317 ms\n", "Wall time: 32.9 ms\n" ] }, { "data": { "text/plain": [ "array([0.97735137, 0.96125495], dtype=float32)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "doc2vec_vectorizer.predict_proba([string1, string2], [string3, string4])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 5 }