{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Encoder" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "# Hide all GPUs so that TensorFlow runs on the CPU.\n", "os.environ['CUDA_VISIBLE_DEVICES'] = ''\n", "# Let TensorFlow grow GPU memory on demand instead of reserving it all up front.\n", "os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.22 s, sys: 3.29 s, total: 6.51 s\n", "Wall time: 2.31 s\n" ] } ], "source": [ "%%time\n", "\n", "import malaya" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why augmentation\n", "\n", "Suppose you have a very limited labelled corpus and want to add more examples, but labelling is very costly.\n", "\n", "Text augmentation can help! Malaya provides a few augmentation interfaces." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "string = 'saya suka makan ayam dan ikan'\n", "text = 'Perdana Menteri berkata, beliau perlu memperoleh maklumat terperinci berhubung isu berkenaan sebelum kerajaan dapat mengambil sebarang tindakan lanjut. Bagaimanapun, beliau yakin masalah itu dapat diselesaikan dan pentadbiran kerajaan boleh berfungsi dengan baik.'" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "tokenizer = malaya.preprocessing.Tokenizer()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Wordvector\n", "\n", "A dictionary of synonyms is quite hard to populate and requires domain experts to help us. 
So we can use word vectors to find the nearest words instead.\n", "\n", "```python\n", "def wordvector(\n", "    string: str,\n", "    wordvector,\n", "    threshold: float = 0.5,\n", "    top_n: int = 5,\n", "    soft: bool = False,\n", "):\n", "    \"\"\"\n", "    Augment a string using a wordvector.\n", "\n", "    Parameters\n", "    ----------\n", "    string: str\n", "    wordvector: object\n", "        wordvector interface object.\n", "    threshold: float, optional (default=0.5)\n", "        random selection for a word.\n", "    soft: bool, optional (default=False)\n", "        if True, a word not in the dictionary will be replaced with the dictionary word that has the nearest Jaro-Winkler ratio.\n", "        if False, it will throw an exception if a word is not in the dictionary.\n", "    top_n: int, (default=5)\n", "        number of nearest neighbors returned. The returned result should have length top_n.\n", "\n", "    Returns\n", "    -------\n", "    result: List[str]\n", "    \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Load pretrained wordvector into `malaya.wordvector.WordVector` class will disable eager execution.\n", "2022-12-12 13:17:25.893504: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA\n", "To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\n", "2022-12-12 13:17:25.897836: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected\n", "2022-12-12 13:17:25.897855: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: husein-MS-7D31\n", "2022-12-12 13:17:25.897859: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: husein-MS-7D31\n", "2022-12-12 13:17:25.897923: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] 
libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program\n", "2022-12-12 13:17:25.897944: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 470.161.3\n" ] } ], "source": [ "vocab_wiki, embedded_wiki = malaya.wordvector.load(model = 'wikipedia')\n", "word_vector_wiki = malaya.wordvector.WordVector(embedded_wiki, vocab_wiki)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2022-12-12 13:17:25.918738: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 781670400 exceeds 10% of free system memory.\n", "2022-12-12 13:17:26.159773: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 781670400 exceeds 10% of free system memory.\n", "2022-12-12 13:17:26.165027: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 781670400 exceeds 10% of free system memory.\n", "2022-12-12 13:17:26.261865: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 781670400 exceeds 10% of free system memory.\n" ] }, { "data": { "text/plain": [ "['saya suka makan ayam dan ikan',\n", " 'kamu suka minum ayam dan ayam',\n", " 'anda suka tidur ayam dan ular',\n", " 'kami suka mandi ayam dan keju',\n", " 'aku suka berehat ayam dan lembu']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.augmentation.encoder.wordvector(\n", " ' '.join(tokenizer.tokenize(string)), word_vector_wiki, soft = True\n", ")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2022-12-12 13:17:26.395871: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 781670400 exceeds 10% of free system memory.\n" ] }, { "data": { "text/plain": [ "['Perdana Menteri berkata , beliau perlu memperoleh maklumat terperinci berhubung isu berkenaan sebelum kerajaan dapat 
mengambil sebarang tindakan lanjut . Bagaimanapun , beliau yakin masalah itu dapat diselesaikan dan pentadbiran kerajaan boleh berfungsi dengan baik .',\n", " 'Perdana Menteri berkata , dia perlu mendapatkan data terperinci berhubung isu berkaitan sebelum kerajaan dapat mendapat segala tindakan terperinci . Bagaimanapun , dia yakin gangguan tersebut boleh diselesaikan dan pentadbiran kerajaan dapat berfungsi dengan sempurna .',\n", " 'Perdana Menteri berkata , baginda perlu memperolehi bacaan terperinci berhubung isu tertentu sebelum kerajaan dapat menghabiskan sesuatu tindakan lanjutan . Bagaimanapun , baginda yakin kelemahan ini harus diselesaikan dan pentadbiran kerajaan harus berfungsi dengan kuat .',\n", " 'Perdana Menteri berkata , mereka perlu meraih penjelasan terperinci berhubung isu tersebut sebelum kerajaan dapat mengubah suatu tindakan ringkas . Bagaimanapun , mereka yakin gejala itulah perlu diselesaikan dan pentadbiran kerajaan perlu berfungsi dengan hebat .',\n", " 'Perdana Menteri berkata , saya perlu menerima informasi terperinci berhubung isu berlainan sebelum kerajaan dapat memakan pelbagai tindakan positif . 
Bagaimanapun , saya yakin risiko inilah mampu diselesaikan dan pentadbiran kerajaan akan berfungsi dengan kukuh .']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.augmentation.encoder.wordvector(\n", "    ' '.join(tokenizer.tokenize(text)), word_vector_wiki, soft = True\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Transformer\n", "\n", "The problem with word vectors is that they simply replace a word with a near synonym without understanding the context of the whole sentence. A Transformer model can take that context into account.\n", "\n", "```python\n", "def transformer(\n", "    string: str,\n", "    model,\n", "    threshold: float = 0.5,\n", "    top_p: float = 0.9,\n", "    top_k: int = 100,\n", "    temperature: float = 1.0,\n", "    top_n: int = 5,\n", "):\n", "    \"\"\"\n", "    Augment a string using a transformer + nucleus sampling / top-k sampling.\n", "\n", "    Parameters\n", "    ----------\n", "    string: str\n", "    model: object\n", "        transformer interface object. Currently only BERT, ALBERT and ELECTRA are supported.\n", "    threshold: float, optional (default=0.5)\n", "        random selection for a word.\n", "    top_p: float, optional (default=0.9)\n", "        cumulative sum of probabilities to sample a word.\n", "        If top_p is bigger than 0, the model will use nucleus sampling, else top-k sampling.\n", "    top_k: int, optional (default=100)\n", "        k for top-k sampling.\n", "    temperature: float, optional (default=1.0)\n", "        logits * temperature.\n", "    top_n: int, (default=5)\n", "        number of nearest neighbors returned. The returned result should have length top_n.\n", "\n", "    Returns\n", "    -------\n", "    result: List[str]\n", "    \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Size (MB) Description\n", "bert 425.6 Google BERT BASE parameters\n", "tiny-bert 57.4 Google BERT TINY parameters\n", "albert 48.6 Google ALBERT BASE parameters\n", "tiny-albert 22.4 Google ALBERT TINY parameters\n", "xlnet 446.6 Google XLNET BASE parameters\n", "alxlnet 46.8 Malaya ALXLNET BASE parameters\n", "electra 443 Google ELECTRA BASE parameters\n", "small-electra 55 Google ELECTRA SMALL parameters" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.transformer.available_transformer()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /home/husein/.local/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:206: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use `tf.random.categorical` instead.\n", "INFO:tensorflow:Restoring parameters from /home/husein/Malaya/electra-model/base/electra-base/model.ckpt\n" ] } ], "source": [ "electra = malaya.transformer.load(model = 'electra')" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "de32c6b45dce4597a8ea4b14e091b800", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading: 0%| | 0.00/413M [00:00