{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Word Vector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/wordvector](https://github.com/huseinzol05/Malaya/tree/master/example/wordvector).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pretrained word2vec\n", "\n", "You can download the Malaya pretrained word vectors directly, without importing malaya.\n", "\n", "#### word2vec from local news\n", "\n", "[size-256](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/wordvector#download)\n", "\n", "#### word2vec from Wikipedia\n", "\n", "[size-256](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/wordvector#download)\n", "\n", "#### word2vec from local social media\n", "\n", "[size-256](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/wordvector#download)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.87 s, sys: 2.61 s, total: 5.48 s\n", "Wall time: 2.65 s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n", "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n" ] } ], "source": [ "%%time\n", "import malaya" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available pretrained word2vec" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'news': {'Size (MB)': 200.2,\n", " 'Vocab size': 195466,\n", " 'lowercase': True,\n", " 'Description': 'pretrained on cleaned Malay news',\n", " 'dimension': 256},\n", " 'wikipedia': {'Size (MB)': 781.7,\n", " 'Vocab size': 763350,\n", " 'lowercase': True,\n", " 'Description': 'pretrained on Malay wikipedia',\n", " 'dimension': 256},\n", " 'socialmedia': {'Size (MB)': 1300,\n", " 'Vocab size': 1294638,\n", " 'lowercase': True,\n", " 'Description': 'pretrained on cleaned Malay twitter and Malay 
instagram',\n", " 'dimension': 256},\n", " 'combine': {'Size (MB)': 1900,\n", " 'Vocab size': 1903143,\n", " 'lowercase': True,\n", " 'Description': 'pretrained on cleaned Malay news + Malay social media + Malay wikipedia',\n", " 'dimension': 256},\n", " 'socialmedia-v2': {'Size (MB)': 1300,\n", " 'Vocab size': 1294638,\n", " 'lowercase': True,\n", " 'Description': 'pretrained on twitter + lowyat + carigold + b.cari.com.my + facebook + IIUM Confession + Common Crawl',\n", " 'dimension': 256}}" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.wordvector.available_wordvector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load pretrained word2vec\n", "\n", "```python\n", "def load(model: str = 'wikipedia', **kwargs):\n", "\n", " \"\"\"\n", " Return malaya.wordvector.WordVector object.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='wikipedia')\n", " Model architecture supported. Allowed values:\n", "\n", " * ``'wikipedia'`` - pretrained on Malay wikipedia word2vec size 256.\n", " * ``'socialmedia'`` - pretrained on cleaned Malay twitter and Malay instagram size 256.\n", " * ``'news'`` - pretrained on cleaned Malay news size 256.\n", " * ``'combine'`` - pretrained on cleaned Malay news + Malay social media + Malay wikipedia size 256.\n", "\n", " Returns\n", " -------\n", " vocabulary: indices dictionary for `vector`.\n", " vector: np.array, 2D.\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "vocab_news, embedded_news = malaya.wordvector.load(model = 'news')\n", "vocab_wiki, embedded_wiki = malaya.wordvector.load(model = 'wikipedia')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", 
"nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 2 }