{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Word Vector"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "\n",
    "This tutorial is available as an IPython notebook at [Malaya/example/wordvector](https://github.com/huseinzol05/Malaya/tree/master/example/wordvector).\n",
    "    \n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pretrained word2vec\n",
    "\n",
    "You can download the Malaya pretrained word vectors directly, without importing malaya.\n",
    "\n",
    "#### word2vec from local news\n",
    "\n",
    "[size-256](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/wordvector#download)\n",
    "\n",
    "#### word2vec from wikipedia\n",
    "\n",
    "[size-256](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/wordvector#download)\n",
    "\n",
    "#### word2vec from local social media\n",
    "\n",
    "[size-256](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/wordvector#download)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 2.87 s, sys: 2.61 s, total: 5.48 s\n",
      "Wall time: 2.65 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "import malaya"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### List available pretrained word2vec"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'news': {'Size (MB)': 200.2,\n",
       "  'Vocab size': 195466,\n",
       "  'lowercase': True,\n",
       "  'Description': 'pretrained on cleaned Malay news',\n",
       "  'dimension': 256},\n",
       " 'wikipedia': {'Size (MB)': 781.7,\n",
       "  'Vocab size': 763350,\n",
       "  'lowercase': True,\n",
       "  'Description': 'pretrained on Malay wikipedia',\n",
       "  'dimension': 256},\n",
       " 'socialmedia': {'Size (MB)': 1300,\n",
       "  'Vocab size': 1294638,\n",
       "  'lowercase': True,\n",
       "  'Description': 'pretrained on cleaned Malay twitter and Malay instagram',\n",
       "  'dimension': 256},\n",
       " 'combine': {'Size (MB)': 1900,\n",
       "  'Vocab size': 1903143,\n",
       "  'lowercase': True,\n",
       "  'Description': 'pretrained on cleaned Malay news + Malay social media + Malay wikipedia',\n",
       "  'dimension': 256},\n",
       " 'socialmedia-v2': {'Size (MB)': 1300,\n",
       "  'Vocab size': 1294638,\n",
       "  'lowercase': True,\n",
       "  'Description': 'pretrained on twitter + lowyat + carigold + b.cari.com.my + facebook + IIUM Confession + Common Crawl',\n",
       "  'dimension': 256}}"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "malaya.wordvector.available_wordvector"
   ]
  },
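  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because the listing above is a plain dictionary keyed by model name, it can be filtered before committing to a large download. The following is a minimal sketch, not part of Malaya: `models_within` is a hypothetical helper, and `available` is a hand-copied subset of the values shown above so that the snippet runs without importing malaya.\n",
    "\n",
    "```python\n",
    "# Hand-copied subset of the metadata shown above (assumption: with malaya\n",
    "# installed, malaya.wordvector.available_wordvector could be used instead).\n",
    "available = {\n",
    "    'news': {'Size (MB)': 200.2, 'Vocab size': 195466},\n",
    "    'wikipedia': {'Size (MB)': 781.7, 'Vocab size': 763350},\n",
    "    'socialmedia': {'Size (MB)': 1300, 'Vocab size': 1294638},\n",
    "    'combine': {'Size (MB)': 1900, 'Vocab size': 1903143},\n",
    "}\n",
    "\n",
    "\n",
    "def models_within(budget_mb, catalogue):\n",
    "    # Hypothetical helper: names of models whose download size fits the\n",
    "    # budget, ordered from smallest to largest.\n",
    "    fitting = [\n",
    "        (meta['Size (MB)'], name)\n",
    "        for name, meta in catalogue.items()\n",
    "        if meta['Size (MB)'] <= budget_mb\n",
    "    ]\n",
    "    return [name for _, name in sorted(fitting)]\n",
    "\n",
    "\n",
    "print(models_within(1000, available))\n",
    "```\n",
    "\n",
    "With a 1000 MB budget, only the `news` and `wikipedia` entries qualify, since the other two listed here are 1300 MB and 1900 MB."
   ]
  },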
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load pretrained word2vec\n",
    "\n",
    "```python\n",
    "def load(model: str = 'wikipedia', **kwargs):\n",
    "\n",
    "    \"\"\"\n",
    "    Return the vocabulary and embedding matrix of a pretrained word vector model.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    model : str, optional (default='wikipedia')\n",
    "        Pretrained model to load. Allowed values:\n",
    "\n",
    "        * ``'wikipedia'`` - pretrained on Malay wikipedia word2vec size 256.\n",
    "        * ``'socialmedia'`` - pretrained on cleaned Malay twitter and Malay instagram size 256.\n",
    "        * ``'news'`` - pretrained on cleaned Malay news size 256.\n",
    "        * ``'combine'`` - pretrained on cleaned Malay news + Malay social media + Malay wikipedia size 256.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    vocabulary: indices dictionary for `vector`.\n",
    "    vector: np.array, 2D.\n",
    "    \"\"\"\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "vocab_news, embedded_news = malaya.wordvector.load(model = 'news')\n",
    "vocab_wiki, embedded_wiki = malaya.wordvector.load(model = 'wikipedia')"
   ]
  }
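  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The docstring describes the returned `vocabulary` as an indices dictionary for `vector`. Assuming that means each word maps to its row in the 2D array, a nearest-neighbour lookup could be sketched as follows. The toy words and 4-dimensional vectors are made up for illustration (the pretrained vectors have 256 dimensions), and `cosine_similarity` and `most_similar` are hypothetical helpers, not Malaya functions.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Toy stand-ins for the (vocabulary, vector) pair returned by load();\n",
    "# the words and values are invented for illustration only.\n",
    "vocab = {'kucing': 0, 'anjing': 1, 'kereta': 2}\n",
    "vectors = np.array([\n",
    "    [0.9, 0.1, 0.0, 0.2],\n",
    "    [0.8, 0.2, 0.1, 0.3],\n",
    "    [0.0, 0.9, 0.8, 0.1],\n",
    "])\n",
    "\n",
    "\n",
    "def cosine_similarity(a, b):\n",
    "    # Cosine of the angle between two 1D vectors.\n",
    "    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))\n",
    "\n",
    "\n",
    "def most_similar(word, vocab, vectors, top_k=2):\n",
    "    # Assumption: vocab maps each word to its row index in vectors.\n",
    "    query = vectors[vocab[word]]\n",
    "    scores = [\n",
    "        (other, cosine_similarity(query, vectors[idx]))\n",
    "        for other, idx in vocab.items()\n",
    "        if other != word\n",
    "    ]\n",
    "    # Highest similarity first.\n",
    "    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]\n",
    "\n",
    "\n",
    "print(most_similar('kucing', vocab, vectors))\n",
    "```\n",
    "\n",
    "With the real data, `vocab_news` and `embedded_news` would take the place of `vocab` and `vectors`, provided the mapping assumption holds. A full scan like this is linear in the vocabulary size, so for the larger models a vectorised matrix product would be a more practical choice."
   ]
  }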
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}