{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lexicon Generator" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/lexicon](https://github.com/huseinzol05/Malaya/tree/master/example/lexicon).\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4.47 s, sys: 1.01 s, total: 5.48 s\n", "Wall time: 5.37 s\n" ] } ], "source": [ "%%time\n", "import malaya\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why lexicon\n", "\n", "Lexicon is populated words related to certain domains, like, words for negative and positive sentiments.\n", "\n", "Example, word `suka` can represent as positive sentiment. If `suka` exists in a sentence, we can say that sentence is positive sentiment.\n", "\n", "Lexicon based is common way people use to classify a text and very fast. Again, it is pretty naive because a word can be semantically ambiguous." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### sentiment lexicon\n", "\n", "Malaya provided a small sample for sentiment lexicon, simply," ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['negative', 'positive'])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentiment_lexicon = malaya.lexicon.sentiment\n", "sentiment_lexicon.keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### emotion lexicon\n", "\n", "Malaya provided a small sample for emotion lexicon, simply," ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['anger', 'fear', 'joy', 'love', 'sadness', 'surprise'])" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "emotion_lexicon = malaya.lexicon.emotion\n", "emotion_lexicon.keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lexicon generator\n", "\n", "To build a lexicon is time consuming, because required expert domains to populate related words to the domains. 
With the help of word vectors, we can induce new words for specific domains given a small annotated lexicon. Why induce a lexicon from word vectors? Even though the word `suka` commonly represents positive sentiment, if the word vector learnt `suka` in contexts with a different polarity, and its nearest words also carry that polarity, then `suka` has a tendency to become negative sentiment.\n", "\n", "Malaya provides a lexicon-inducing interface, built on top of [Inducing Domain-Specific Sentiment Lexicons from Unlabeled Corpora](https://arxiv.org/pdf/1606.02820.pdf).\n", "\n", "Say you have a lexicon based on the standard language, or `bahasa baku`, and you want to find a similar lexicon in a social media context. You can use this `malaya.lexicon` interface for that. To use this interface, we must initiate `malaya.wordvector.load` first.\n", "\n", "We also need at least a small lexicon sample like this,\n", "\n", "```python\n", "{'label1': ['word1', 'word2'], 'label2': ['word3', 'word4']}\n", "```\n", "\n", "There can be more than 2 labels; for example, `malaya.lexicon.emotion` has up to 6 different labels." 
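, "\n", "As a concrete sketch of this format, here is a tiny hypothetical seed lexicon; the Malay seed words below are illustrative only, not an official Malaya lexicon:\n", "\n", "```python\n", "# Hypothetical seed lexicon in the {'label': [words]} format malaya.lexicon expects.\n", "# 'suka' (like) and 'bagus' (good) as positive seeds,\n", "# 'benci' (hate) and 'buruk' (bad) as negative seeds.\n", "seed_lexicon = {\n", "    'positive': ['suka', 'bagus'],\n", "    'negative': ['benci', 'buruk'],\n", "}\n", "print(sorted(seed_lexicon.keys()))\n", "```"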
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "vocab, embedded = malaya.wordvector.load(model = 'socialmedia')\n", "wordvector = malaya.wordvector.WordVector(embedded, vocab)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### random walk\n", "\n", "Random walk technique is main technique use by the paper, can read more at [3.2 Propagating polarities from a seed set](https://arxiv.org/abs/1606.02820)\n", "\n", "```python\n", "\n", "def random_walk(\n", " lexicon,\n", " wordvector,\n", " pool_size = 10,\n", " top_n = 20,\n", " similarity_power = 100.0,\n", " beta = 0.9,\n", " arccos = True,\n", " normalization = True,\n", " soft = False,\n", " silent = False,\n", "):\n", "\n", " \"\"\"\n", " Induce lexicon by using random walk technique, use in paper, https://arxiv.org/pdf/1606.02820.pdf\n", "\n", " Parameters\n", " ----------\n", "\n", " lexicon: dict\n", " curated lexicon from expert domain, {'label1': [str], 'label2': [str]}.\n", " wordvector: object\n", " wordvector interface object.\n", " pool_size: int, optional (default=10)\n", " pick top-pool size from each lexicons.\n", " top_n: int, optional (default=20)\n", " top_n for each vectors will multiple with `similarity_power`.\n", " similarity_power: float, optional (default=100.0)\n", " extra score for `top_n`, less will generate less bias induced but high chance unbalanced outcome.\n", " beta: float, optional (default=0.9)\n", " penalty score, towards to 1.0 means less penalty. 0 < beta < 1.\n", " arccos: bool, optional (default=True)\n", " covariance distribution for embedded.dot(embedded.T). If false, covariance + 1.\n", " normalization: bool, optional (default=True)\n", " normalize word vectors using L2 norm. 
L2 is good to penalize skewed vectors.\n", " soft: bool, optional (default=False)\n", " if True, a word not in the dictionary will be replaced with the word with the nearest Jaro-Winkler ratio.\n", " if False, it will throw an exception if a word is not in the dictionary.\n", " silent: bool, optional (default=False)\n", " if True, will not print any logs.\n", " \n", " Returns\n", " -------\n", " tuple: (labels[argmax(scores, axis = 1)], scores, labels)\n", " \n", " \"\"\"\n", "\n", "```" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "populating nearest words from wordvector\n", "populating vectors from populated nearest words\n", "random walking from populated vectors \n", "\n", "CPU times: user 1min 36s, sys: 16.1 s, total: 1min 52s\n", "Wall time: 28.1 s\n" ] } ], "source": [ "%%time\n", "\n", "results, scores, labels = malaya.lexicon.random_walk(sentiment_lexicon, wordvector, pool_size = 5)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(array(['negative', 'positive'], dtype='