Word Vector

This tutorial is available as an IPython notebook at Malaya/example/wordvector.

Pretrained word2vec

You can download the Malaya pretrained word2vec models directly, without importing malaya.

word2vec from local news

size-256

word2vec from wikipedia

size-256

word2vec from local social media

size-256

If you are not sure what to do with the Malaya word2vec models, Malaya provides some useful functions to get you started.

[1]:
%%time
import malaya
%matplotlib inline
CPU times: user 5.15 s, sys: 907 ms, total: 6.05 s
Wall time: 6.25 s

List available pretrained word2vec

[2]:
malaya.wordvector.available_wordvector()
[2]:
             Size (MB)  Vocab size  lowercase  Description
wikipedia    781.7      763350      True       pretrained on Malay wikipedia word2vec size 256
socialmedia  1300       1294638     True       pretrained on cleaned Malay twitter and Malay ...
news         200.2      195466      True       pretrained on cleaned Malay news size 256
combine      1900       1903143     True       pretrained on cleaned Malay news + Malay socia...

Load pretrained word2vec

def load(model: str = 'wikipedia', **kwargs):

    """
    Return the vocabulary and the embedding matrix of a pretrained word vector.

    Parameters
    ----------
    model : str, optional (default='wikipedia')
        Model architecture supported. Allowed values:

        * ``'wikipedia'`` - pretrained on Malay wikipedia word2vec size 256.
        * ``'socialmedia'`` - pretrained on cleaned Malay twitter and Malay instagram size 256.
        * ``'news'`` - pretrained on cleaned Malay news size 256.
        * ``'combine'`` - pretrained on cleaned Malay news + Malay social media + Malay wikipedia size 256.

    Returns
    -------
    vocabulary: indices dictionary for `vector`.
    vector: np.array, 2D.
    """
[3]:
vocab_news, embedded_news = malaya.wordvector.load(model = 'news')
vocab_wiki, embedded_wiki = malaya.wordvector.load(model = 'wikipedia')

Load word vector interface

class WordVector:
    @check_type
    def __init__(self, embed_matrix, dictionary: dict, **kwargs):

        """
        Parameters
        ----------
        embed_matrix: numpy array
        dictionary: dictionary
        """
  1. embed_matrix must be a 2D numpy array, for example:

array([[ 0.25      , -0.10816103, -0.19881412, ...,  0.40432587,
         0.19388093, -0.07062137],
       [ 0.3231817 , -0.01318745, -0.17950962, ...,  0.25      ,
         0.08444146, -0.11705721],
       [ 0.29103908, -0.16274083, -0.20255531, ...,  0.25      ,
         0.06253044, -0.16404966],
       ...,
       [ 0.21346697,  0.12686132, -0.4029543 , ...,  0.43466234,
         0.20910986, -0.32219803],
       [ 0.2372157 ,  0.32420087, -0.28036436, ...,  0.2894639 ,
         0.20745888, -0.30600077],
       [ 0.27907744,  0.35755727, -0.34932107, ...,  0.37472805,
         0.42045262, -0.21725406]], dtype=float32)
  2. dictionary must be a dictionary that maps each word to its row index in embed_matrix, e.g. {'word': 0}:

{'mengembanfkan': 394623,
 'dipujanya': 234554,
 'comicolor': 182282,
 'immaz': 538660,
 'qabar': 585119,
 'phidippus': 180802,
}
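
To make the relationship between the two arguments concrete, here is a minimal sketch with made-up three-dimensional vectors. The words, the numbers and the lookup helper are illustrative assumptions, not taken from any Malaya model or from the library code. Each value in the dictionary is the row index of that word in the matrix.

```python
# Toy data: three hypothetical words with 3-dimensional vectors.
embed_matrix = [
    [0.25, -0.10, -0.19],  # row 0
    [0.32, -0.01, -0.17],  # row 1
    [0.29, -0.16, -0.20],  # row 2
]
dictionary = {'makan': 0, 'minum': 1, 'tidur': 2}


def lookup(word):
    """Return the row of embed_matrix that belongs to `word`."""
    return embed_matrix[dictionary[word]]


# Every index in the dictionary must point at an existing row.
assert all(0 <= i < len(embed_matrix) for i in dictionary.values())

print(lookup('minum'))
```

If an index points past the last row, or two words share an index, lookups silently return the wrong vector, so a check like the one above is cheap insurance before building a WordVector.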

Load custom word vector

For example, to use fastText vectors, download the Malay model from https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.ms.vec

We need to parse the data to get embed_matrix and dictionary.
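
The .vec text layout is simple: the first line holds the number of words and the vector dimension, and every following line holds a word followed by its values. The sketch below parses a tiny, made-up sample from memory so the layout is easy to see. The sample words, the parse_vec helper and the choice to skip malformed lines are assumptions for illustration only.

```python
import io

# A made-up sample in the .vec layout: "<number of words> <dimension>" first,
# then one word and its values per line.
sample = (
    "3 4\n"
    "saya 0.1 0.2 0.3 0.4\n"
    "kamu 0.5 0.6 0.7 0.8\n"
    "dia -0.1 -0.2 -0.3 -0.4\n"
)


def parse_vec(handle):
    """Parse .vec text into (dictionary, vectors)."""
    n_words, dim = map(int, handle.readline().split())
    dictionary, vectors = {}, []
    for line in handle:
        tokens = line.rstrip().split(' ')
        if len(tokens) != dim + 1:
            continue  # skip lines that do not have exactly `dim` values
        # Use the current row count as the index so skipped lines leave no gaps.
        dictionary[tokens[0]] = len(vectors)
        vectors.append([float(t) for t in tokens[1:]])
    return dictionary, vectors


dictionary, vectors = parse_vec(io.StringIO(sample))
print(dictionary)
print(len(vectors), len(vectors[0]))
```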

[ ]:
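import io
import numpy as np
import malaya

data, vectors = {}, []
# Close the file automatically once parsing is done.
with io.open('wiki.ms.vec', 'r', encoding='utf-8', newline='\n', errors='ignore') as fin:
    # The first line holds the vocabulary size and the vector dimension.
    n, d = map(int, fin.readline().split())
    for no, line in enumerate(fin):
        tokens = line.rstrip().split(' ')
        # Map each word to its row index in the embedding matrix.
        data[tokens[0]] = no
        vectors.append(list(map(float, tokens[1:])))

vectors = np.array(vectors)
fast_text = malaya.wordvector.WordVector(vectors, data)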
[5]:
word_vector_news = malaya.wordvector.WordVector(embedded_news, vocab_news)
word_vector_wiki = malaya.wordvector.WordVector(embedded_wiki, vocab_wiki)

Check top-k similar semantics based on a word

def n_closest(
    self,
    word: str,
    num_closest: int = 5,
    metric: str = 'cosine',
    return_similarity: bool = True,
):
    """
    find nearest words based on a word.

    Parameters
    ----------
    word: str
        Eg, 'najib'
    num_closest: int, (default=5)
        number of words closest to the result.
    metric: str, (default='cosine')
        vector distance algorithm.
    return_similarity: bool, (default=True)
        if True, also return a value between 0 and 1 for each word that represents the distance.

    Returns
    -------
    word_list: list of nearest words
    """
[6]:
word = 'anwar'
print("Embedding layer: 8 closest words to: '%s' using malaya news word2vec"%(word))
print(word_vector_news.n_closest(word=word, num_closest=8, metric='cosine'))
Embedding layer: 8 closest words to: 'anwar' using malaya news word2vec
[['najib', 0.6967672109603882], ['mukhriz', 0.675892174243927], ['azmin', 0.6686884164810181], ['rafizi', 0.6465028524398804], ['muhyiddin', 0.6413404941558838], ['daim', 0.6334482431411743], ['khairuddin', 0.6300410032272339], ['shahidan', 0.6269811391830444]]
[12]:
word = 'anwar'
print("Embedding layer: 8 closest words to: '%s' using malaya wiki word2vec"%(word))
print(word_vector_wiki.n_closest(word=word, num_closest=8, metric='cosine'))
Embedding layer: 8 closest words to: 'anwar' using malaya wiki word2vec
[['rasulullah', 0.6918460130691528], ['jamal', 0.6604709029197693], ['noraniza', 0.65153968334198], ['khalid', 0.6450133323669434], ['mahathir', 0.6447468400001526], ['sukarno', 0.641593337059021], ['wahid', 0.6359774470329285], ['pekin', 0.6262176036834717]]
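
The idea behind a cosine nearest-word lookup can be sketched in a few lines of plain Python. This is not Malaya's implementation: the toy vectors and the names cosine and n_closest_sketch are hypothetical and chosen only to show how words are ranked by similarity to a query word.

```python
import math

# Hypothetical 3-dimensional vectors, chosen only for illustration.
toy_vectors = {
    'anwar': [1.0, 0.9, 0.0],
    'najib': [0.9, 1.0, 0.1],
    'mahathir': [0.8, 0.7, 0.3],
    'durian': [0.0, 0.1, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def n_closest_sketch(word, num_closest=5):
    """Rank every other word by cosine similarity to `word`."""
    query = toy_vectors[word]
    scored = [
        [other, cosine(query, vec)]
        for other, vec in toy_vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num_closest]


print(n_closest_sketch('anwar', num_closest=2))
```

With real embeddings the same ranking is normally done with vectorised matrix operations rather than a Python loop, because the vocabulary has hundreds of thousands of rows.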

Check top-k similar semantics for a batch of words

def batch_n_closest(
    self,
    words: List[str],
    num_closest: int = 5,
    return_similarity: bool = False,
    soft: bool = True,
):
    """
    find nearest words based on a batch of words using Tensorflow.

    Parameters
    ----------
    words: list
        Eg, ['najib','anwar']
    num_closest: int, (default=5)
        number of words closest to the result.
    return_similarity: bool, (default=False)
        if True, also return a value between 0 and 1 for each word that represents the distance.
    soft: bool, (default=True)
        if True, a word that is not in the dictionary is replaced with the closest dictionary word by Jaro-Winkler ratio.
        if False, an exception is raised when a word is not in the dictionary.

    Returns
    -------
    word_list: list of nearest words
    """
[13]:
words = ['anwar', 'mahathir']
word_vector_news.batch_n_closest(words, num_closest=8,
                                 return_similarity=False)
[13]:
[['anwar',
  'najib',
  'mukhriz',
  'azmin',
  'rafizi',
  'muhyiddin',
  'daim',
  'khairuddin'],
 ['mahathir',
  'daim',
  'sahruddin',
  'streram',
  'morsi',
  'anifah',
  'jokowi',
  'ramasamy']]

What happens if a word is not in the dictionary?

You can set the parameter soft to True or False. The default is True.

If True, a word that is not in the dictionary is replaced with the closest dictionary word by Jaro-Winkler ratio.

If False, an exception is raised when a word is not in the dictionary.

[14]:
words = ['anwar', 'mahathir','husein-comel']
word_vector_wiki.batch_n_closest(words, num_closest=8,
                                 return_similarity=False,soft=False)
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-14-50a78d59e7a9> in <module>
      1 words = ['anwar', 'mahathir','husein-comel']
      2 word_vector_wiki.batch_n_closest(words, num_closest=8,
----> 3                                  return_similarity=False,soft=False)

~/Documents/Malaya/malaya/wordvector.py in batch_n_closest(self, words, num_closest, return_similarity, soft)
    484                     raise Exception(
    485                         '%s not in dictionary, please use another word or set `soft` = True'
--> 486                         % (words[i])
    487                     )
    488         batches = np.array([self.get_vector_by_name(w) for w in words])

Exception: husein-comel not in dictionary, please use another word or set `soft` = True
[15]:
words = ['anwar', 'mahathir','husein-comel']
word_vector_wiki.batch_n_closest(words, num_closest=8,
                                 return_similarity=False,soft=True)
[15]:
[['anwar',
  'rasulullah',
  'jamal',
  'noraniza',
  'khalid',
  'mahathir',
  'sukarno',
  'wahid'],
 ['mahathir',
  'anwar',
  'wahid',
  'najib',
  'khalid',
  'sukarno',
  'suharto',
  'salahuddin'],
 ['husein',
  'khairi',
  'gccsa',
  'jkrte',
  'montagny',
  'pejudo',
  'badriyyin',
  'naginatajutsu']]
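
The soft behaviour shown above can be approximated with the standard library. The sketch below uses difflib's similarity ratio as a stand-in for Jaro-Winkler, so the scores differ from what Malaya computes; the small vocabulary and the resolve helper are hypothetical.

```python
import difflib

# A tiny hypothetical vocabulary.
vocabulary = ['anwar', 'mahathir', 'husein', 'comel']


def resolve(word, soft=True):
    """Return `word` if known, else its closest vocabulary entry when soft=True."""
    if word in vocabulary:
        return word
    if not soft:
        raise KeyError('%s not in dictionary' % word)
    # cutoff=0.0 guarantees that some candidate is always returned.
    return difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.0)[0]


print(resolve('husein-comel'))
```

Silent replacement is convenient for noisy input, but it can hide typos; passing soft=False is the safer choice when every query word is expected to exist.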

Word2vec calculator

You can pass any equation you like, e.g. '(mahathir + najib) - rosmah'.

def calculator(
    self,
    equation: str,
    num_closest: int = 5,
    metric: str = 'cosine',
    return_similarity: bool = True,
):
    """
    calculator parser for word2vec.

    Parameters
    ----------
    equation: str
        Eg, '(mahathir + najib) - rosmah'
    num_closest: int, (default=5)
        number of words closest to the result.
    metric: str, (default='cosine')
        vector distance algorithm.
    return_similarity: bool, (default=True)
        if True, also return a value between 0 and 1 for each word that represents the distance.

    Returns
    -------
    word_list: list of nearest words
    """
[18]:
word_vector_news.calculator('anwar + amerika + mahathir', num_closest=8, metric='cosine',
                      return_similarity=False)
[18]:
['mahathir',
 'anwar',
 'trump',
 'duterte',
 'netanyahu',
 'jokowi',
 'rusia',
 'kj',
 'obama']
[19]:
word_vector_wiki.calculator('anwar + amerika + mahathir', num_closest=8, metric='cosine',
                      return_similarity=False)
[19]:
['mahathir',
 'anwar',
 'sukarno',
 'suharto',
 'hamas',
 'sparta',
 'amerika',
 'iraq',
 'lubnan']
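
Conceptually, a word vector calculator adds and subtracts the vectors named in the equation and then looks for the words nearest to the result. The sketch below shows that idea on made-up vectors. It is a simplification under stated assumptions: it only understands space-separated + and - (no parentheses), and toy_vectors, evaluate and calculator_sketch are hypothetical names, not Malaya's parser.

```python
import math

# Hypothetical 3-dimensional vectors, chosen only for illustration.
toy_vectors = {
    'raja': [0.9, 0.8, 0.1],
    'lelaki': [0.1, 0.9, 0.0],
    'perempuan': [0.1, 0.0, 0.9],
    'ratu': [0.9, 0.0, 0.9],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def evaluate(equation):
    """Evaluate 'a + b - c' style equations; parentheses are not supported."""
    dim = len(next(iter(toy_vectors.values())))
    result, sign = [0.0] * dim, 1.0
    for token in equation.split():
        if token == '+':
            sign = 1.0
        elif token == '-':
            sign = -1.0
        else:
            result = [r + sign * x for r, x in zip(result, toy_vectors[token])]
    return result


def calculator_sketch(equation, num_closest=5):
    """Return the words whose vectors are nearest to the equation result."""
    target = evaluate(equation)
    ranked = sorted(toy_vectors, key=lambda w: cosine(target, toy_vectors[w]), reverse=True)
    return ranked[:num_closest]


print(calculator_sketch('raja - lelaki + perempuan', num_closest=2))
```

As in the outputs above, the words used in the equation are not filtered out of the result, which is why they can appear among the nearest words.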

Visualize scatter-plot

def scatter_plot(
    self,
    labels,
    centre: str = None,
    figsize: Tuple[int, int] = (7, 7),
    plus_minus: int = 25,
    handoff: float = 5e-5,
):
    """
    plot a scatter plot based on output from calculator / n_closest / analogy.

    Parameters
    ----------
    labels : list
        output from calculator / n_closest / analogy
    centre : str, (default=None)
        centre label, if a str, it will annotate in a red color.
    figsize : tuple, (default=(7, 7))
        figure size for plot.

    Returns
    -------
    tsne: np.array, 2D.
    """
[20]:
word = 'anwar'
result = word_vector_news.n_closest(word=word, num_closest=8, metric='cosine')
data = word_vector_news.scatter_plot(result, centre = word)
_images/load-wordvector_25_0.png
[21]:
word = 'anwar'
result = word_vector_wiki.n_closest(word=word, num_closest=8, metric='cosine')
data = word_vector_wiki.scatter_plot(result, centre = word)
_images/load-wordvector_26_0.png

Visualize tree-plot

def tree_plot(
    self, labels, figsize: Tuple[int, int] = (7, 7), annotate: bool = True
):
    """
    plot a tree plot based on output from calculator / n_closest / analogy.

    Parameters
    ----------
    labels : list
        output from calculator / n_closest / analogy.
    visualize : bool
        if True, it will render plt.show, else return data.
    figsize : tuple, (default=(7, 7))
        figure size for plot.

    Returns
    -------
    embed: np.array, 2D.
    labelled: labels for X / Y axis.
    """
[22]:
word = 'anwar'
result = word_vector_news.n_closest(word=word, num_closest=8, metric='cosine')
data = word_vector_news.tree_plot(result)
<Figure size 504x504 with 0 Axes>
_images/load-wordvector_28_1.png
[23]:
word = 'anwar'
result = word_vector_wiki.n_closest(word=word, num_closest=8, metric='cosine')
data = word_vector_wiki.tree_plot(result)
<Figure size 504x504 with 0 Axes>
_images/load-wordvector_29_1.png

Visualize social-network

def network(
    self,
    word,
    num_closest = 8,
    depth = 4,
    min_distance = 0.5,
    iteration = 300,
    figsize = (15, 15),
    node_color = '#72bbd0',
    node_factor = 50,
):

    """
    plot a social network based on word given

    Parameters
    ----------
    word : str
        centre of social network.
    num_closest: int, (default=8)
        number of words closest to the node.
    depth: int, (default=4)
        depth of social network. The deeper the network, the more expensive it is to calculate, O(num_closest ** depth).
    min_distance: float, (default=0.5)
        minimum distance among nodes. Increase the value to increase the distance among nodes.
    iteration: int, (default=300)
        number of loops to train the social network to fit min_distance.
    figsize: tuple, (default=(15, 15))
        figure size for plot.
    node_color: str, (default='#72bbd0')
        color for nodes.
    node_factor: int, (default=50)
        size factor for depth nodes. Increasing this value increases node sizes based on depth.
    """
[24]:
g = word_vector_news.network('mahathir', figsize = (10, 10), node_factor = 50, depth = 3)
_images/load-wordvector_31_0.png
[25]:
g = word_vector_wiki.network('mahathir', figsize = (10, 10), node_factor = 50, depth = 3)
_images/load-wordvector_32_0.png
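
The O(num_closest ** depth) cost mentioned in the docstring grows quickly. The sketch below computes a worst-case node count, assuming every node expands into num_closest words that have not been seen before; the max_nodes helper is hypothetical, and a real network is usually smaller because neighbours repeat.

```python
def max_nodes(num_closest, depth):
    """Worst-case node count: the centre word plus num_closest ** level per level."""
    return sum(num_closest ** level for level in range(depth + 1))


for depth in (1, 2, 3, 4):
    print(depth, max_nodes(8, depth))
```

Going from depth 3 to depth 4 with num_closest = 8 multiplies the worst case by roughly eight, which is why the examples above use depth = 3.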

Get embedding from a word

def get_vector_by_name(
    self, word: str, soft: bool = False, topn_soft: int = 5
):
    """
    get vector based on string.

    Parameters
    ----------
    word: str
    soft: bool, (default=False)
        if True, a word that is not in the dictionary is replaced with the closest dictionary word by Jaro-Winkler ratio.
        if False, an exception is raised when a word is not in the dictionary.
    topn_soft: int, (default=5)
        if the word is not found in the dictionary, `topn_soft` similar words are suggested using Jaro-Winkler.

    Returns
    -------
    vector: np.array, 1D
    """
[28]:
word_vector_wiki.get_vector_by_name('najib').shape
[28]:
(256,)

If a word is not found in the vocabulary, an exception is raised that lists the top-5 nearest words.

[26]:
word_vector_wiki.get_vector_by_name('husein-comel')
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-26-0460b04adbfb> in <module>
----> 1 word_vector_wiki.get_vector_by_name('husein-comel')

~/Documents/Malaya/malaya/wordvector.py in get_vector_by_name(self, word)
    127             raise Exception(
    128                 'input not found in dictionary, here top-5 nearest words [%s]'
--> 129                 % (strings)
    130             )
    131         return self._embed_matrix[self._dictionary[word]]

Exception: input not found in dictionary, here top-5 nearest words [husein, husei, husenil, husen, secomel]