Word Vector#

This tutorial is available as an IPython notebook at Malaya/example/wordvector.

Pretrained word2vec#

You can download Malaya pretrained word2vec without needing to import malaya.

word2vec from local news#

size-256

word2vec from wikipedia#

size-256

word2vec from local social media#

size-256

But if you don't know what to do with Malaya word2vec, Malaya provides some useful functions for you!

[1]:
%%time
import malaya
%matplotlib inline
CPU times: user 5.15 s, sys: 907 ms, total: 6.05 s
Wall time: 6.25 s

List available pretrained word2vec#

[2]:
malaya.wordvector.available_wordvector()
[2]:
Size (MB) Vocab size lowercase Description
wikipedia 781.7 763350 True pretrained on Malay wikipedia word2vec size 256
socialmedia 1300 1294638 True pretrained on cleaned Malay twitter and Malay ...
news 200.2 195466 True pretrained on cleaned Malay news size 256
combine 1900 1903143 True pretrained on cleaned Malay news + Malay socia...

Load pretrained word2vec#

def load(model: str = 'wikipedia', **kwargs):

    """
    Load a pretrained word2vec model and return its vocabulary and vector.

    Parameters
    ----------
    model : str, optional (default='wikipedia')
        Model architecture supported. Allowed values:

        * ``'wikipedia'`` - pretrained on Malay wikipedia word2vec size 256.
        * ``'socialmedia'`` - pretrained on cleaned Malay twitter and Malay instagram size 256.
        * ``'news'`` - pretrained on cleaned Malay news size 256.
        * ``'combine'`` - pretrained on cleaned Malay news + Malay social media + Malay wikipedia size 256.

    Returns
    -------
    vocabulary: indices dictionary for `vector`.
    vector: np.array, 2D.
    """
[3]:
vocab_news, embedded_news = malaya.wordvector.load(model = 'news')
vocab_wiki, embedded_wiki = malaya.wordvector.load(model = 'wikipedia')

Load word vector interface#

class WordVector:
    @check_type
    def __init__(self, embed_matrix, dictionary: dict, **kwargs):

        """
        Parameters
        ----------
        embed_matrix: numpy array
        dictionary: dictionary
        """
  1. embed_matrix must be a 2D array, e.g.,

array([[ 0.25      , -0.10816103, -0.19881412, ...,  0.40432587,
         0.19388093, -0.07062137],
       [ 0.3231817 , -0.01318745, -0.17950962, ...,  0.25      ,
         0.08444146, -0.11705721],
       [ 0.29103908, -0.16274083, -0.20255531, ...,  0.25      ,
         0.06253044, -0.16404966],
       ...,
       [ 0.21346697,  0.12686132, -0.4029543 , ...,  0.43466234,
         0.20910986, -0.32219803],
       [ 0.2372157 ,  0.32420087, -0.28036436, ...,  0.2894639 ,
         0.20745888, -0.30600077],
       [ 0.27907744,  0.35755727, -0.34932107, ...,  0.37472805,
         0.42045262, -0.21725406]], dtype=float32)
  2. dictionary, a dictionary mapping words to row indices, e.g. {'word': 0},

{'mengembanfkan': 394623,
 'dipujanya': 234554,
 'comicolor': 182282,
 'immaz': 538660,
 'qabar': 585119,
 'phidippus': 180802,
}
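For illustration, here is a minimal sketch that satisfies both requirements, using a toy vocabulary and random vectors (purely illustrative, not a trained model):

[ ]:
import numpy as np
import malaya

# toy dictionary mapping each word to its row index in the embedding matrix
toy_dictionary = {'saya': 0, 'makan': 1, 'nasi': 2}
# 2D embed_matrix: one 256-dimensional vector per word
toy_matrix = np.random.rand(len(toy_dictionary), 256).astype(np.float32)

toy_wordvector = malaya.wordvector.WordVector(toy_matrix, toy_dictionary)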

Load custom word vector#

For example, to use fastText vectors, download them from https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.ms.vec.

We need to parse the data to get embed_matrix and dictionary.

[ ]:
import io
import numpy as np

fin = io.open('wiki.ms.vec', 'r', encoding='utf-8', newline='\n', errors='ignore')
# the first line of a fastText .vec file holds vocabulary size and vector dimension
n, d = map(int, fin.readline().split())

data, vectors = {}, []
for no, line in enumerate(fin):
    tokens = line.rstrip().split(' ')
    # map each word to its row index in the embedding matrix
    data[tokens[0]] = no
    vectors.append(list(map(float, tokens[1:])))

vectors = np.array(vectors)
fast_text = malaya.wordvector.WordVector(vectors, data)
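Once constructed, the custom object exposes the same interface as the pretrained ones. For example (assuming 'najib' appears in the fastText vocabulary):

[ ]:
# runs only after the wiki.ms.vec parsing cell above
fast_text.n_closest(word='najib', num_closest=5, metric='cosine')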
[5]:
word_vector_news = malaya.wordvector.WordVector(embedded_news, vocab_news)
word_vector_wiki = malaya.wordvector.WordVector(embedded_wiki, vocab_wiki)

Check top-k similar semantics based on a word#

def n_closest(
    self,
    word: str,
    num_closest: int = 5,
    metric: str = 'cosine',
    return_similarity: bool = True,
):
    """
    find nearest words based on a word.

    Parameters
    ----------
    word: str
        Eg, 'najib'
    num_closest: int, (default=5)
        number of words closest to the result.
    metric: str, (default='cosine')
        vector distance algorithm.
    return_similarity: bool, (default=True)
        if True, returns a similarity score between 0 and 1 alongside each word.

    Returns
    -------
    word_list: list of nearest words
    """
[6]:
word = 'anwar'
print("Embedding layer: 8 closest words to: '%s' using malaya news word2vec"%(word))
print(word_vector_news.n_closest(word=word, num_closest=8, metric='cosine'))
Embedding layer: 8 closest words to: 'anwar' using malaya news word2vec
[['najib', 0.6967672109603882], ['mukhriz', 0.675892174243927], ['azmin', 0.6686884164810181], ['rafizi', 0.6465028524398804], ['muhyiddin', 0.6413404941558838], ['daim', 0.6334482431411743], ['khairuddin', 0.6300410032272339], ['shahidan', 0.6269811391830444]]
[12]:
word = 'anwar'
print("Embedding layer: 8 closest words to: '%s' using malaya wiki word2vec"%(word))
print(word_vector_wiki.n_closest(word=word, num_closest=8, metric='cosine'))
Embedding layer: 8 closest words to: 'anwar' using malaya wiki word2vec
[['rasulullah', 0.6918460130691528], ['jamal', 0.6604709029197693], ['noraniza', 0.65153968334198], ['khalid', 0.6450133323669434], ['mahathir', 0.6447468400001526], ['sukarno', 0.641593337059021], ['wahid', 0.6359774470329285], ['pekin', 0.6262176036834717]]

Check batch top-k similar semantics based on a word#

def batch_n_closest(
    self,
    words: List[str],
    num_closest: int = 5,
    return_similarity: bool = False,
    soft: bool = True,
):
    """
    find nearest words based on a batch of words using TensorFlow.

    Parameters
    ----------
    words: list
        Eg, ['najib','anwar']
    num_closest: int, (default=5)
        number of words closest to the result.
    return_similarity: bool, (default=False)
        if True, returns a similarity score between 0 and 1 alongside each word.
    soft: bool, (default=True)
        if True, a word not in the dictionary will be replaced with the word having the nearest JaroWinkler ratio.
        if False, it will throw an exception if a word is not in the dictionary.

    Returns
    -------
    word_list: list of nearest words
    """
[13]:
words = ['anwar', 'mahathir']
word_vector_news.batch_n_closest(words, num_closest=8,
                                 return_similarity=False)
[13]:
[['anwar',
  'najib',
  'mukhriz',
  'azmin',
  'rafizi',
  'muhyiddin',
  'daim',
  'khairuddin'],
 ['mahathir',
  'daim',
  'sahruddin',
  'streram',
  'morsi',
  'anifah',
  'jokowi',
  'ramasamy']]

What happens if a word is not in the dictionary?

You can set the parameter soft to True or False; the default is True.

If True, a word not in the dictionary will be replaced with the word having the nearest JaroWinkler ratio.

If False, it will throw an exception if a word is not in the dictionary.

[14]:
words = ['anwar', 'mahathir','husein-comel']
word_vector_wiki.batch_n_closest(words, num_closest=8,
                                 return_similarity=False,soft=False)
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-14-50a78d59e7a9> in <module>
      1 words = ['anwar', 'mahathir','husein-comel']
      2 word_vector_wiki.batch_n_closest(words, num_closest=8,
----> 3                                  return_similarity=False,soft=False)

~/Documents/Malaya/malaya/wordvector.py in batch_n_closest(self, words, num_closest, return_similarity, soft)
    484                     raise Exception(
    485                         '%s not in dictionary, please use another word or set `soft` = True'
--> 486                         % (words[i])
    487                     )
    488         batches = np.array([self.get_vector_by_name(w) for w in words])

Exception: husein-comel not in dictionary, please use another word or set `soft` = True
[15]:
words = ['anwar', 'mahathir','husein-comel']
word_vector_wiki.batch_n_closest(words, num_closest=8,
                                 return_similarity=False,soft=True)
[15]:
[['anwar',
  'rasulullah',
  'jamal',
  'noraniza',
  'khalid',
  'mahathir',
  'sukarno',
  'wahid'],
 ['mahathir',
  'anwar',
  'wahid',
  'najib',
  'khalid',
  'sukarno',
  'suharto',
  'salahuddin'],
 ['husein',
  'khairi',
  'gccsa',
  'jkrte',
  'montagny',
  'pejudo',
  'badriyyin',
  'naginatajutsu']]

Word2vec calculator#

You can put in any equation you want.

def calculator(
    self,
    equation: str,
    num_closest: int = 5,
    metric: str = 'cosine',
    return_similarity: bool = True,
):
    """
    calculator parser for word2vec.

    Parameters
    ----------
    equation: str
        Eg, '(mahathir + najib) - rosmah'
    num_closest: int, (default=5)
        number of words closest to the result.
    metric: str, (default='cosine')
        vector distance algorithm.
    return_similarity: bool, (default=True)
        if True, returns a similarity score between 0 and 1 alongside each word.

    Returns
    -------
    word_list: list of nearest words
    """
[18]:
word_vector_news.calculator('anwar + amerika + mahathir', num_closest=8, metric='cosine',
                      return_similarity=False)
[18]:
['mahathir',
 'anwar',
 'trump',
 'duterte',
 'netanyahu',
 'jokowi',
 'rusia',
 'kj',
 'obama']
[19]:
word_vector_wiki.calculator('anwar + amerika + mahathir', num_closest=8, metric='cosine',
                      return_similarity=False)
[19]:
['mahathir',
 'anwar',
 'sukarno',
 'suharto',
 'hamas',
 'sparta',
 'amerika',
 'iraq',
 'lubnan']
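The parser also handles parentheses and subtraction, so the analogy-style equation from the docstring works as well (output omitted):

[ ]:
word_vector_news.calculator('(mahathir + najib) - rosmah', num_closest=8, metric='cosine',
                            return_similarity=False)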

Visualize scatter-plot#

def scatter_plot(
    self,
    labels,
    centre: str = None,
    figsize: Tuple[int, int] = (7, 7),
    plus_minus: int = 25,
    handoff: float = 5e-5,
):
    """
    plot a scatter plot based on output from calculator / n_closest / analogy.

    Parameters
    ----------
    labels : list
        output from calculator / n_closest / analogy
    centre : str, (default=None)
        centre label; if a str, it will be annotated in red.
    figsize : tuple, (default=(7, 7))
        figure size for plot.

    Returns
    -------
    tsne: np.array, 2D.
    """
[20]:
word = 'anwar'
result = word_vector_news.n_closest(word=word, num_closest=8, metric='cosine')
data = word_vector_news.scatter_plot(result, centre = word)
_images/load-wordvector_25_0.png
[21]:
word = 'anwar'
result = word_vector_wiki.n_closest(word=word, num_closest=8, metric='cosine')
data = word_vector_wiki.scatter_plot(result, centre = word)
_images/load-wordvector_26_0.png

Visualize tree-plot#

def tree_plot(
    self, labels, figsize: Tuple[int, int] = (7, 7), annotate: bool = True
):
    """
    plot a tree plot based on output from calculator / n_closest / analogy.

    Parameters
    ----------
    labels : list
        output from calculator / n_closest / analogy.
    annotate : bool, (default=True)
        if True, annotate the values on the plot.
    figsize : tuple, (default=(7, 7))
        figure size for plot.

    Returns
    -------
    embed: np.array, 2D.
    labelled: labels for X / Y axis.
    """
[22]:
word = 'anwar'
result = word_vector_news.n_closest(word=word, num_closest=8, metric='cosine')
data = word_vector_news.tree_plot(result)
<Figure size 504x504 with 0 Axes>
_images/load-wordvector_28_1.png
[23]:
word = 'anwar'
result = word_vector_wiki.n_closest(word=word, num_closest=8, metric='cosine')
data = word_vector_wiki.tree_plot(result)
<Figure size 504x504 with 0 Axes>
_images/load-wordvector_29_1.png

Visualize social-network#

def network(
    self,
    word,
    num_closest = 8,
    depth = 4,
    min_distance = 0.5,
    iteration = 300,
    figsize = (15, 15),
    node_color = '#72bbd0',
    node_factor = 50,
):

    """
    plot a social network based on the given word.

    Parameters
    ----------
    word : str
        centre of social network.
    num_closest: int, (default=8)
        number of words closest to the node.
    depth: int, (default=4)
        depth of social network. Deeper networks are more expensive to calculate, O(num_closest ** depth).
    min_distance: float, (default=0.5)
        minimum distance among nodes. Increase the value to increase the distance among nodes.
    iteration: int, (default=300)
        number of loops to train the social network to fit min_distance.
    figsize: tuple, (default=(15, 15))
        figure size for plot.
    node_color: str, (default='#72bbd0')
        color for nodes.
    node_factor: int, (default=50)
        size factor for depth nodes. Increasing this value will increase node sizes based on depth.
    """
[24]:
g = word_vector_news.network('mahathir', figsize = (10, 10), node_factor = 50, depth = 3)
_images/load-wordvector_31_0.png
[25]:
g = word_vector_wiki.network('mahathir', figsize = (10, 10), node_factor = 50, depth = 3)
_images/load-wordvector_32_0.png

Get embedding from a word#

def get_vector_by_name(
    self, word: str, soft: bool = False, topn_soft: int = 5
):
    """
    get vector based on string.

    Parameters
    ----------
    word: str
    soft: bool, (default=False)
        if True, a word not in the dictionary will be replaced with the word having the nearest JaroWinkler ratio.
        if False, it will throw an exception if a word is not in the dictionary.
    topn_soft: int, (default=5)
        if the word is not found in the dictionary, will return the `topn_soft` most similar words using JaroWinkler.

    Returns
    -------
    vector: np.array, 1D
    """
[28]:
word_vector_wiki.get_vector_by_name('najib').shape
[28]:
(256,)
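Because the returned vector is a plain numpy array, you can compute similarities yourself. A minimal cosine-similarity sketch using numpy (not a Malaya API):

[ ]:
import numpy as np

v_najib = word_vector_wiki.get_vector_by_name('najib')
v_anwar = word_vector_wiki.get_vector_by_name('anwar')

# cosine similarity between the two 256-dimensional vectors
print(np.dot(v_najib, v_anwar) / (np.linalg.norm(v_najib) * np.linalg.norm(v_anwar)))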

If a word is not found in the vocabulary, it will throw an exception listing the top-5 nearest words

[26]:
word_vector_wiki.get_vector_by_name('husein-comel')
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-26-0460b04adbfb> in <module>
----> 1 word_vector_wiki.get_vector_by_name('husein-comel')

~/Documents/Malaya/malaya/wordvector.py in get_vector_by_name(self, word)
    127             raise Exception(
    128                 'input not found in dictionary, here top-5 nearest words [%s]'
--> 129                 % (strings)
    130             )
    131         return self._embed_matrix[self._dictionary[word]]

Exception: input not found in dictionary, here top-5 nearest words [husein, husei, husenil, husen, secomel]