Word Vector¶
This tutorial is available as an IPython notebook at Malaya/example/wordvector.
Pretrained word2vec¶
You can download the Malaya pretrained word2vec models without needing to import malaya.
word2vec from local news¶
word2vec from wikipedia¶
word2vec from local social media¶
But if you don’t know what to do with Malaya word2vec, Malaya provides some useful functions for you!
[1]:
%%time
import malaya
%matplotlib inline
CPU times: user 5.15 s, sys: 907 ms, total: 6.05 s
Wall time: 6.25 s
List available pretrained word2vec¶
[2]:
malaya.wordvector.available_wordvector()
[2]:
| | Size (MB) | Vocab size | lowercase | Description |
|---|---|---|---|---|
| wikipedia | 781.7 | 763350 | True | pretrained on Malay wikipedia word2vec size 256 |
| socialmedia | 1300 | 1294638 | True | pretrained on cleaned Malay twitter and Malay instagram size 256 |
| news | 200.2 | 195466 | True | pretrained on cleaned Malay news size 256 |
| combine | 1900 | 1903143 | True | pretrained on cleaned Malay news + Malay social media + Malay wikipedia size 256 |
Load pretrained word2vec¶
def load(model: str = 'wikipedia', **kwargs):
"""
Load a pretrained word2vec model and return its vocabulary dictionary and embedding matrix.
Parameters
----------
model : str, optional (default='wikipedia')
Model architecture supported. Allowed values:
* ``'wikipedia'`` - pretrained on Malay wikipedia word2vec size 256.
* ``'socialmedia'`` - pretrained on cleaned Malay twitter and Malay instagram size 256.
* ``'news'`` - pretrained on cleaned Malay news size 256.
* ``'combine'`` - pretrained on cleaned Malay news + Malay social media + Malay wikipedia size 256.
Returns
-------
vocabulary: indices dictionary for `vector`.
vector: np.array, 2D.
"""
[3]:
vocab_news, embedded_news = malaya.wordvector.load(model = 'news')
vocab_wiki, embedded_wiki = malaya.wordvector.load(model = 'wikipedia')
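load returns two objects: the vocabulary dictionary and the 2D embedding matrix. A quick sanity check (a sketch; output not shown here):
[ ]:
# vocabulary is a dict mapping words to row indices,
# embedded is the 2D embedding matrix (vocab size x 256)
len(vocab_news), embedded_news.shape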
Load word vector interface¶
class WordVector:
@check_type
def __init__(self, embed_matrix, dictionary: dict, **kwargs):
"""
Parameters
----------
embed_matrix: numpy array
dictionary: dictionary
"""
embed_matrix must be a 2D array, for example:
array([[ 0.25 , -0.10816103, -0.19881412, ..., 0.40432587,
0.19388093, -0.07062137],
[ 0.3231817 , -0.01318745, -0.17950962, ..., 0.25 ,
0.08444146, -0.11705721],
[ 0.29103908, -0.16274083, -0.20255531, ..., 0.25 ,
0.06253044, -0.16404966],
...,
[ 0.21346697, 0.12686132, -0.4029543 , ..., 0.43466234,
0.20910986, -0.32219803],
[ 0.2372157 , 0.32420087, -0.28036436, ..., 0.2894639 ,
0.20745888, -0.30600077],
[ 0.27907744, 0.35755727, -0.34932107, ..., 0.37472805,
0.42045262, -0.21725406]], dtype=float32)
dictionary must be a dictionary mapping each word to its index, {'word': 0}, for example:
{'mengembanfkan': 394623,
'dipujanya': 234554,
'comicolor': 182282,
'immaz': 538660,
'qabar': 585119,
'phidippus': 180802,
}
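To make this contract concrete, here is a minimal sketch that builds a WordVector from a toy 3-word vocabulary with made-up 4-dimensional vectors (the pretrained models above use size 256):
[ ]:
import numpy as np
import malaya

# toy example: 3 words, 4 dimensions; these values are made up
toy_matrix = np.array([[0.1, 0.2, 0.3, 0.4],
                       [0.5, 0.1, 0.0, 0.2],
                       [0.9, 0.3, 0.7, 0.1]], dtype=np.float32)
toy_dictionary = {'saya': 0, 'makan': 1, 'nasi': 2}  # word -> row index
toy_wv = malaya.wordvector.WordVector(toy_matrix, toy_dictionary)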
Load custom word vector¶
Take fastText as an example: I downloaded the Malay vectors from https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.ms.vec. We need to parse the data to get embed_matrix and dictionary.
[ ]:
import io
import numpy as np

# the first line of a .vec file is "<vocab size> <dimension>"
fin = io.open('wiki.ms.vec', 'r', encoding='utf-8', newline='\n', errors='ignore')
n, d = map(int, fin.readline().split())

data, vectors = {}, []
for no, line in enumerate(fin):
    # each remaining line is "<word> <float> <float> ..."
    tokens = line.rstrip().split(' ')
    data[tokens[0]] = no  # map word -> row index
    vectors.append(list(map(float, tokens[1:])))

vectors = np.array(vectors)  # 2D embed_matrix
fast_text = malaya.wordvector.WordVector(vectors, data)
[5]:
word_vector_news = malaya.wordvector.WordVector(embedded_news, vocab_news)
word_vector_wiki = malaya.wordvector.WordVector(embedded_wiki, vocab_wiki)
Check top-k similar semantics based on a word¶
def n_closest(
self,
word: str,
num_closest: int = 5,
metric: str = 'cosine',
return_similarity: bool = True,
):
"""
find nearest words based on a word.
Parameters
----------
word: str
Eg, 'najib'
num_closest: int, (default=5)
number of words closest to the result.
metric: str, (default='cosine')
vector distance algorithm.
return_similarity: bool, (default=True)
if True, will return a score between 0 and 1 representing the distance.
Returns
-------
word_list: list of nearest words
"""
[6]:
word = 'anwar'
print("Embedding layer: 8 closest words to: '%s' using malaya news word2vec"%(word))
print(word_vector_news.n_closest(word=word, num_closest=8, metric='cosine'))
Embedding layer: 8 closest words to: 'anwar' using malaya news word2vec
[['najib', 0.6967672109603882], ['mukhriz', 0.675892174243927], ['azmin', 0.6686884164810181], ['rafizi', 0.6465028524398804], ['muhyiddin', 0.6413404941558838], ['daim', 0.6334482431411743], ['khairuddin', 0.6300410032272339], ['shahidan', 0.6269811391830444]]
[12]:
word = 'anwar'
print("Embedding layer: 8 closest words to: '%s' using malaya wiki word2vec"%(word))
print(word_vector_wiki.n_closest(word=word, num_closest=8, metric='cosine'))
Embedding layer: 8 closest words to: 'anwar' using malaya wiki word2vec
[['rasulullah', 0.6918460130691528], ['jamal', 0.6604709029197693], ['noraniza', 0.65153968334198], ['khalid', 0.6450133323669434], ['mahathir', 0.6447468400001526], ['sukarno', 0.641593337059021], ['wahid', 0.6359774470329285], ['pekin', 0.6262176036834717]]
Check batch top-k similar semantics based on a word¶
def batch_n_closest(
self,
words: List[str],
num_closest: int = 5,
return_similarity: bool = False,
soft: bool = True,
):
"""
find nearest words for a batch of words using TensorFlow.
Parameters
----------
words: list
Eg, ['najib','anwar']
num_closest: int, (default=5)
number of words closest to the result.
return_similarity: bool, (default=False)
if True, will return a score between 0 and 1 representing the distance.
soft: bool, (default=True)
if True, a word not in the dictionary will be replaced with the nearest word by JaroWinkler ratio.
if False, it will throw an exception if a word is not in the dictionary.
Returns
-------
word_list: list of nearest words
"""
[13]:
words = ['anwar', 'mahathir']
word_vector_news.batch_n_closest(words, num_closest=8,
return_similarity=False)
[13]:
[['anwar',
'najib',
'mukhriz',
'azmin',
'rafizi',
'muhyiddin',
'daim',
'khairuddin'],
['mahathir',
'daim',
'sahruddin',
'streram',
'morsi',
'anifah',
'jokowi',
'ramasamy']]
What happens if a word is not in the dictionary?
You can set the parameter soft to True or False. Default is True.
If True, a word not in the dictionary will be replaced with the nearest word by JaroWinkler ratio.
If False, it will throw an exception if a word is not in the dictionary.
[14]:
words = ['anwar', 'mahathir','husein-comel']
word_vector_wiki.batch_n_closest(words, num_closest=8,
return_similarity=False, soft=False)
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-14-50a78d59e7a9> in <module>
1 words = ['anwar', 'mahathir','husein-comel']
2 word_vector_wiki.batch_n_closest(words, num_closest=8,
----> 3 return_similarity=False,soft=False)
~/Documents/Malaya/malaya/wordvector.py in batch_n_closest(self, words, num_closest, return_similarity, soft)
484 raise Exception(
485 '%s not in dictionary, please use another word or set `soft` = True'
--> 486 % (words[i])
487 )
488 batches = np.array([self.get_vector_by_name(w) for w in words])
Exception: husein-comel not in dictionary, please use another word or set `soft` = True
[15]:
words = ['anwar', 'mahathir','husein-comel']
word_vector_wiki.batch_n_closest(words, num_closest=8,
return_similarity=False, soft=True)
[15]:
[['anwar',
'rasulullah',
'jamal',
'noraniza',
'khalid',
'mahathir',
'sukarno',
'wahid'],
['mahathir',
'anwar',
'wahid',
'najib',
'khalid',
'sukarno',
'suharto',
'salahuddin'],
['husein',
'khairi',
'gccsa',
'jkrte',
'montagny',
'pejudo',
'badriyyin',
'naginatajutsu']]
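Notice the third result: husein-comel is not in the dictionary, so with the default soft=True it was silently replaced by husein, its nearest match by JaroWinkler ratio, and the neighbours returned belong to that replacement.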
Word2vec calculator¶
You can put in any equation you want.
def calculator(
self,
equation: str,
num_closest: int = 5,
metric: str = 'cosine',
return_similarity: bool = True,
):
"""
calculator parser for word2vec.
Parameters
----------
equation: str
Eg, '(mahathir + najib) - rosmah'
num_closest: int, (default=5)
number of words closest to the result.
metric: str, (default='cosine')
vector distance algorithm.
return_similarity: bool, (default=True)
if True, will return a score between 0 and 1 representing the distance.
Returns
-------
word_list: list of nearest words
"""
[18]:
word_vector_news.calculator('anwar + amerika + mahathir', num_closest=8, metric='cosine',
return_similarity=False)
[18]:
['mahathir',
'anwar',
'trump',
'duterte',
'netanyahu',
'jokowi',
'rusia',
'kj',
'obama']
[19]:
word_vector_wiki.calculator('anwar + amerika + mahathir', num_closest=8, metric='cosine',
return_similarity=False)
[19]:
['mahathir',
'anwar',
'sukarno',
'suharto',
'hamas',
'sparta',
'amerika',
'iraq',
'lubnan']
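The parser also handles subtraction and parentheses. As a sketch (not executed here), you can try the equation from the docstring example:
[ ]:
# sketch only: a subtraction-style equation; the output depends on the model
word_vector_news.calculator('(mahathir + najib) - rosmah', num_closest=8,
                            metric='cosine', return_similarity=False)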
Visualize scatter-plot¶
def scatter_plot(
self,
labels,
centre: str = None,
figsize: Tuple[int, int] = (7, 7),
plus_minus: int = 25,
handoff: float = 5e-5,
):
"""
plot a scatter plot based on output from calculator / n_closest / analogy.
Parameters
----------
labels : list
output from calculator / n_closest / analogy
centre : str, (default=None)
centre label; if a str, it will be annotated in red.
figsize : tuple, (default=(7, 7))
figure size for plot.
Returns
-------
tsne: np.array, 2D.
"""
[20]:
word = 'anwar'
result = word_vector_news.n_closest(word=word, num_closest=8, metric='cosine')
data = word_vector_news.scatter_plot(result, centre = word)

[21]:
word = 'anwar'
result = word_vector_wiki.n_closest(word=word, num_closest=8, metric='cosine')
data = word_vector_wiki.scatter_plot(result, centre = word)

Visualize tree-plot¶
def tree_plot(
self, labels, figsize: Tuple[int, int] = (7, 7), annotate: bool = True
):
"""
plot a tree plot based on output from calculator / n_closest / analogy.
Parameters
----------
labels : list
output from calculator / n_closest / analogy.
annotate : bool, (default=True)
if True, annotate the labels on the plot.
figsize : tuple, (default=(7, 7))
figure size for plot.
Returns
-------
embed: np.array, 2D.
labelled: labels for X / Y axis.
"""
[22]:
word = 'anwar'
result = word_vector_news.n_closest(word=word, num_closest=8, metric='cosine')
data = word_vector_news.tree_plot(result)
<Figure size 504x504 with 0 Axes>

[23]:
word = 'anwar'
result = word_vector_wiki.n_closest(word=word, num_closest=8, metric='cosine')
data = word_vector_wiki.tree_plot(result)
<Figure size 504x504 with 0 Axes>

Visualize social-network¶
def network(
self,
word,
num_closest = 8,
depth = 4,
min_distance = 0.5,
iteration = 300,
figsize = (15, 15),
node_color = '#72bbd0',
node_factor = 50,
):
"""
plot a social network based on the word given.
Parameters
----------
word : str
centre of social network.
num_closest: int, (default=8)
number of words closest to the node.
depth: int, (default=4)
depth of social network. Deeper is more expensive to calculate, O(num_closest ** depth).
min_distance: float, (default=0.5)
minimum distance among nodes. Increase the value to increase the distance among nodes.
iteration: int, (default=300)
number of loops to train the social network to fit min_distance.
figsize: tuple, (default=(15, 15))
figure size for plot.
node_color: str, (default='#72bbd0')
color for nodes.
node_factor: int, (default=50)
size factor for depth nodes. Increasing this value will increase node sizes based on depth.
"""
[24]:
g = word_vector_news.network('mahathir', figsize = (10, 10), node_factor = 50, depth = 3)

[25]:
g = word_vector_wiki.network('mahathir', figsize = (10, 10), node_factor = 50, depth = 3)

Get embedding from a word¶
def get_vector_by_name(
self, word: str, soft: bool = False, topn_soft: int = 5
):
"""
get vector based on string.
Parameters
----------
word: str
soft: bool, (default=False)
if True, a word not in the dictionary will be replaced with the nearest word by JaroWinkler ratio.
if False, it will throw an exception if a word is not in the dictionary.
topn_soft: int, (default=5)
if the word is not found in the dictionary, will return `topn_soft` similar words using JaroWinkler.
Returns
-------
vector: np.array, 1D
"""
[28]:
word_vector_wiki.get_vector_by_name('najib').shape
[28]:
(256,)
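Because get_vector_by_name returns a plain 1D numpy array, you can work with it directly outside Malaya. For example, a quick cosine similarity between two words using only numpy (a sketch, not part of the Malaya API):
[ ]:
import numpy as np

v1 = word_vector_wiki.get_vector_by_name('najib')
v2 = word_vector_wiki.get_vector_by_name('anwar')

# cosine similarity = dot(v1, v2) / (|v1| * |v2|)
np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))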
If a word is not found in the vocabulary, it will throw an exception with the top-5 nearest words:
[26]:
word_vector_wiki.get_vector_by_name('husein-comel')
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-26-0460b04adbfb> in <module>
----> 1 word_vector_wiki.get_vector_by_name('husein-comel')
~/Documents/Malaya/malaya/wordvector.py in get_vector_by_name(self, word)
127 raise Exception(
128 'input not found in dictionary, here top-5 nearest words [%s]'
--> 129 % (strings)
130 )
131 return self._embed_matrix[self._dictionary[word]]
Exception: input not found in dictionary, here top-5 nearest words [husein, husei, husenil, husen, secomel]
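Passing soft=True substitutes the nearest JaroWinkler match before the lookup, as documented above, so the same query should return a vector instead of raising (a sketch; not executed here):
[ ]:
# with soft=True, 'husein-comel' should be replaced by its nearest
# JaroWinkler match (per the docstring) instead of raising an exception
word_vector_wiki.get_vector_by_name('husein-comel', soft=True).shape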