Word Vector
This tutorial is available as an IPython notebook at Malaya/example/wordvector.
Pretrained word2vec#
You can download Malaya's pretrained word vectors without importing malaya.
word2vec from local social media#
[1]:
%%time
import malaya
CPU times: user 2.87 s, sys: 2.61 s, total: 5.48 s
Wall time: 2.65 s
List available pretrained word2vec#
[2]:
malaya.wordvector.available_wordvector
[2]:
{'news': {'Size (MB)': 200.2,
  'Vocab size': 195466,
  'lowercase': True,
  'Description': 'pretrained on cleaned Malay news',
  'dimension': 256},
 'wikipedia': {'Size (MB)': 781.7,
  'Vocab size': 763350,
  'lowercase': True,
  'Description': 'pretrained on Malay wikipedia',
  'dimension': 256},
 'socialmedia': {'Size (MB)': 1300,
  'Vocab size': 1294638,
  'lowercase': True,
  'Description': 'pretrained on cleaned Malay twitter and Malay instagram',
  'dimension': 256},
 'combine': {'Size (MB)': 1900,
  'Vocab size': 1903143,
  'lowercase': True,
  'Description': 'pretrained on cleaned Malay news + Malay social media + Malay wikipedia',
  'dimension': 256},
 'socialmedia-v2': {'Size (MB)': 1300,
  'Vocab size': 1294638,
  'lowercase': True,
  'Description': 'pretrained on twitter + lowyat + carigold + b.cari.com.my + facebook + IIUM Confession + Common Crawl',
  'dimension': 256}}
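Because the listing is a plain dictionary, it can be filtered before downloading anything. The snippet below is a minimal sketch rather than part of Malaya: it copies the sizes from the output above into a local dictionary, and `models_within` is a hypothetical helper name introduced only for illustration.

```python
# Metadata copied from the `available_wordvector` output above.
available = {
    'news': {'Size (MB)': 200.2, 'Vocab size': 195466},
    'wikipedia': {'Size (MB)': 781.7, 'Vocab size': 763350},
    'socialmedia': {'Size (MB)': 1300, 'Vocab size': 1294638},
    'combine': {'Size (MB)': 1900, 'Vocab size': 1903143},
    'socialmedia-v2': {'Size (MB)': 1300, 'Vocab size': 1294638},
}


def models_within(budget_mb, catalog=available):
    """Return model names whose download size fits the budget, smallest first.

    Hypothetical helper, not a Malaya function.
    """
    fitting = [
        name for name, meta in catalog.items()
        if meta['Size (MB)'] <= budget_mb
    ]
    return sorted(fitting, key=lambda name: catalog[name]['Size (MB)'])


print(models_within(1000))  # ['news', 'wikipedia']
```

With malaya installed, the same helper should work on `malaya.wordvector.available_wordvector` directly, assuming it keeps the dictionary shape shown above.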
Load pretrained word2vec#
def load(model: str = 'wikipedia', **kwargs):
    """
    Load a pretrained word vector and return its vocabulary and embedding matrix.

    Parameters
    ----------
    model : str, optional (default='wikipedia')
        Pretrained model to load. Allowed values:

        * ``'wikipedia'`` - pretrained on Malay wikipedia word2vec size 256.
        * ``'socialmedia'`` - pretrained on cleaned Malay twitter and Malay instagram size 256.
        * ``'news'`` - pretrained on cleaned Malay news size 256.
        * ``'combine'`` - pretrained on cleaned Malay news + Malay social media + Malay wikipedia size 256.

    Returns
    -------
    vocabulary: indices dictionary for `vector`.
    vector: np.array, 2D.
    """
[3]:
vocab_news, embedded_news = malaya.wordvector.load(model = 'news')
vocab_wiki, embedded_wiki = malaya.wordvector.load(model = 'wikipedia')
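Each call returns a vocabulary that maps a word to a row index, plus a 2D matrix holding one embedding per row. The sketch below shows that lookup pattern on toy data, so it runs without downloading a model. The three words, their 3-dimensional values, and the helper names `embedding`, `cosine` and `most_similar` are all assumptions made for illustration. Real Malaya vectors have 256 dimensions and come as a NumPy array, but the `vector[vocabulary[word]]` indexing is the same.

```python
import math

# Toy stand-ins for the (vocabulary, vector) pair returned by load().
# `vocab` maps each word to a row index; `vectors` holds one row per word.
vocab = {'kucing': 0, 'anjing': 1, 'kereta': 2}
vectors = [
    [0.9, 0.1, 0.0],  # kucing
    [0.8, 0.2, 0.1],  # anjing
    [0.0, 0.1, 0.9],  # kereta
]


def embedding(word):
    """Look up a word's row through the vocabulary index."""
    return vectors[vocab[word]]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word):
    """Return the other vocabulary word with the highest cosine similarity."""
    target = embedding(word)
    others = [w for w in vocab if w != word]
    return max(others, key=lambda w: cosine(target, embedding(w)))


print(most_similar('kucing'))  # anjing
```

A word missing from the vocabulary raises `KeyError` here. The listing above marks every pretrained model as `'lowercase': True`, so lowercasing a word before the lookup is likely to be needed with the real vocabularies.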