Word Vector#

This tutorial is available as an IPython notebook at Malaya/example/wordvector.

Pretrained word2vec#

You can download the Malaya pretrained word vectors directly, without importing malaya.

word2vec from local news#

size-256

word2vec from wikipedia#

size-256

word2vec from local social media#

size-256

[1]:
%%time
import malaya
CPU times: user 2.87 s, sys: 2.61 s, total: 5.48 s
Wall time: 2.65 s
/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))

List available pretrained word2vec#

[2]:
malaya.wordvector.available_wordvector
[2]:
{'news': {'Size (MB)': 200.2,
  'Vocab size': 195466,
  'lowercase': True,
  'Description': 'pretrained on cleaned Malay news',
  'dimension': 256},
 'wikipedia': {'Size (MB)': 781.7,
  'Vocab size': 763350,
  'lowercase': True,
  'Description': 'pretrained on Malay wikipedia',
  'dimension': 256},
 'socialmedia': {'Size (MB)': 1300,
  'Vocab size': 1294638,
  'lowercase': True,
  'Description': 'pretrained on cleaned Malay twitter and Malay instagram',
  'dimension': 256},
 'combine': {'Size (MB)': 1900,
  'Vocab size': 1903143,
  'lowercase': True,
  'Description': 'pretrained on cleaned Malay news + Malay social media + Malay wikipedia',
  'dimension': 256},
 'socialmedia-v2': {'Size (MB)': 1300,
  'Vocab size': 1294638,
  'lowercase': True,
  'Description': 'pretrained on twitter + lowyat + carigold + b.cari.com.my + facebook + IIUM Confession + Common Crawl',
  'dimension': 256}}
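The listing above can be used to decide which model fits in memory before downloading anything. The following is a minimal sketch that copies the numbers from the output above into a plain dictionary (so it runs without malaya installed) and sorts the models by download size. The helper `float32_size_mb` is hypothetical and not part of Malaya; it rests on the assumption that each model is stored as a dense float32 matrix of shape (vocab size, dimension). The listed sizes are consistent with that, but the documentation does not state it.

```python
# Numbers copied from the `malaya.wordvector.available_wordvector` output above
# (the 'lowercase' and 'Description' fields are omitted for brevity).
available = {
    'news': {'Size (MB)': 200.2, 'Vocab size': 195466, 'dimension': 256},
    'wikipedia': {'Size (MB)': 781.7, 'Vocab size': 763350, 'dimension': 256},
    'socialmedia': {'Size (MB)': 1300, 'Vocab size': 1294638, 'dimension': 256},
    'combine': {'Size (MB)': 1900, 'Vocab size': 1903143, 'dimension': 256},
    'socialmedia-v2': {'Size (MB)': 1300, 'Vocab size': 1294638, 'dimension': 256},
}


def float32_size_mb(info):
    """Hypothetical helper: size in MB of a dense float32 matrix
    of shape (vocab size, dimension). Assumes 4 bytes per value."""
    return info['Vocab size'] * info['dimension'] * 4 / 1e6


# Sort model names from smallest to largest download.
by_size = sorted(available, key=lambda name: available[name]['Size (MB)'])
smallest = by_size[0]

for name in by_size:
    info = available[name]
    print(f"{name:15s} listed={info['Size (MB)']:>7.1f} MB  "
          f"estimated={float32_size_mb(info):>7.1f} MB  "
          f"vocab={info['Vocab size']:>9,d}")
```

With the real library, the same loop should work on `malaya.wordvector.available_wordvector` directly, since it is shown above as a dictionary with these keys.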

Load pretrained word2vec#

def load(model: str = 'wikipedia', **kwargs):

    """
    Return the pretrained word vectors as a (vocabulary, vector) pair.

    Parameters
    ----------
    model : str, optional (default='wikipedia')
        Model architecture supported. Allowed values:

        * ``'wikipedia'`` - pretrained on Malay wikipedia word2vec size 256.
        * ``'socialmedia'`` - pretrained on cleaned Malay twitter and Malay instagram size 256.
        * ``'news'`` - pretrained on cleaned Malay news size 256.
        * ``'combine'`` - pretrained on cleaned Malay news + Malay social media + Malay wikipedia size 256.

    Returns
    -------
    vocabulary: indices dictionary for `vector`.
    vector: np.array, 2D.
    """
[3]:
vocab_news, embedded_news = malaya.wordvector.load(model='news')
vocab_wiki, embedded_wiki = malaya.wordvector.load(model='wikipedia')
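
The two values returned by `load` can be used on their own, before wrapping them in any higher-level object. The sketch below shows one way to look up a word's vector and compare two words by cosine similarity. It uses a tiny made-up vocabulary and matrix so that it runs without downloading a model. It assumes the vocabulary maps each word to its row index in the matrix, which is how the docstring's "indices dictionary" is read here. The names `toy_vocab`, `toy_vectors`, `lookup` and `cosine` are illustrative and not part of Malaya.

```python
import math

# Toy stand-ins for the (vocabulary, vector) pair returned by load().
# Assumption: the vocabulary maps word -> row index in the matrix.
toy_vocab = {'makan': 0, 'minum': 1, 'kereta': 2}
toy_vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
]


def lookup(word, vocab, vectors):
    """Return the embedding row for `word`, or None if it is not in the vocabulary."""
    # The available models are listed as lowercase, so lowercase the query too.
    index = vocab.get(word.lower())
    return None if index is None else vectors[index]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


makan = lookup('Makan', toy_vocab, toy_vectors)
minum = lookup('minum', toy_vocab, toy_vectors)
kereta = lookup('kereta', toy_vocab, toy_vectors)

sim_close = cosine(makan, minum)
sim_far = cosine(makan, kereta)
print(f"makan~minum:  {sim_close:.3f}")
print(f"makan~kereta: {sim_far:.3f}")
```

Under the same assumption, `lookup(word, vocab_news, embedded_news)` should return a 256-dimensional row of the real matrix, and `cosine` accepts NumPy rows as well as plain lists because it only iterates over the values.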