Doc2Vec

This tutorial is available as an IPython notebook at Malaya/example/doc2vec.

This module was trained on both standard and local (including social media) language structures, so it is safe to use for both.

[1]:
%%time
import malaya
CPU times: user 4.19 s, sys: 598 ms, total: 4.79 s
Wall time: 4.16 s
[2]:
string1 = 'Pemuda mogok lapar desak kerajaan prihatin isu iklim'
string2 = 'Perbincangan isu pembalakan perlu babit kerajaan negeri'
string3 = 'kerajaan perlu kisah isu iklim, pemuda mogok lapar'
string4 = 'Kerajaan dicadang tubuh jawatankuasa khas tangani isu alam sekitar'
[3]:
news1 = 'Tun Dr Mahathir Mohamad mengakui pembubaran Parlimen bagi membolehkan pilihan raya diadakan tidak sesuai dilaksanakan pada masa ini berikutan isu COVID-19'
tweet1 = 'DrM sembang pilihan raya tak boleh buat sebab COVID 19'

Doc2Vec using Word Vector

def doc2vec_wordvector(wordvector):
    """
    Doc2vec interface for text similarity using Word Vector.

    Parameters
    ----------
    wordvector : object
        malaya.wordvector.WordVector object.
        should have `get_vector_by_name` method.

    Returns
    -------
    result: malaya.similarity.Doc2VecSimilarity
    """

Using Interface

I will use malaya.wordvector.load(model = 'news'), which is fairly accurate for local issues.

[4]:
%%time

vocab_news, embedded_news = malaya.wordvector.load(model = 'news')
w2v = malaya.wordvector.WordVector(embedded_news, vocab_news)
doc2vec = malaya.similarity.doc2vec_wordvector(w2v)
CPU times: user 178 ms, sys: 118 ms, total: 296 ms
Wall time: 301 ms
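
Conceptually, a word-vector Doc2Vec turns each string into a single vector by aggregating the vectors of its words (the `aggregation` parameter documented below defaults to `numpy.mean`) and then compares the two resulting vectors. The following is only a minimal sketch of that idea using the standard library; `TOY_VECTORS`, `mean_vector` and `cosine` are hypothetical names for illustration and are not part of Malaya.

```python
import math

# Toy 3-dimensional embeddings (made-up numbers); real word vectors
# have hundreds of dimensions.
TOY_VECTORS = {
    'kerajaan': [0.9, 0.1, 0.0],
    'isu': [0.7, 0.3, 0.1],
    'iklim': [0.1, 0.9, 0.2],
    'pemuda': [0.2, 0.1, 0.9],
}


def mean_vector(text, vectors):
    """Average the vectors of the known words in `text`; unknown words are skipped."""
    rows = [vectors[w] for w in text.lower().split() if w in vectors]
    if not rows:
        raise ValueError('no known words in text')
    # zip(*rows) iterates column by column, so each output entry is a column mean.
    return [sum(col) / len(rows) for col in zip(*rows)]


def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


left = mean_vector('kerajaan isu iklim', TOY_VECTORS)
right = mean_vector('pemuda isu iklim', TOY_VECTORS)
score = cosine(left, right)
print(round(score, 4))
```

Because the document vector is just an aggregate of word vectors, word order is ignored, which may explain why reordered sentences such as string1 and string3 score highly.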

predict batch of strings with probability

def predict_proba(
    self,
    left_strings: List[str],
    right_strings: List[str],
    aggregation: Callable = np.mean,
    similarity: str = 'cosine',
    soft: bool = False,
):
    """
    calculate similarity for two different batches of texts.

    Parameters
    ----------
    left_strings : list of str
    right_strings : list of str
    aggregation : Callable, optional (default=numpy.mean)
    similarity : str, optional (default='cosine')
        similarity supported. Allowed values:

        * ``'cosine'`` - cosine similarity.
        * ``'euclidean'`` - euclidean similarity.
        * ``'manhattan'`` - manhattan similarity.
    soft: bool, optional (default=False)
        if True, a word not found in the word vector is replaced with its nearest word; otherwise it is skipped.

    Returns
    -------
    result: List[float]
    """
[5]:
%%time

doc2vec.predict_proba([string1], [string2])
CPU times: user 1.53 ms, sys: 786 µs, total: 2.31 ms
Wall time: 1.68 ms
[5]:
array([0.89971105])
[6]:
%%time

doc2vec.predict_proba([string1, string2], [string3, string4])
CPU times: user 2.55 ms, sys: 1.44 ms, total: 3.99 ms
Wall time: 2.73 ms
[6]:
array([0.91679387, 0.82348571])
[7]:
%%time

doc2vec.predict_proba([string1, string2], [string3, tweet1])
CPU times: user 1.68 ms, sys: 381 µs, total: 2.06 ms
Wall time: 1.75 ms
[7]:
array([0.91679387, 0.78542261])
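
The `soft` parameter described in the docstring above decides what happens to words that are missing from the word vector vocabulary. The sketch below illustrates the two behaviours under a stated assumption: "nearest word" is approximated here with `difflib.get_close_matches` from the standard library, which may differ from the measure Malaya actually uses. `VOCAB` and `resolve_tokens` are hypothetical names.

```python
import difflib

# Hypothetical vocabulary standing in for the word vector's known words.
VOCAB = ['kerajaan', 'isu', 'iklim', 'pemuda', 'pilihan', 'raya']


def resolve_tokens(text, vocab, soft=False):
    """Return the tokens of `text` that can be looked up in `vocab`.

    soft=False: out-of-vocabulary tokens are dropped.
    soft=True: each out-of-vocabulary token is replaced by the closest
    vocabulary entry by string similarity (an assumed measure).
    """
    resolved = []
    for token in text.lower().split():
        if token in vocab:
            resolved.append(token)
        elif soft:
            # cutoff=0.0 always returns the best candidate for a non-empty vocab.
            resolved.extend(difflib.get_close_matches(token, vocab, n=1, cutoff=0.0))
    return resolved


# 'kerajan' is a misspelling that is not in VOCAB.
strict_tokens = resolve_tokens('kerajan isu iklim', VOCAB, soft=False)
soft_tokens = resolve_tokens('kerajan isu iklim', VOCAB, soft=True)
print(strict_tokens)
print(soft_tokens)
```

With `soft=False` the misspelled token simply does not contribute to the document vector, while `soft=True` trades some risk of a wrong substitution for better coverage of noisy text such as tweets.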

visualize heatmap

def heatmap(
    self,
    strings: List[str],
    aggregation: Callable = np.mean,
    similarity: str = 'cosine',
    soft: bool = False,
    visualize: bool = True,
    annotate: bool = True,
    figsize: Tuple[int, int] = (7, 7),
):
    """
    plot a heatmap based on the output of word vector similarity.

    Parameters
    ----------
    strings : list of str
        list of strings
    aggregation : Callable, optional (default=numpy.mean)
    similarity : str, optional (default='cosine')
        similarity supported. Allowed values:

        * ``'cosine'`` - cosine similarity.
        * ``'euclidean'`` - euclidean similarity.
        * ``'manhattan'`` - manhattan similarity.
    soft: bool, optional (default=False)
        if True, a word not found in the word vector is replaced with its nearest word; otherwise it is skipped.
    visualize : bool
        if True, it will render plt.show, else return data.
    figsize : tuple, (default=(7, 7))
        figure size for plot.

    Returns
    -------
    result: list
        list of results.
    """
[8]:
doc2vec.heatmap([string1, string2, string3, string4])
_images/load-doc2vec-similarity_14_0.png

Different similarity functions will return different scores.
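
To see why the choice of `similarity` changes the numbers, here is a small sketch that scores the same pair of vectors three ways. The mapping from a distance to a similarity, `1 / (1 + distance)`, is an assumption made for this illustration and is not confirmed to be what Malaya does internally; all function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_similarity(a, b):
    """Assumed conversion: 1 / (1 + Euclidean distance)."""
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + distance)


def manhattan_similarity(a, b):
    """Assumed conversion: 1 / (1 + Manhattan distance)."""
    distance = sum(abs(x - y) for x, y in zip(a, b))
    return 1.0 / (1.0 + distance)


SIMILARITIES = {
    'cosine': cosine_similarity,
    'euclidean': euclidean_similarity,
    'manhattan': manhattan_similarity,
}

# Two vectors pointing in almost the same direction but with different lengths.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.5]

scores = {name: fn(a, b) for name, fn in SIMILARITIES.items()}
for name, value in scores.items():
    print(f'{name}: {value:.4f}')
```

Cosine only looks at direction, so these two vectors score close to 1, while the distance-based measures also penalise the difference in magnitude and give much lower values.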

Doc2Vec using Vectorizer Model

We can use any vectorizer model provided by Malaya with the encoder similarity interface, for example BERT or XLNET. Again, these encoder models are not trained to do similarity classification; they only encode the strings into vector representations.
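
The interface only requires an object with a `vectorize` method, as the docstring below states. The sketch that follows shows the shape of that contract with a deliberately crude, hypothetical encoder (`CharCountVectorizer`, which counts letters) and a hypothetical `predict_proba_sketch` helper. It assumes that left and right strings are compared pair by pair, as the outputs in this tutorial suggest, and it is not Malaya's implementation.

```python
import math


class CharCountVectorizer:
    """Hypothetical stand-in for an encoder: one 26-dim letter-count vector per string."""

    def vectorize(self, strings):
        vectors = []
        for s in strings:
            counts = [0.0] * 26
            for ch in s.lower():
                if 'a' <= ch <= 'z':
                    counts[ord(ch) - ord('a')] += 1.0
            vectors.append(counts)
        return vectors


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def predict_proba_sketch(vectorizer, left_strings, right_strings):
    """Encode both batches, then score the i-th left string against the i-th right string."""
    if len(left_strings) != len(right_strings):
        raise ValueError('left_strings and right_strings must have the same length')
    left = vectorizer.vectorize(left_strings)
    right = vectorizer.vectorize(right_strings)
    return [cosine(l, r) for l, r in zip(left, right)]


pair_scores = predict_proba_sketch(
    CharCountVectorizer(),
    ['isu iklim', 'kerajaan'],
    ['iklim isu', 'pemuda'],
)
print(pair_scores)
```

Any object exposing the same method could be swapped in, which is what lets the same similarity code work across different transformer encoders.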

def doc2vec_vectorizer(vectorizer):
    """
    Doc2vec interface for text similarity using Vectorizer model.

    Parameters
    ----------
    vectorizer : object
        vectorizer interface object, BERT, XLNET.
        should have `vectorize` method.

    Returns
    -------
    result: malaya.similarity.VectorizerSimilarity
    """

using ALXLNET

[9]:
alxlnet = malaya.transformer.load(model = 'alxlnet')
doc2vec_vectorizer = malaya.similarity.doc2vec_vectorizer(alxlnet)

predict for 2 strings with probability

def predict_proba(
    self,
    left_strings: List[str],
    right_strings: List[str],
    similarity: str = 'cosine',
):
    """
    calculate similarity for two different batches of texts.

    Parameters
    ----------
    left_strings : list of str
    right_strings : list of str
    similarity : str, optional (default='cosine')
        similarity supported. Allowed values:

        * ``'cosine'`` - cosine similarity.
        * ``'euclidean'`` - euclidean similarity.
        * ``'manhattan'`` - manhattan similarity.

    Returns
    -------
    result: List[float]
    """
[11]:
%%time

doc2vec_vectorizer.predict_proba([string1], [string2])
CPU times: user 1.49 s, sys: 103 ms, total: 1.59 s
Wall time: 1.34 s
[11]:
array([0.89992255], dtype=float32)
[12]:
%%time

doc2vec_vectorizer.predict_proba([string1, string2], [string3, string4])
CPU times: user 504 ms, sys: 118 ms, total: 621 ms
Wall time: 139 ms
[12]:
array([0.64460504, 0.63204634], dtype=float32)

visualize heatmap

def heatmap(
    self,
    strings: List[str],
    similarity: str = 'cosine',
    visualize: bool = True,
    annotate: bool = True,
    figsize: Tuple[int, int] = (7, 7),
):
    """
    plot a heatmap based on the output of vectorizer similarity.

    Parameters
    ----------
    strings : list of str
        list of strings.
    similarity : str, optional (default='cosine')
        similarity supported. Allowed values:

        * ``'cosine'`` - cosine similarity.
        * ``'euclidean'`` - euclidean similarity.
        * ``'manhattan'`` - manhattan similarity.
    visualize : bool
        if True, it will render plt.show, else return data.
    figsize : tuple, (default=(7, 7))
        figure size for plot.

    Returns
    -------
    result: list
        list of results
    """
[13]:
doc2vec_vectorizer.heatmap([string1, string2, string3, string4])
_images/load-doc2vec-similarity_23_0.png
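
A heatmap like the ones above is a rendering of a square matrix in which cell (i, j) holds the similarity between string i and string j. As a rough sketch of the data behind such a plot, the code below builds that matrix from toy 2-dimensional vectors using cosine similarity; `similarity_matrix` and the vectors are hypothetical and only illustrate the structure.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """Score every vector against every other vector, including itself."""
    return [[cosine(a, b) for b in vectors] for a in vectors]


# Toy document vectors: the first and third are orthogonal,
# the second lies halfway between them.
vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
matrix = similarity_matrix(vectors)

for row in matrix:
    print(' '.join(f'{value:.2f}' for value in row))
```

The matrix is symmetric with ones on the diagonal, so only the cells above (or below) the diagonal carry information when reading the plot.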