import malaya
string1 = 'Pemuda mogok lapar desak kerajaan prihatin isu iklim'
string2 = 'Perbincangan isu pembalakan perlu babit kerajaan negeri'
string3 = 'kerajaan perlu kisah isu iklim, pemuda mogok lapar'
string4 = 'Kerajaan dicadang tubuh jawatankuasa khas tangani isu alam sekitar'

Calculate similarity using doc2vec

Any word vector interface provided by Malaya can be passed to the doc2vec similarity interface.

Important parameters,

  1. aggregation, aggregation function to accumulate word vectors. Default is mean.
    • 'mean' - mean.
    • 'min' - min.
    • 'max' - max.
    • 'sum' - sum.
    • 'sqrt' - square root.
  2. similarity distance function to calculate similarity. Default is cosine.
    • 'cosine' - cosine similarity.
    • 'euclidean' - euclidean similarity.
    • 'manhattan' - manhattan similarity.
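
To make the two parameters concrete, here is a minimal NumPy sketch of the idea: aggregate word vectors into one document vector, then score two document vectors. It is not Malaya's implementation. The toy vectors are invented, the 'sqrt' aggregation is left out because its exact definition is not stated here, and converting a distance to a similarity with 1 / (1 + distance) is an assumption made for illustration.

```python
import numpy as np

# Toy 3-dimensional word vectors (hypothetical values, not from Malaya).
toy_vectors = {
    'kerajaan': np.array([0.9, 0.1, 0.3]),
    'isu': np.array([0.2, 0.8, 0.5]),
    'iklim': np.array([0.1, 0.7, 0.9]),
    'pembalakan': np.array([0.3, 0.6, 0.8]),
}

AGGREGATIONS = {'mean': np.mean, 'min': np.min, 'max': np.max, 'sum': np.sum}


def doc_vector(text, aggregation='mean'):
    """Aggregate the vectors of known words into one document vector."""
    words = [w for w in text.lower().split() if w in toy_vectors]
    stacked = np.stack([toy_vectors[w] for w in words])
    return AGGREGATIONS[aggregation](stacked, axis=0)


def similarity(a, b, method='cosine'):
    """Score two document vectors; higher means more similar."""
    if method == 'cosine':
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    if method == 'euclidean':
        distance = np.linalg.norm(a - b)
    elif method == 'manhattan':
        distance = np.abs(a - b).sum()
    else:
        raise ValueError(f'unknown method: {method}')
    # Assumption: map a distance to a similarity in (0, 1].
    return float(1.0 / (1.0 + distance))


left = doc_vector('kerajaan isu iklim')
right = doc_vector('kerajaan isu pembalakan')
scores = {m: similarity(left, right, m) for m in ('cosine', 'euclidean', 'manhattan')}
print(scores)
```

The same pair of documents gets a different score under each method, which is why scores from different similarity functions should not be compared with each other directly.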

Using word2vec

I will use load_news here, because loading word2vec trained on Wikipedia takes a very long time. The Wikipedia vectors are much more accurate, though.

embedded_news = malaya.wordvector.load_news(256)
w2v_wiki = malaya.wordvector.load(embedded_news['nce_weights'],
                                    embedded_news['dictionary'])
doc2vec = malaya.similarity.doc2vec(w2v_wiki)

predict for 2 strings

doc2vec.predict(string1, string2, aggregation = 'mean', soft = False)

predict batch of strings

doc2vec.predict_batch([string1, string2], [string3, string4])
array([0.9507282 , 0.88227606], dtype=float32)

visualize tree plot

doc2vec.tree_plot([string1, string2, string3, string4])

Different similarity functions give different percentages.

Calculate similarity using deep encoder

Any encoder model provided by Malaya can be passed to the encoder similarity interface, for example BERT, XLNET, and skip-thought. Note that these encoder models are not trained to do similarity classification; they only encode the strings into vector representations, which are then compared.

Important parameters,

  1. similarity distance function to calculate similarity. Default is cosine.
    • 'cosine' - cosine similarity.
    • 'euclidean' - euclidean similarity.
    • 'manhattan' - manhattan similarity.
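
The encode-then-compare idea can be sketched without any pretrained model. In the sketch below, `toy_encode` is a hypothetical stand-in for a real encoder such as XLNET (it only hashes character trigrams into a count vector), and `pairwise_cosine` is an illustrative helper, not a Malaya function. It shows why a batch call returns one score per pair: the i-th string on the left is compared with the i-th string on the right.

```python
import numpy as np


def toy_encode(strings, dim=64):
    """Stand-in encoder (hypothetical): hash character trigrams into count vectors."""
    vectors = np.zeros((len(strings), dim), dtype=np.float32)
    for row, text in enumerate(strings):
        text = text.lower()
        for i in range(len(text) - 2):
            trigram = text[i:i + 3]
            # Deterministic bucket from code points (avoids Python's salted hash()).
            bucket = sum(ord(c) * (31 ** k) for k, c in enumerate(trigram)) % dim
            vectors[row, bucket] += 1.0
    return vectors


def pairwise_cosine(left, right, encode=toy_encode):
    """Cosine similarity between left[i] and right[i] for every i."""
    a, b = encode(left), encode(right)
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / norms


string1 = 'Pemuda mogok lapar desak kerajaan prihatin isu iklim'
string2 = 'Perbincangan isu pembalakan perlu babit kerajaan negeri'
string3 = 'kerajaan perlu kisah isu iklim, pemuda mogok lapar'
string4 = 'Kerajaan dicadang tubuh jawatankuasa khas tangani isu alam sekitar'

batch_scores = pairwise_cosine([string1, string2], [string3, string4])
print(batch_scores.shape)  # (2,)
```

Swapping `toy_encode` for a real encoder changes the quality of the vectors but not the comparison step.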

using xlnet

xlnet = malaya.xlnet.xlnet(model = 'small')
encoder = malaya.similarity.encoder(xlnet)

predict for 2 strings

encoder.predict(string1, string2)

predict batch of strings

encoder.predict_batch([string1, string2], [string3, string4])
array([0.97005975, 0.9447437 ], dtype=float32)

visualize tree plot

encoder.tree_plot([string1, string2, string3, string4])

BERT model

BERT is the best similarity model in terms of accuracy; you can check similarity accuracy here, https://malaya.readthedocs.io/en/latest/Accuracy.html#similarity. The question is, why BERT?

  1. A transformer model learns the context of a word from all of its surroundings in the input string, bidirectionally, so it understands relationships on both the left-hand and right-hand side much better.
  2. Because the transformer builds context from the input string itself, we do not need a vocabulary that covers every word in the world. Instead, words are split into substrings and attention is computed over those pieces, so BERT avoids the Out-Of-Vocab problem as long as the pieces cover the characters in the input.
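
The substring idea in point 2 can be sketched with a greedy longest-match split in the style of WordPiece. This is an illustration only: the vocabulary below is a hypothetical toy, and the tokenizer actually used by Malaya's BERT models may work differently.

```python
def wordpiece_tokenize(word, vocab, unk='[UNK]'):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # only reached when a character is missing from vocab
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary: a few longer pieces plus single letters as a
# fallback, so any lowercase ASCII word can be split.
letters = 'abcdefghijklmnopqrstuvwxyz'
vocab = {'kerajaan', 'pem', '##balak', '##an'}
vocab |= set(letters) | {'##' + c for c in letters}

print(wordpiece_tokenize('kerajaan', vocab))    # ['kerajaan']
print(wordpiece_tokenize('pembalakan', vocab))  # ['pem', '##balak', '##an']
print(wordpiece_tokenize('iklim', vocab))       # ['i', '##k', '##l', '##i', '##m']
```

A word that is not in the vocabulary, such as 'iklim' here, still gets a representation, only in smaller pieces. The unknown token appears only when a character itself is missing from the vocabulary.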

List available BERT models

['multilanguage', 'base', 'small']
model = malaya.similarity.bert(model = 'base')
model.predict(string1, string3)
model.predict_batch([string1, string2], [string3, string4])
array([0.03622618, 0.03146545], dtype=float32)

