Semantic Similarity#

This tutorial is available as an IPython notebook at Malaya/example/semantic-similarity.

This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.

This interface is deprecated; use the HuggingFace interface instead.

import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'
import logging

import malaya
CPU times: user 3.07 s, sys: 3.7 s, total: 6.77 s
Wall time: 2.18 s
import warnings
string1 = 'Pemuda mogok lapar desak kerajaan prihatin isu iklim'
string2 = 'Perbincangan isu pembalakan perlu babit kerajaan negeri'
string3 = 'kerajaan perlu kisah isu iklim, pemuda mogok lapar'
string4 = 'Kerajaan dicadang tubuh jawatankuasa khas tangani isu alam sekitar'
news1 = 'Tun Dr Mahathir Mohamad mengakui pembubaran Parlimen bagi membolehkan pilihan raya diadakan tidak sesuai dilaksanakan pada masa ini berikutan isu COVID-19'
tweet1 = 'DrM sembang pilihan raya tak boleh buat sebab COVID 19'

List available Transformer models#

malaya.similarity.semantic.available_transformer()
/home/husein/dev/malaya/malaya/similarity/ DeprecationWarning: `malaya.similarity.semantic.available_transformer` is deprecated, use `malaya.similarity.semantic.available_huggingface` instead
INFO:malaya.similarity.semantic:tested on matched dev set translated MNLI,
             Size (MB)  Quantized Size (MB)  macro precision  macro recall  macro f1-score
bert             423.4                111.0          0.88315       0.88656         0.88405
tiny-bert         56.6                 15.0          0.87210       0.87546         0.87292
albert            48.3                 12.8          0.87164       0.87146         0.87155
tiny-albert       21.9                  6.0          0.82234       0.82383         0.82295
xlnet            448.7                119.0          0.80866       0.76775         0.77112
alxlnet           49.0                 13.9          0.88756       0.88700         0.88727
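
The table can also be used programmatically when choosing a model. The sketch below copies the figures above into a plain dictionary and picks the model with the highest macro f1-score under a size budget. `pick_model` and the 60 MB budget are illustrative assumptions, not part of Malaya.

```python
# Figures copied from the table above:
# (size in MB, quantized size in MB, macro f1-score).
MODELS = {
    'bert': (423.4, 111.0, 0.88405),
    'tiny-bert': (56.6, 15.0, 0.87292),
    'albert': (48.3, 12.8, 0.87155),
    'tiny-albert': (21.9, 6.0, 0.82295),
    'xlnet': (448.7, 119.0, 0.77112),
    'alxlnet': (49.0, 13.9, 0.88727),
}


def pick_model(max_size_mb, quantized=False):
    """Return the highest-f1 model whose size fits the budget, or None."""
    size_index = 1 if quantized else 0
    candidates = [
        (stats[2], name)
        for name, stats in MODELS.items()
        if stats[size_index] <= max_size_mb
    ]
    # max() on (f1, name) tuples selects the entry with the highest f1-score.
    return max(candidates)[1] if candidates else None


best = pick_model(max_size_mb=60)
print(best)  # alxlnet
```

The returned name can be passed as the `model` argument when loading a model.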

Load transformer model#

def transformer(model: str = 'bert', quantized: bool = False, **kwargs):
    """
    Load Transformer similarity model.

    Parameters
    ----------
    model: str, optional (default='bert')
        Check available models at `malaya.similarity.semantic.available_transformer()`.
    quantized: bool, optional (default=False)
        If True, will load the 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result: model
        List of model classes:

        * if `bert` in model, will return `malaya.model.bert.SiameseBERT`.
        * if `xlnet` in model, will return `malaya.model.xlnet.SiameseXLNET`.
    """
model = malaya.similarity.semantic.transformer(model = 'alxlnet')
/home/husein/dev/malaya/malaya/similarity/ DeprecationWarning: `malaya.similarity.semantic.transformer` is deprecated, use `malaya.similarity.semantic.huggingface` instead

Load Quantized model#

To load the 8-bit quantized model, simply pass quantized = True; the default is False.

Expect a slight accuracy drop from the quantized model. It is also not necessarily faster than the normal 32-bit float model; that depends entirely on the machine.

quantized_model = malaya.similarity.semantic.transformer(model = 'alxlnet', quantized = True)
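
Because the speed of the quantized model depends on the machine, it is worth measuring on the target hardware. A minimal timing sketch follows. `time_call` is a hypothetical helper, and `fake_predict` is only a stand-in so the example runs without Malaya; in practice you would pass `model.predict_proba` and `quantized_model.predict_proba` with the same inputs and compare the two timings.

```python
import time


def time_call(fn, *args, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` calls of fn(*args)."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def fake_predict(strings_left, strings_right):
    # Stand-in for a model's predict_proba (assumption for illustration only).
    return [0.5 for _ in zip(strings_left, strings_right)]


elapsed = time_call(fake_predict, ['a', 'b'], ['c', 'd'])
print('best of 5 runs: %.6f s' % elapsed)
```

Taking the best of several runs reduces noise from one-off costs such as warm-up on the first call.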

Predict batch of strings with probability#

def predict_proba(self, strings_left: List[str], strings_right: List[str]):
    """
    Calculate similarity for two different batches of texts.

    Parameters
    ----------
    strings_left : List[str]
    strings_right : List[str]

    Returns
    -------
    result : List[float]
    """

You need to give a list of left strings and a list of right strings.

The first left string is compared with the first right string, and so on.

The similarity model only supports predict_proba.
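
Because the comparison is element-wise, comparing every left string against every right string requires expanding both lists first. A minimal sketch, where `expand_pairs` is a hypothetical helper and not part of Malaya:

```python
from itertools import product


def expand_pairs(lefts, rights):
    """Build two equal-length lists covering every (left, right) combination."""
    pairs = list(product(lefts, rights))
    strings_left = [left for left, _ in pairs]
    strings_right = [right for _, right in pairs]
    return strings_left, strings_right


strings_left, strings_right = expand_pairs(['a', 'b'], ['x', 'y', 'z'])
print(strings_left)   # ['a', 'a', 'a', 'b', 'b', 'b']
print(strings_right)  # ['x', 'y', 'z', 'x', 'y', 'z']
```

The two lists can then be passed to `predict_proba`, and the flat result read back in groups of `len(rights)` scores per left string.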

model.predict_proba([string1, string2, news1, news1], [string3, string4, tweet1, string1])
array([0.99337685, 0.01469913, 0.5436511 , 0.44653463], dtype=float32)
quantized_model.predict_proba([string1, string2, news1, news1], [string3, string4, tweet1, string1])
array([0.99733776, 0.00935277, 0.97150946, 0.7315555 ], dtype=float32)
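
The returned probabilities are often turned into yes/no decisions. The sketch below applies a 0.5 cut-off to the two outputs shown above. The threshold and the `is_similar` helper are assumptions for illustration; a suitable cut-off depends on the application.

```python
# Scores copied from the two outputs above.
scores = [0.99337685, 0.01469913, 0.5436511, 0.44653463]
quantized_scores = [0.99733776, 0.00935277, 0.97150946, 0.7315555]


def is_similar(probabilities, threshold=0.5):
    """Mark each pair as similar when its probability reaches the threshold."""
    return [p >= threshold for p in probabilities]


print(is_similar(scores))            # [True, False, True, False]
print(is_similar(quantized_scores))  # [True, False, True, True]
```

The last pair flips between the two models, which illustrates that quantization can change decisions for scores near the cut-off.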


Let's say you want to visualize sentences in a lower dimension; you can use model.vectorize:

def vectorize(self, strings: List[str]):
    """
    Vectorize list of strings.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: np.array
    """
texts = [string1, string2, string3, string4, news1, tweet1]
r = quantized_model.vectorize(texts)
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

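# Recent scikit-learn releases require perplexity < n_samples (6 here),
# and the default perplexity is 30, so set it explicitly.
tsne = TSNE(perplexity = 5).fit_transform(r)
tsne.shape
(6, 2)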
plt.figure(figsize = (7, 7))
plt.scatter(tsne[:, 0], tsne[:, 1])
labels = texts
# Annotate each point with its sentence.
for label, x, y in zip(
    labels, tsne[:, 0], tsne[:, 1]
):
    label = (
        '%s, %.3f' % (label[0], label[1])
        if isinstance(label, list)
        else label
    )
    plt.annotate(
        label,
        xy = (x, y),
        xytext = (0, 0),
        textcoords = 'offset points',
    )
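
The same vectors can also be compared directly instead of plotted. A minimal sketch using cosine similarity, assuming each row returned by `vectorize` is a fixed-length embedding. The toy vectors below stand in for real model output, and `cosine_similarity` is a local helper, not a Malaya function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors standing in for rows of `r` (assumption for illustration).
vectors = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [0.0, 0.0, 3.0]]
print(cosine_similarity(vectors[0], vectors[1]))  # parallel vectors, close to 1
print(cosine_similarity(vectors[0], vectors[2]))  # orthogonal vectors, 0
```

Scores from this helper are not interchangeable with `predict_proba` output, which comes from the model's classification head rather than from vector geometry.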

Stacking models#

For more information, see the Malaya stacking documentation.

If you want to stack semantic similarity models, you need to pass the right-hand strings using the strings_right parameter:

malaya.stack.predict_stack([model1, model2], List[str], strings_right = List[str])

strings_right is passed through as **kwargs.

alxlnet = malaya.similarity.semantic.transformer(model = 'alxlnet')
albert = malaya.similarity.semantic.transformer(model = 'albert')
tiny_bert = malaya.similarity.semantic.transformer(model = 'tiny-bert')
malaya.stack.predict_stack([alxlnet, albert, tiny_bert], [string1, string2, news1, news1],
                           strings_right = [string3, string4, tweet1, string1])
array([0.9968965 , 0.17514098, 0.11507297, 0.01998391], dtype=float32)
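
To make the stacked output easier to interpret, here is a sketch of combining per-model probabilities with a geometric mean. That `predict_stack` aggregates this way by default is an assumption here, so check the stacking documentation for the actual behaviour. `stack_geometric_mean` is a hypothetical helper and the scores are toy values.

```python
import math


def stack_geometric_mean(per_model_probs):
    """Combine scores column-wise: one inner list per model, one column per pair."""
    n_models = len(per_model_probs)
    combined = []
    for column in zip(*per_model_probs):
        # Geometric mean: n-th root of the product of the n model scores.
        combined.append(math.prod(column) ** (1.0 / n_models))
    return combined


# Toy scores from two hypothetical models for two string pairs.
stacked = stack_geometric_mean([[0.9, 0.1], [0.4, 0.9]])
print(stacked)
```

A geometric mean pulls the result toward the lowest score, so a single confident "not similar" vote drags the stacked probability down more than an arithmetic mean would.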