Semantic Similarity HuggingFace#

This tutorial is available as an IPython notebook at Malaya/example/semantic-similarity-huggingface.

This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.

[1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'
[2]:
import logging

logging.basicConfig(level=logging.INFO)
[3]:
%%time
import malaya
CPU times: user 3.11 s, sys: 3.56 s, total: 6.67 s
Wall time: 2.22 s
/home/husein/dev/malaya/malaya/tokenizer.py:208: FutureWarning: Possible nested set at position 3372
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/husein/dev/malaya/malaya/tokenizer.py:208: FutureWarning: Possible nested set at position 3890
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
[4]:
string1 = 'Pemuda mogok lapar desak kerajaan prihatin isu iklim'
string2 = 'Perbincangan isu pembalakan perlu babit kerajaan negeri'
string3 = 'kerajaan perlu kisah isu iklim, pemuda mogok lapar'
string4 = 'Kerajaan dicadang tubuh jawatankuasa khas tangani isu alam sekitar'
[5]:
news1 = 'Tun Dr Mahathir Mohamad mengakui pembubaran Parlimen bagi membolehkan pilihan raya diadakan tidak sesuai dilaksanakan pada masa ini berikutan isu COVID-19'
tweet1 = 'DrM sembang pilihan raya tak boleh buat sebab COVID 19'

List available HuggingFace models#

[6]:
malaya.similarity.semantic.available_huggingface()
INFO:malaya.similarity.semantic:tested on matched dev set translated MNLI, https://huggingface.co/datasets/mesolitica/translated-MNLI
[6]:
Model                                                          Size (MB)  macro precision  macro recall  macro f1-score
mesolitica/finetune-mnli-t5-super-tiny-standard-bahasa-cased        50.7          0.88756         0.887         0.88727
mesolitica/finetune-mnli-t5-tiny-standard-bahasa-cased             139.0          0.88756         0.887         0.88727
mesolitica/finetune-mnli-t5-small-standard-bahasa-cased            242.0          0.88756         0.887         0.88727
mesolitica/finetune-mnli-t5-base-standard-bahasa-cased             892.0          0.88756         0.887         0.88727

Load HuggingFace model#

def huggingface(model: str = 'mesolitica/finetune-mnli-t5-small-standard-bahasa-cased', **kwargs):
    """
    Load HuggingFace model to calculate semantic similarity between 2 sentences.

    Parameters
    ----------
    model: str, optional (default='mesolitica/finetune-mnli-t5-small-standard-bahasa-cased')
        Check available models at `malaya.similarity.semantic.available_huggingface()`.

    Returns
    -------
    result: malaya.torch_model.huggingface.Similarity
    """
[7]:
model = malaya.similarity.semantic.huggingface()

Predict batch of strings with probability#

def predict_proba(self, strings_left: List[str], strings_right: List[str]):
        """
        calculate similarity for two different batch of texts.

        Parameters
        ----------
        strings_left : List[str]
        strings_right : List[str]

        Returns
        -------
        list: List[float]
        """

You need to give a list of left strings and a list of right strings.

The first left string will be compared with the first right string, and so on.

The similarity model only supports `predict_proba`.

[8]:
model.predict_proba([string1, string2, news1, news1], [string3, string4, tweet1, string1])
[8]:
array([0.9973929 , 0.00111997, 0.5448353 , 0.0183536 ], dtype=float32)
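Because `predict_proba` takes paired left/right batches, comparing one query against many candidates just means repeating the query on the left side. A small helper can build the pairs; `pair_query` below is a hypothetical convenience function, not part of Malaya:

```python
def pair_query(query, candidates):
    """Build paired left/right batches so a single query is compared
    against every candidate in one predict_proba call."""
    lefts = [query] * len(candidates)
    return lefts, list(candidates)

# usage with the model loaded above:
# lefts, rights = pair_query(string1, [string2, string3, string4])
# model.predict_proba(lefts, rights)
```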

Able to infer mixed MS and EN#

[18]:
en = 'Youth on hunger strike urge the government to be concerned about the climate issue'
en2 = 'the end of the wrld, global warming!'
[20]:
model.predict_proba([string1, string1], [en2, en])
[20]:
array([0.01690625, 0.9966125 ], dtype=float32)
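The same pairing idea extends to an all-pairs comparison: flatten every (left, right) combination into one batch, call `predict_proba` once, then reshape the flat scores into an N x N matrix. This is a sketch of that pattern; `similarity_matrix` is a hypothetical helper, not a Malaya API:

```python
import numpy as np

def similarity_matrix(predict_proba, strings):
    """Compute an N x N similarity matrix with a single batched call.

    predict_proba is assumed to behave like model.predict_proba above:
    it takes (strings_left, strings_right) and returns one score per pair.
    """
    n = len(strings)
    # every string on the left repeated n times, paired with the full
    # list on the right, so row i / column j holds score(strings[i], strings[j])
    lefts = [s for s in strings for _ in range(n)]
    rights = list(strings) * n
    scores = np.asarray(predict_proba(lefts, rights), dtype=np.float32)
    return scores.reshape(n, n)

# usage: similarity_matrix(model.predict_proba, [string1, string3, string4])
```

Note this issues n*n comparisons, so for large N you may want to batch the flattened pairs in chunks rather than in one call.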