Semantic Similarity

This tutorial is available as an IPython notebook at Malaya/example/similarity-semantic.

This module was trained on both standard and local (including social media) language structures, so it is safe to use for both.

[1]:

import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

[2]:

import logging

logging.basicConfig(level=logging.INFO)

[3]:

%%time
import malaya

/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "

/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32

INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmp_ntnvpdo
INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmp_ntnvpdo/_remote_module_non_scriptable.py

CPU times: user 2.81 s, sys: 3.89 s, total: 6.7 s
Wall time: 1.94 s

/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))

[4]:

string1 = 'Pemuda mogok lapar desak kerajaan prihatin isu iklim'
string2 = 'Perbincangan isu pembalakan perlu babit kerajaan negeri'
string3 = 'kerajaan perlu kisah isu iklim, pemuda mogok lapar'
string4 = 'Kerajaan dicadang tubuh jawatankuasa khas tangani isu alam sekitar'

[5]:

news1 = 'Tun Dr Mahathir Mohamad mengakui pembubaran Parlimen bagi membolehkan pilihan raya diadakan tidak sesuai dilaksanakan pada masa ini berikutan isu COVID-19'
tweet1 = 'DrM sembang pilihan raya tak boleh buat sebab COVID 19'

List available HuggingFace models

[6]:

malaya.similarity.semantic.available_huggingface

[6]:

{'mesolitica/finetune-mnli-t5-super-tiny-standard-bahasa-cased': {'Size (MB)': 50.7,
  'macro precision': 0.74562,
  'macro recall': 0.74574,
  'macro f1-score': 0.74501},
 'mesolitica/finetune-mnli-t5-tiny-standard-bahasa-cased': {'Size (MB)': 139,
  'macro precision': 0.76584,
  'macro recall': 0.76565,
  'macro f1-score': 0.76542},
 'mesolitica/finetune-mnli-t5-small-standard-bahasa-cased': {'Size (MB)': 242,
  'macro precision': 0.78067,
  'macro recall': 0.78063,
  'macro f1-score': 0.7801},
 'mesolitica/finetune-mnli-t5-base-standard-bahasa-cased': {'Size (MB)': 892,
  'macro precision': 0.78903,
  'macro recall': 0.79064,
  'macro f1-score': 0.78918}}
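
The listing above trades model size against accuracy. As a minimal sketch of choosing a model programmatically, assuming the dictionary keeps the shape printed above: the sizes and F1 scores below are copied from that output, while `pick_model` and `max_size_mb` are hypothetical names used for illustration and are not part of Malaya.

```python
# Sizes and macro F1 scores copied from the output above. In a live session,
# malaya.similarity.semantic.available_huggingface could be passed instead.
available = {
    'mesolitica/finetune-mnli-t5-super-tiny-standard-bahasa-cased': {
        'Size (MB)': 50.7, 'macro f1-score': 0.74501},
    'mesolitica/finetune-mnli-t5-tiny-standard-bahasa-cased': {
        'Size (MB)': 139, 'macro f1-score': 0.76542},
    'mesolitica/finetune-mnli-t5-small-standard-bahasa-cased': {
        'Size (MB)': 242, 'macro f1-score': 0.7801},
    'mesolitica/finetune-mnli-t5-base-standard-bahasa-cased': {
        'Size (MB)': 892, 'macro f1-score': 0.78918},
}


def pick_model(models, max_size_mb):
    """Hypothetical helper: name of the highest-F1 model within a size budget."""
    candidates = {
        name: stats for name, stats in models.items()
        if stats['Size (MB)'] <= max_size_mb
    }
    if not candidates:
        raise ValueError(f'no model is smaller than {max_size_mb} MB')
    return max(candidates, key=lambda name: candidates[name]['macro f1-score'])


print(pick_model(available, max_size_mb=250))
```

The returned name can then be passed as the `model` argument when loading.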

[7]:

print(malaya.similarity.semantic.info)

tested on matched dev set translated MNLI, https://huggingface.co/datasets/mesolitica/translated-MNLI

Load HuggingFace model

def huggingface(
    model: str = 'mesolitica/finetune-mnli-t5-small-standard-bahasa-cased',
    force_check: bool = True,
    **kwargs,
):
    """
    Load HuggingFace model to calculate semantic similarity between 2 sentences.

    Parameters
    ----------
    model: str, optional (default='mesolitica/finetune-mnli-t5-small-standard-bahasa-cased')
        Check available models at `malaya.similarity.semantic.available_huggingface`.
    force_check: bool, optional (default=True)
        Check that the model is one of the Malaya models.
        Set to False if you have your own HuggingFace model.

    Returns
    -------
    result: malaya.torch_model.huggingface.Similarity
    """

[ ]:

model = malaya.similarity.semantic.huggingface()

Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`,  it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again. You will see the new `added_tokens_decoder` attribute that will store the relevant information.

Predict a batch of strings with probability

def predict_proba(self, strings_left: List[str], strings_right: List[str]):
    """
    Calculate similarity for two batches of texts.

    Parameters
    ----------
    strings_left : List[str]
    strings_right : List[str]

    Returns
    -------
    list: List[float]
    """

You need to give a list of left strings and a list of right strings.

The first left string is compared with the first right string, the second with the second, and so on.

The similarity model only supports predict_proba.
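
The index-by-index pairing described above can be sketched without loading the model. This is an illustration only: `pair_up` is a hypothetical helper, not a Malaya function, and the equal-length check is an assumption implied by the pairing rule rather than documented behaviour.

```python
from typing import List, Tuple


def pair_up(strings_left: List[str], strings_right: List[str]) -> List[Tuple[str, str]]:
    """Hypothetical helper: list which left string is compared with which right string."""
    # Assumption: both lists must have the same length, since item i on the
    # left is compared with item i on the right.
    if len(strings_left) != len(strings_right):
        raise ValueError('strings_left and strings_right must have the same length')
    return list(zip(strings_left, strings_right))


pairs = pair_up(['left 0', 'left 1'], ['right 0', 'right 1'])
for left, right in pairs:
    print(left, '<->', right)
```

Each returned pair corresponds to one float in the list that predict_proba returns.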

[ ]:

model.predict_proba([string1, string2, news1, news1], [string3, string4, tweet1, string1])
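
The call above scores fixed pairs. To compare one query against several candidates, the query can be repeated on the left side. The sketch below assumes only the `predict_proba` signature shown earlier; `rank_candidates` is a hypothetical helper, and `word_overlap` is a stand-in scorer so the example runs without downloading a model. It is not the Malaya model and its scores carry no meaning beyond the demonstration.

```python
from typing import Callable, List, Tuple

ScoreFn = Callable[[List[str], List[str]], List[float]]


def rank_candidates(query: str, candidates: List[str], score_fn: ScoreFn) -> List[Tuple[str, float]]:
    """Hypothetical helper: score the query against each candidate, best first."""
    lefts = [query] * len(candidates)  # repeat the query so the pairs align by index
    scores = score_fn(lefts, candidates)
    return sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)


def word_overlap(lefts: List[str], rights: List[str]) -> List[float]:
    """Stand-in scorer (Jaccard overlap of lowercased words), NOT the Malaya model."""
    scores = []
    for left, right in zip(lefts, rights):
        a, b = set(left.lower().split()), set(right.lower().split())
        union = a | b
        scores.append(len(a & b) / len(union) if union else 0.0)
    return scores


ranked = rank_candidates(
    'isu iklim',
    ['pilihan raya', 'kerajaan perlu kisah isu iklim'],
    word_overlap,
)
for text, score in ranked:
    print(round(score, 2), text)
```

With the model loaded as above, `model.predict_proba` would be passed in place of `word_overlap`.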

Able to infer on mixed Malay (MS) and English (EN)

[ ]:

en = 'Youth on hunger strike urge the government to be concerned about the climate issue'
en2 = 'the end of the wrld, global warming!'

[ ]:

model.predict_proba([string1, string1], [en2, en])