Semantic Similarity#
This tutorial is available as an IPython notebook at Malaya/example/similarity-semantic.
This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.
[1]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'
[2]:
import logging
logging.basicConfig(level=logging.INFO)
[3]:
%%time
import malaya
/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
CPU times: user 3.19 s, sys: 2.59 s, total: 5.79 s
Wall time: 2.76 s
/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397
self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927
self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
[4]:
string1 = 'Pemuda mogok lapar desak kerajaan prihatin isu iklim'
string2 = 'Perbincangan isu pembalakan perlu babit kerajaan negeri'
string3 = 'kerajaan perlu kisah isu iklim, pemuda mogok lapar'
string4 = 'Kerajaan dicadang tubuh jawatankuasa khas tangani isu alam sekitar'
[5]:
news1 = 'Tun Dr Mahathir Mohamad mengakui pembubaran Parlimen bagi membolehkan pilihan raya diadakan tidak sesuai dilaksanakan pada masa ini berikutan isu COVID-19'
tweet1 = 'DrM sembang pilihan raya tak boleh buat sebab COVID 19'
List available HuggingFace models#
[6]:
malaya.similarity.semantic.available_huggingface
[6]:
{'mesolitica/finetune-mnli-nanot5-small': {'Size (MB)': 148,
'macro precision': 0.87125,
'macro recall': 0.87131,
'macro f1-score': 0.87127},
'mesolitica/finetune-mnli-nanot5-base': {'Size (MB)': 892,
'macro precision': 0.78903,
'macro recall': 0.79064,
'macro f1-score': 0.78918}}
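The listing above is a plain dictionary, so it can be inspected programmatically. As a minimal sketch, the snippet below copies the values shown in the output into a local dictionary (instead of importing malaya) and picks the entry with the highest macro F1-score. The helper name `best_model` is hypothetical and not part of Malaya.

```python
# Values copied from the output above. In a live session this would be
# malaya.similarity.semantic.available_huggingface instead.
available = {
    'mesolitica/finetune-mnli-nanot5-small': {
        'Size (MB)': 148,
        'macro precision': 0.87125,
        'macro recall': 0.87131,
        'macro f1-score': 0.87127,
    },
    'mesolitica/finetune-mnli-nanot5-base': {
        'Size (MB)': 892,
        'macro precision': 0.78903,
        'macro recall': 0.79064,
        'macro f1-score': 0.78918,
    },
}


def best_model(models, metric='macro f1-score'):
    """Return the model name with the highest value for `metric` (hypothetical helper)."""
    return max(models, key=lambda name: models[name][metric])


print(best_model(available))
```

The returned name can then be passed as the `model` argument when loading a model.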
[7]:
print(malaya.similarity.semantic.info)
tested on matched dev set translated MNLI, https://huggingface.co/datasets/mesolitica/translated-MNLI
Load HuggingFace model#
def huggingface(
model: str = 'mesolitica/finetune-mnli-t5-small-standard-bahasa-cased',
force_check: bool = True,
**kwargs,
):
"""
Load HuggingFace model to calculate semantic similarity between 2 sentences.
Parameters
----------
model: str, optional (default='mesolitica/finetune-mnli-t5-small-standard-bahasa-cased')
Check available models at `malaya.similarity.semantic.available_huggingface()`.
force_check: bool, optional (default=True)
Force check that the model is one of the Malaya models.
Set to False if you have your own HuggingFace model.
Returns
-------
result: malaya.torch_model.huggingface.Similarity
"""
[8]:
model = malaya.similarity.semantic.huggingface()
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Predict batch of strings with probability#
def predict_proba(self, strings_left: List[str], strings_right: List[str]):
"""
Calculate similarity for two different batches of texts.
Parameters
----------
strings_left : List[str]
strings_right : List[str]
Returns
-------
list: List[float]
"""
You need to provide a list of left strings and a list of right strings.
The first left string is compared with the first right string, and so on.
The similarity model only supports predict_proba.
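Because strings are paired by position, comparing one query against many candidates means repeating the query once per candidate. The sketch below illustrates this with a hypothetical helper, `rank_candidates`, that accepts any function with the same call shape as `predict_proba`. The `toy_predict_proba` scorer is a word-overlap stand-in used only so the example runs without the model. It is not how Malaya computes similarity.

```python
from typing import Callable, List, Sequence, Tuple


def rank_candidates(
    predict_proba: Callable[[List[str], List[str]], Sequence[float]],
    query: str,
    candidates: List[str],
) -> List[Tuple[str, float]]:
    """Score `query` against every candidate, sorted by descending score."""
    # Strings are paired by position, so the query is repeated once per
    # candidate to keep the left and right lists the same length.
    lefts = [query] * len(candidates)
    scores = predict_proba(lefts, candidates)
    pairs = [(c, float(s)) for c, s in zip(candidates, scores)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def toy_predict_proba(lefts: List[str], rights: List[str]) -> List[float]:
    """Stand-in scorer (word-overlap ratio) for illustration only."""
    out = []
    for left, right in zip(lefts, rights):
        a, b = set(left.lower().split()), set(right.lower().split())
        out.append(len(a & b) / len(a | b) if (a | b) else 0.0)
    return out


ranked = rank_candidates(
    toy_predict_proba, 'isu iklim', ['pilihan raya', 'isu iklim dunia']
)
print(ranked[0][0])
```

With a loaded model, `model.predict_proba` would be passed in place of `toy_predict_proba`.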
[9]:
model.predict_proba([string1, string2, news1, news1], [string3, string4, tweet1, string1])
You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[9]:
array([0.9980286 , 0.02898298, 0.79173875, 0.9694676 ], dtype=float32)
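The array above holds one probability per pair. To turn these into yes/no decisions, a cut-off can be applied. The sketch below reuses the values from the output, and the 0.5 threshold is an assumption for illustration, not a value recommended by Malaya. The helper name `to_labels` is hypothetical.

```python
from typing import List, Sequence


def to_labels(scores: Sequence[float], threshold: float = 0.5) -> List[bool]:
    """Mark each pair as similar when its probability reaches the threshold."""
    return [float(s) >= threshold for s in scores]


# Probabilities copied from the predict_proba output above.
scores = [0.9980286, 0.02898298, 0.79173875, 0.9694676]

# 0.5 is an assumed cut-off; tune it on your own labelled pairs.
labels = to_labels(scores, threshold=0.5)
print(labels)
```

A stricter threshold trades recall for precision, so the third pair (news1 against tweet1, about 0.79) would flip to not similar at a cut-off of 0.8.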
Able to infer for mixed MS and EN#
[10]:
en = 'Youth on hunger strike urge the government to be concerned about the climate issue'
en2 = 'the end of the wrld, global warming!'
[11]:
model.predict_proba([string1, string1], [en2, en])
[11]:
array([0.00405155, 0.7731996 ], dtype=float32)