Semantic Similarity#
This tutorial is available as an IPython notebook at Malaya/example/semantic-similarity.
This module was trained on both standard and local (including social media) language structures, so it is safe to use for both.
[1]:
import logging
logging.basicConfig(level=logging.INFO)
[2]:
%%time
import malaya
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
CPU times: user 5.71 s, sys: 1.1 s, total: 6.81 s
Wall time: 7.79 s
[2]:
string1 = 'Pemuda mogok lapar desak kerajaan prihatin isu iklim'
string2 = 'Perbincangan isu pembalakan perlu babit kerajaan negeri'
string3 = 'kerajaan perlu kisah isu iklim, pemuda mogok lapar'
string4 = 'Kerajaan dicadang tubuh jawatankuasa khas tangani isu alam sekitar'
[3]:
news1 = 'Tun Dr Mahathir Mohamad mengakui pembubaran Parlimen bagi membolehkan pilihan raya diadakan tidak sesuai dilaksanakan pada masa ini berikutan isu COVID-19'
tweet1 = 'DrM sembang pilihan raya tak boleh buat sebab COVID 19'
List available Transformer models#
[3]:
malaya.similarity.available_transformer()
INFO:malaya.similarity:trained on 80% dataset, tested on another 20% test set, dataset at https://github.com/huseinzol05/Malay-Dataset/tree/master/text-similarity
[3]:
| model | Size (MB) | Quantized Size (MB) | macro precision | macro recall | macro f1-score |
|---|---|---|---|---|---|
| bert | 423.4 | 111.0 | 0.88315 | 0.88656 | 0.88405 |
| tiny-bert | 56.6 | 15.0 | 0.87210 | 0.87546 | 0.87292 |
| albert | 48.3 | 12.8 | 0.87164 | 0.87146 | 0.87155 |
| tiny-albert | 21.9 | 6.0 | 0.82234 | 0.82383 | 0.82295 |
| xlnet | 448.7 | 119.0 | 0.80866 | 0.76775 | 0.77112 |
| alxlnet | 49.0 | 13.9 | 0.88756 | 0.88700 | 0.88727 |
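The table above can also be read programmatically: available_transformer() appears to return the benchmark table as a pandas DataFrame (as the tabular output suggests), so you can pick an architecture by metric instead of by eye. A minimal sketch, assuming a DataFrame indexed by model name with a macro f1-score column:

df = malaya.similarity.available_transformer()
# pick the architecture with the highest macro f1-score
best = df['macro f1-score'].astype(float).idxmax()
print(best)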
Load transformer model#
def transformer(model: str = 'bert', quantized: bool = False, **kwargs):
    """
    Load Transformer similarity model.

    Parameters
    ----------
    model : str, optional (default='bert')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.
    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result: model
        List of model classes:

        * if `bert` in model, will return `malaya.model.bert.SiameseBERT`.
        * if `xlnet` in model, will return `malaya.model.xlnet.SiameseXLNET`.
    """
[5]:
model = malaya.similarity.transformer(model = 'alxlnet')
Load Quantized model#
To load an 8-bit quantized model, simply pass quantized = True; the default is False.

We can expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it depends entirely on the machine.
[6]:
quantized_model = malaya.similarity.transformer(model = 'alxlnet', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
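Whether the quantized model is actually faster depends entirely on your machine, so it is worth timing both on the same batch. A minimal sketch, assuming model and quantized_model are loaded as above:

import time

left = [string1, string2, news1, news1]
right = [string3, string4, tweet1, string1]

for name, m in [('float32', model), ('quantized', quantized_model)]:
    start = time.perf_counter()
    m.predict_proba(left, right)
    print(name, round(time.perf_counter() - start, 3), 'seconds')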
predict batch of strings with probability#
def predict_proba(self, strings_left: List[str], strings_right: List[str]):
    """
    calculate similarity for two different batch of texts.

    Parameters
    ----------
    strings_left : List[str]
    strings_right : List[str]

    Returns
    -------
    result : List[float]
    """
You need to give a list of left strings and a list of right strings; the first left string will be compared with the first right string, and so on.

The similarity model only supports predict_proba.
[7]:
model.predict_proba([string1, string2, news1, news1], [string3, string4, tweet1, string1])
[7]:
array([0.99828064, 0.01076903, 0.9603669 , 0.9075881 ], dtype=float32)
[8]:
quantized_model.predict_proba([string1, string2, news1, news1], [string3, string4, tweet1, string1])
[8]:
array([0.9987801 , 0.00554545, 0.8729592 , 0.49839294], dtype=float32)
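predict_proba only returns raw probabilities, so deciding what counts as similar is up to you. A minimal sketch with an arbitrary 0.5 cutoff (the threshold is an assumption, not something prescribed by Malaya):

probs = model.predict_proba([string1, string2], [string3, string4])
labels = ['similar' if p >= 0.5 else 'not similar' for p in probs]
print(list(zip(probs, labels)))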
visualize heatmap#
def heatmap(
    self,
    strings: List[str],
    visualize: bool = True,
    annotate: bool = True,
    figsize: Tuple[int, int] = (7, 7),
):
    """
    plot a heatmap based on output from similarity

    Parameters
    ----------
    strings : list of str
        list of strings.
    visualize : bool
        if True, it will render plt.show, else return data.
    figsize : tuple, (default=(7, 7))
        figure size for plot.

    Returns
    -------
    result: list
        list of results
    """
[9]:
model.heatmap([string1, string2, string3, string4])
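If you want the underlying similarity values instead of the rendered plot, the docstring above indicates you can pass visualize = False to get the data back. A minimal sketch:

data = model.heatmap([string1, string2, string3, string4], visualize = False)
print(data)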
Vectorize#
Let's say you want to visualize sentences in a lower dimension; you can use model.vectorize,
def vectorize(self, strings: List[str]):
    """
    Vectorize list of strings.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: np.array
    """
[10]:
texts = [string1, string2, string3, string4, news1, tweet1]
r = quantized_model.vectorize(texts)
[11]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
tsne = TSNE().fit_transform(r)
tsne.shape
[11]:
(6, 2)
[12]:
plt.figure(figsize = (7, 7))
plt.scatter(tsne[:, 0], tsne[:, 1])
labels = texts
for label, x, y in zip(
    labels, tsne[:, 0], tsne[:, 1]
):
    label = (
        '%s, %.3f' % (label[0], label[1])
        if isinstance(label, list)
        else label
    )
    plt.annotate(
        label,
        xy = (x, y),
        xytext = (0, 0),
        textcoords = 'offset points',
    )
Stacking models#
For more information, you can read https://malaya.readthedocs.io/en/latest/Stack.html

If you want to stack semantic similarity models, you need to pass the right batch of strings using the strings_right parameter,
malaya.stack.predict_stack([model1, model2], List[str], strings_right = List[str])
We will pass strings_right as **kwargs.
[13]:
alxlnet = malaya.similarity.transformer(model = 'alxlnet')
albert = malaya.similarity.transformer(model = 'albert')
tiny_bert = malaya.similarity.transformer(model = 'tiny-bert')
[14]:
malaya.stack.predict_stack([alxlnet, albert, tiny_bert], [string1, string2, news1, news1],
                           strings_right = [string3, string4, tweet1, string1])
[14]:
array([0.9968965 , 0.17514098, 0.11507297, 0.01998391], dtype=float32)
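Conceptually, stacking aggregates the per-model probabilities into a single score. A minimal sketch of doing the aggregation by hand with a geometric mean; this is only an illustration and may not match the exact aggregation malaya.stack.predict_stack uses internally:

import numpy as np
from scipy.stats import gmean

left = [string1, string2, news1, news1]
right = [string3, string4, tweet1, string1]

# stack each model's probabilities into a (n_models, n_pairs) matrix
preds = np.vstack([m.predict_proba(left, right) for m in [alxlnet, albert, tiny_bert]])
print(gmean(preds, axis = 0))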