Relevancy Analysis

This tutorial is available as an IPython notebook at Malaya/example/relevancy.

This module is only trained on standard (formal) language structure, so it is not safe to use on local/colloquial language structure.

[1]:
%%time
import malaya
CPU times: user 5.51 s, sys: 1.08 s, total: 6.59 s
Wall time: 7.61 s

Explanation

Positive relevancy: the article or piece of text is relevant; it is unlikely to be fake news. It can carry either positive or negative sentiment.

Negative relevancy: the article or piece of text is not relevant; it is likely to be fake news. It can carry either positive or negative sentiment.

Right now the relevancy module only supports deep learning models.

[2]:
negative_text = 'Roti Massimo Mengandungi DNA Babi. Roti produk Massimo keluaran Syarikat The Italian Baker mengandungi DNA babi. Para pengguna dinasihatkan supaya tidak memakan produk massimo. Terdapat pelbagai produk roti keluaran syarikat lain yang boleh dimakan dan halal. Mari kita sebarkan berita ini supaya semua rakyat Malaysia sedar dengan apa yang mereka makna setiap hari. Roti tidak halal ada DNA babi jangan makan ok.'
positive_text = 'Jabatan Kemajuan Islam Malaysia memperjelaskan dakwaan sebuah mesej yang dikitar semula, yang mendakwa kononnya kod E dikaitkan dengan kandungan lemak babi sepertimana yang tular di media sosial. . Tular: November 2017 . Tular: Mei 2014 JAKIM ingin memaklumkan kepada masyarakat berhubung maklumat yang telah disebarkan secara meluas khasnya melalui media sosial berhubung kod E yang dikaitkan mempunyai lemak babi. Untuk makluman, KOD E ialah kod untuk bahan tambah (aditif) dan ianya selalu digunakan pada label makanan di negara Kesatuan Eropah. Menurut JAKIM, tidak semua nombor E yang digunakan untuk membuat sesuatu produk makanan berasaskan dari sumber yang haram. Sehubungan itu, sekiranya sesuatu produk merupakan produk tempatan dan mendapat sijil Pengesahan Halal Malaysia, maka ia boleh digunakan tanpa was-was sekalipun mempunyai kod E-kod. Tetapi sekiranya produk tersebut bukan produk tempatan serta tidak mendapat sijil pengesahan halal Malaysia walaupun menggunakan e-kod yang sama, pengguna dinasihatkan agar berhati-hati dalam memilih produk tersebut.'

List available Transformer models

[3]:
malaya.relevancy.available_transformer()
INFO:root:tested on 20% test set.
[3]:
Model         Size (MB)   Quantized Size (MB)   Accuracy
bert          425.6       111.00                0.89740
tiny-bert     57.4        15.40                 0.87413
albert        48.6        12.80                 0.88268
tiny-albert   22.4        5.98                  0.82780
xlnet         446.6       118.00                0.92762
alxlnet       46.8        13.30                 0.91231

Make sure you check the accuracy chart here before selecting a model: https://malaya.readthedocs.io/en/latest/Accuracy.html#relevancy

You might want to use alxlnet: it is very small (46.8 MB), yet its accuracy is still top-notch.

Load ALXLNET model

All model interfaces follow the sklearn interface starting from v3.4:

model.predict(List[str])

model.predict_proba(List[str])
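
For example, a minimal sketch of the label interface (hypothetical usage; the label strings are taken from the predict_proba outputs shown later in this tutorial):

# assumes a loaded model, as in the next cell
labels = model.predict([negative_text, positive_text])
# one label per input string, e.g. ['not relevant', 'relevant']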
[4]:
model = malaya.relevancy.transformer(model = 'alxlnet')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:74: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:76: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:69: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

Load Quantized model

To load the 8-bit quantized model, simply pass quantized = True; the default is False.

Expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; speed depends entirely on the machine.

[15]:
quantized_model = malaya.relevancy.transformer(model = 'alxlnet', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.

Predict batch of strings

[6]:
%%time

model.predict_proba([negative_text, positive_text])
CPU times: user 5.26 s, sys: 931 ms, total: 6.19 s
Wall time: 3.41 s
[6]:
[{'not relevant': 0.9999548, 'relevant': 4.5165893e-05},
 {'not relevant': 1.2674857e-06, 'relevant': 0.9999987}]
[7]:
%%time

quantized_model.predict_proba([negative_text, positive_text])
CPU times: user 5.06 s, sys: 796 ms, total: 5.85 s
Wall time: 3.07 s
[7]:
[{'not relevant': 0.9999988, 'relevant': 1.1694748e-06},
 {'not relevant': 3.5770352e-06, 'relevant': 0.9999964}]

Open relevancy visualization dashboard

By default, calling predict_words opens a browser tab with the visualization dashboard; you can disable this with visualization = False.

[9]:
model.predict_words(negative_text)
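
To skip the browser dashboard and work with the raw output instead, a sketch using the visualization parameter described above (the exact return structure depends on the Malaya version):

result = model.predict_words(negative_text, visualization = False)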
[10]:
from IPython.core.display import Image, display

display(Image('relevancy-dashboard.png', width=800))
[image: relevancy visualization dashboard screenshot]

Vectorize

Let's say you want to visualize sentences or words in a lower dimension; you can use model.vectorize:

def vectorize(self, strings: List[str], method: str = 'first'):
    """
    vectorize list of strings.

    Parameters
    ----------
    strings: List[str]
    method : str, optional (default='first')
        Vectorization layer supported. Allowed values:

        * ``'last'`` - vector from last sequence.
        * ``'first'`` - vector from first sequence.
        * ``'mean'`` - average vectors from all sequences.
        * ``'word'`` - average vectors based on tokens.

    Returns
    -------
    result: np.array
    """

Sentence level

[8]:
texts = [negative_text, positive_text]
r = quantized_model.vectorize(texts, method = 'first')
[9]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

tsne = TSNE().fit_transform(r)
tsne.shape
[9]:
(2, 2)
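
Note: recent scikit-learn releases validate that perplexity < n_samples, so with only two sentences you may need to lower it explicitly:

# only needed on newer scikit-learn versions that enforce perplexity < n_samples
tsne = TSNE(perplexity = 1).fit_transform(r)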
[10]:
plt.figure(figsize = (7, 7))
plt.scatter(tsne[:, 0], tsne[:, 1])
labels = texts
for label, x, y in zip(
    labels, tsne[:, 0], tsne[:, 1]
):
    label = (
        '%s, %.3f' % (label[0], label[1])
        if isinstance(label, list)
        else label
    )
    plt.annotate(
        label,
        xy = (x, y),
        xytext = (0, 0),
        textcoords = 'offset points',
    )
[image: t-SNE scatter plot of the two sentence-level vectors]

Word level

[11]:
r = quantized_model.vectorize(texts, method = 'word')
[12]:
x, y = [], []
for row in r:
    # each row is a list of (token, vector) pairs returned by method = 'word'
    x.extend([i[0] for i in row])  # tokens
    y.extend([i[1] for i in row])  # per-token vectors
[13]:
tsne = TSNE().fit_transform(y)
tsne.shape
[13]:
(211, 2)
[14]:
plt.figure(figsize = (7, 7))
plt.scatter(tsne[:, 0], tsne[:, 1])
labels = x
for label, x, y in zip(
    labels, tsne[:, 0], tsne[:, 1]
):
    label = (
        '%s, %.3f' % (label[0], label[1])
        if isinstance(label, list)
        else label
    )
    plt.annotate(
        label,
        xy = (x, y),
        xytext = (0, 0),
        textcoords = 'offset points',
    )
[image: t-SNE scatter plot of word-level vectors with token labels]

Pretty good: the model is able to cluster the bottom-left group as positive relevancy.

Stacking models

For more information, you can read https://malaya.readthedocs.io/en/latest/Stack.html
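
Conceptually, stacking combines the probability outputs of several models for the same input. Below is a minimal sketch of geometric-mean stacking; it illustrates the idea only and is not necessarily Malaya's exact implementation:

import numpy as np

def gmean_stack(prob_dicts):
    # prob_dicts: one {label: probability} dict per model, all for the same input
    labels = prob_dicts[0].keys()
    n = len(prob_dicts)
    stacked = {l: float(np.prod([d[l] for d in prob_dicts]) ** (1 / n)) for l in labels}
    # renormalize so the combined probabilities sum to 1
    total = sum(stacked.values())
    return {l: p / total for l, p in stacked.items()}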

[ ]:
albert = malaya.relevancy.transformer(model = 'albert')
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/albert/tokenization.py:240: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

INFO:tensorflow:loading sentence piece model
[14]:
malaya.stack.predict_stack([albert, model], [positive_text, negative_text])
[14]:
[{'not relevant': 3.1056952e-06, 'relevant': 0.9999934},
 {'not relevant': 0.99982065, 'relevant': 3.868528e-05}]