Sentiment Analysis

This tutorial is available as an IPython notebook at Malaya/example/sentiment.

This module was trained on both standard and local (including social media) language structures, so it is safe to use for both.

[1]:
%%time
import malaya
CPU times: user 4.37 s, sys: 709 ms, total: 5.08 s
Wall time: 5.45 s
[2]:
string1 = 'Sis, students from overseas were brought back because they are not in their countries which is if something happens to them, its not the other countries’ responsibility. Student dalam malaysia ni dah dlm tggjawab kerajaan. Mana part yg tak faham?'
string2 = 'Harap kerajaan tak bukak serentak. Slowly release week by week. Focus on economy related industries dulu'
string3 = 'Idk if aku salah baca ke apa. Bayaran rm350 utk golongan umur 21 ke bawah shj ? Anyone? If 21 ke atas ok lah. If umur 21 ke bawah?  Are you serious? Siapa yg lebih byk komitmen? Aku hrp aku salah baca. Aku tk jumpa artikel tu'
string4 = 'Jabatan Penjara Malaysia diperuntukkan RM20 juta laksana program pembangunan Insan kepada banduan. Majikan yang menggaji bekas banduan, bekas penagih dadah diberi potongan cukai tambahan sehingga 2025.'

Load multinomial model

def multinomial(**kwargs):
    """
    Load multinomial sentiment model.

    Returns
    -------
    result : malaya.model.ml.Bayes class
    """
[3]:
model = malaya.sentiment.multinomial()

Predict batch of strings

def predict(self, strings: List[str], add_neutral: bool = True):
    """
    classify list of strings.

    Parameters
    ----------
    strings: List[str]
    add_neutral: bool, optional (default=True)
        if True, it will add neutral probability.

    Returns
    -------
    result: List[str]
    """
[4]:
model.predict([string1, string2])
[4]:
['neutral', 'neutral']

Disable neutral probability:

[ ]:
model.predict([string1, string2], add_neutral = False)

Predict batch of strings with probability

def predict_proba(self, strings: List[str], add_neutral: bool = True):
    """
    classify list of strings and return probability.

    Parameters
    ----------
    strings: List[str]
    add_neutral: bool, optional (default=True)
        if True, it will add neutral probability.

    Returns
    -------
    result: List[dict[str, float]]
    """
[5]:
model.predict_proba([string1, string2])
[5]:
[{'negative': 0.008213267932937583,
  'positive': 0.17867320670623799,
  'neutral': 0.8131135253608244},
 {'negative': 0.010098264096992408,
  'positive': 0.009901735903007554,
  'neutral': 0.98}]
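
Since predict_proba returns one dictionary per input string, the most likely label is simply the key with the largest value. A minimal sketch in plain Python follows; top_label is a hypothetical helper, not part of Malaya, and the sample dictionary is copied from the output above.

```python
from typing import Dict


def top_label(probabilities: Dict[str, float]) -> str:
    """Return the label with the highest probability (hypothetical helper)."""
    return max(probabilities, key=probabilities.get)


# First dictionary from the predict_proba output above.
sample = {
    'negative': 0.008213267932937583,
    'positive': 0.17867320670623799,
    'neutral': 0.8131135253608244,
}
print(top_label(sample))
```

For this sample the helper picks 'neutral', which agrees with what model.predict returned for string1.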

Disable neutral probability:

[6]:
model.predict_proba([string1, string2], add_neutral = False)
[6]:
[{'negative': 0.4106633966468791, 'positive': 0.589336603353119},
 {'negative': 0.5049132048496204, 'positive': 0.49508679515037773}]
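
Comparing the two multinomial outputs above suggests how the neutral class might be derived from the two-class probabilities. The sketch below is a reconstruction that happens to reproduce those numbers. The function name, the alpha floor of 0.01, and the rule itself are assumptions, not a confirmed description of Malaya's implementation.

```python
from typing import Dict


def add_neutral_sketch(probs: Dict[str, float], alpha: float = 1e-2) -> Dict[str, float]:
    """Derive a neutral score from class probabilities (assumed rule)."""
    share = 1.0 / len(probs)  # probability each class would get if undecided
    # Keep only the mass above the "undecided" share, floored at alpha * p,
    # then rescale by the share.
    out = {
        label: max(p - share, alpha * p) / share
        for label, p in probs.items()
    }
    # Whatever mass is left over is treated as neutral.
    out['neutral'] = 1.0 - sum(out.values())
    return out


# Two-class output for string1 with add_neutral = False, copied from above.
two_class = {'negative': 0.4106633966468791, 'positive': 0.589336603353119}
result = add_neutral_sketch(two_class)
print(result)
```

Under this rule, a prediction close to 50/50 ends up mostly neutral, which matches string2: its two-class scores are about 0.505 and 0.495, and its neutral score is 0.98.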

List available Transformer models

[8]:
malaya.sentiment.available_transformer()
INFO:root:tested on 20% test set.
[8]:
model        Size (MB)  Quantized Size (MB)  macro precision  macro recall  macro f1-score
bert         425.6      111.00               0.99330          0.99330       0.99329
tiny-bert    57.4       15.40                0.98774          0.98774       0.98774
albert       48.6       12.80                0.99227          0.99226       0.99226
tiny-albert  22.4       5.98                 0.98554          0.98550       0.98551
xlnet        446.6      118.00               0.99353          0.99353       0.99353
alxlnet      46.8       13.30                0.99188          0.99188       0.99188

Check the accuracy chart before selecting a model: https://malaya.readthedocs.io/en/latest/Accuracy.html#sentiment-analysis

You might want to use Tiny-Albert: it is very small, only 22.4 MB, yet its macro F1-score is still 0.98551.
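
The trade-off between size and accuracy can be made explicit by picking the best-scoring model that fits a size budget. The sketch below is plain Python over figures copied from the table above; MODELS and best_model_under are hypothetical names, not part of Malaya.

```python
from typing import Dict, Optional, Tuple

# (size in MB, macro f1-score) per model, copied from the table above.
MODELS: Dict[str, Tuple[float, float]] = {
    'bert': (425.6, 0.99329),
    'tiny-bert': (57.4, 0.98774),
    'albert': (48.6, 0.99226),
    'tiny-albert': (22.4, 0.98551),
    'xlnet': (446.6, 0.99353),
    'alxlnet': (46.8, 0.99188),
}


def best_model_under(budget_mb: float) -> Optional[str]:
    """Return the model with the highest macro F1 that fits the size budget."""
    candidates = {
        name: f1 for name, (size, f1) in MODELS.items() if size <= budget_mb
    }
    if not candidates:
        return None  # nothing fits the budget
    return max(candidates, key=candidates.get)


print(best_model_under(50))
print(best_model_under(25))
```

With a 50 MB budget this selects 'albert', and with a 25 MB budget it selects 'tiny-albert'. The returned name can then be passed as the model argument of malaya.sentiment.transformer.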

Load Transformer model

def transformer(model: str = 'bert', quantized: bool = False, **kwargs):
    """
    Load Transformer sentiment model.

    Parameters
    ----------
    model : str, optional (default='bert')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.
    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result : malaya.supervised.softmax.transformer function
    """
[14]:
model = malaya.sentiment.transformer(model = 'xlnet')

Load Quantized model

To load an 8-bit quantized model, simply pass quantized = True; the default is False.

Expect a slight accuracy drop from a quantized model. It is also not necessarily faster than the normal 32-bit float model; that depends entirely on the machine.

[10]:
quantized_model = malaya.sentiment.transformer(model = 'xlnet', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
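
To see why 8-bit quantization can cost a little accuracy, consider a generic affine quantization of float weights onto 256 levels. This only illustrates the general idea; it is not Malaya's quantization scheme, and the function names and sample weights are made up.

```python
from typing import List, Tuple


def quantize_uint8(weights: List[float]) -> Tuple[List[int], float, float]:
    """Map floats onto integer codes 0..255 (generic affine quantization)."""
    low, high = min(weights), max(weights)
    scale = (high - low) / 255
    if scale == 0.0:
        scale = 1.0  # all weights are equal; any positive step works
    codes = [round((w - low) / scale) for w in weights]
    return codes, scale, low


def dequantize(codes: List[int], scale: float, low: float) -> List[float]:
    """Recover approximate floats from the integer codes."""
    return [low + code * scale for code in codes]


weights = [-1.5, -0.2, 0.0, 0.37, 2.0]  # made-up sample weights
codes, scale, low = quantize_uint8(weights)
restored = dequantize(codes, scale, low)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, worst_error)
```

Each restored weight can be off by up to half a quantization step (scale / 2). Small per-weight errors of this kind are one reason a quantized model's predictions can differ slightly from the float model's.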

Predict batch of strings

def predict(self, strings: List[str], add_neutral: bool = True):
    """
    classify list of strings.

    Parameters
    ----------
    strings: List[str]
    add_neutral: bool, optional (default=True)
        if True, it will add neutral probability.

    Returns
    -------
    result: List[str]
    """
[12]:
%%time

model.predict([string1, string2])
CPU times: user 4.08 s, sys: 1.44 s, total: 5.51 s
Wall time: 4.67 s
[12]:
['positive', 'negative']
[13]:
%%time

quantized_model.predict([string1, string2])
CPU times: user 3.51 s, sys: 1.33 s, total: 4.84 s
Wall time: 3.8 s
[13]:
['positive', 'positive']
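
The float and quantized models above disagree on string2. On a larger sample, one simple check is the fraction of inputs for which both models return the same label. The sketch below is plain Python; agreement is a hypothetical helper, and the two lists are the outputs shown above.

```python
from typing import List


def agreement(labels_a: List[str], labels_b: List[str]) -> float:
    """Fraction of positions where two label lists match (hypothetical helper)."""
    if len(labels_a) != len(labels_b):
        raise ValueError('label lists must have the same length')
    if not labels_a:
        raise ValueError('label lists must not be empty')
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


float_labels = ['positive', 'negative']      # model.predict output above
quantized_labels = ['positive', 'positive']  # quantized_model.predict output above
print(agreement(float_labels, quantized_labels))
```

Here the two models agree on one of the two strings, so the helper returns 0.5.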

Predict batch of strings with probability

def predict_proba(self, strings: List[str], add_neutral: bool = True):
    """
    classify list of strings and return probability.

    Parameters
    ----------
    strings: List[str]
    add_neutral: bool, optional (default=True)
        if True, it will add neutral probability.

    Returns
    -------
    result: List[dict[str, float]]
    """
[7]:
%%time

model.predict_proba([string1, string2])
CPU times: user 5.05 s, sys: 2.6 s, total: 7.65 s
Wall time: 10.1 s
[7]:
[{'negative': 0.00032528088, 'positive': 0.96747196, 'neutral': 0.03220278},
 {'negative': 0.98301303, 'positive': 0.0001698712, 'neutral': 0.016817093}]
[5]:
%%time

quantized_model.predict_proba([string1, string2])
CPU times: user 1.64 s, sys: 387 ms, total: 2.03 s
Wall time: 1.43 s
[5]:
[{'negative': 0.0007685767, 'positive': 0.9231422, 'neutral': 0.0760892},
 {'negative': 8.198959e-06, 'positive': 0.9991802, 'neutral': 0.00081157684}]
[13]:
model.predict_proba([string1, string2], add_neutral = False)
[13]:
[{'negative': 0.029847767, 'positive': 0.97015226},
 {'negative': 0.1034979, 'positive': 0.89650214}]
[5]:
quantized_model.predict_proba([string1, string2], add_neutral = False)
[5]:
[{'negative': 0.004556194, 'positive': 0.9954438},
 {'negative': 0.07760632, 'positive': 0.9223937}]
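
A single %%time measurement can be noisy, so a fairer speed comparison between the float and quantized models repeats the call and takes the median. The helper below is a hypothetical sketch using only the standard library. The workload passed to it here is a stand-in; the commented lines show how it could wrap the calls used above.

```python
import statistics
import time
from typing import Callable


def median_wall_time(fn: Callable[[], object], repeats: int = 5) -> float:
    """Run fn several times and return the median wall-clock time in seconds."""
    if repeats < 1:
        raise ValueError('repeats must be at least 1')
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in workload so the sketch runs without any model loaded.
elapsed = median_wall_time(lambda: sum(range(100_000)), repeats=3)
print(f'{elapsed:.6f} s')

# With the models from this tutorial, the calls could look like:
# median_wall_time(lambda: model.predict_proba([string1, string2]))
# median_wall_time(lambda: quantized_model.predict_proba([string1, string2]))
```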

Open sentiment visualization dashboard

By default, calling predict_words opens a browser with a visualization dashboard; you can disable this with visualization=False.

[15]:
model.predict_words(string1)
[16]:
from IPython.display import Image, display

display(Image('sentiment-dashboard.png', width=800))
_images/load-sentiment_32_0.png

Vectorize

Say you want to visualize sentence-level or word-level vectors in a lower dimension; you can use model.vectorize:

def vectorize(self, strings: List[str], method: str = 'first'):
    """
    vectorize list of strings.

    Parameters
    ----------
    strings: List[str]
    method : str, optional (default='first')
        Vectorization layer supported. Allowed values:

        * ``'last'`` - vector from last sequence.
        * ``'first'`` - vector from first sequence.
        * ``'mean'`` - average vectors from all sequences.
        * ``'word'`` - average vectors based on tokens.

    Returns
    -------
    result: np.array
    """

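The 'first', 'last' and 'mean' methods differ only in how the per-position vectors of a sequence are reduced to one vector. The sketch below shows that reduction on plain Python lists. It illustrates the pooling idea only and is not Malaya's code; pool and the toy vectors are made up, and the 'word' method is not covered.

```python
from typing import List

Vector = List[float]


def pool(sequence: List[Vector], method: str = 'first') -> Vector:
    """Reduce a sequence of vectors to a single vector (illustrative only)."""
    if not sequence:
        raise ValueError('sequence must not be empty')
    if method == 'first':
        return list(sequence[0])   # vector at the first position
    if method == 'last':
        return list(sequence[-1])  # vector at the last position
    if method == 'mean':
        count = len(sequence)
        # Average each dimension across all positions.
        return [sum(column) / count for column in zip(*sequence)]
    raise ValueError(f'unsupported method: {method}')


toy_sequence = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # three positions, two dimensions
print(pool(toy_sequence, 'first'))
print(pool(toy_sequence, 'last'))
print(pool(toy_sequence, 'mean'))
```
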
Sentence level

[5]:
r = quantized_model.vectorize([string1, string2, string3, string4], method = 'first')
[6]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

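# Only 4 samples here; recent scikit-learn releases require perplexity < n_samples,
# so the default perplexity of 30 is lowered.
tsne = TSNE(perplexity = 2).fit_transform(r)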
tsne.shape
[6]:
(4, 2)
[7]:
plt.figure(figsize = (7, 7))
plt.scatter(tsne[:, 0], tsne[:, 1])
labels = [string1, string2, string3, string4]
for label, x, y in zip(
    labels, tsne[:, 0], tsne[:, 1]
):
    label = (
        '%s, %.3f' % (label[0], label[1])
        if isinstance(label, list)
        else label
    )
    plt.annotate(
        label,
        xy = (x, y),
        xytext = (0, 0),
        textcoords = 'offset points',
    )
_images/load-sentiment_37_0.png

Word level

[8]:
r = quantized_model.vectorize([string1, string2, string3, string4], method = 'word')
[9]:
x, y = [], []
for row in r:
    x.extend([i[0] for i in row])
    y.extend([i[1] for i in row])
[10]:
tsne = TSNE().fit_transform(y)
tsne.shape
[10]:
(129, 2)
[11]:
plt.figure(figsize = (7, 7))
plt.scatter(tsne[:, 0], tsne[:, 1])
labels = x
for label, x, y in zip(
    labels, tsne[:, 0], tsne[:, 1]
):
    label = (
        '%s, %.3f' % (label[0], label[1])
        if isinstance(label, list)
        else label
    )
    plt.annotate(
        label,
        xy = (x, y),
        xytext = (0, 0),
        textcoords = 'offset points',
    )
_images/load-sentiment_42_0.png

Pretty good: the model clusters positive sentiment at the top left and negative sentiment at the bottom right.

Stacking models

For more information, see https://malaya.readthedocs.io/en/latest/Stack.html

[5]:
multinomial = malaya.sentiment.multinomial()
alxlnet = malaya.sentiment.transformer(model = 'alxlnet')
[8]:
malaya.stack.predict_stack([multinomial, alxlnet, model], [string1, string2])
[8]:
[{'negative': 0.0005453552136673502,
  'positive': 0.5603020846001405,
  'neutral': 0.05399025419995675},
 {'negative': 0.0002248290781177622,
  'positive': 0.21361579430243546,
  'neutral': 0.022142383292097452}]

If you do not want neutral in predict_stack, simply override the parameter:

[9]:
malaya.stack.predict_stack([multinomial, alxlnet, model], [string1, string2], add_neutral = False)
[9]:
[{'negative': 0.05828375571937787, 'positive': 0.8221586003437801},
 {'negative': 0.014352668987571138, 'positive': 0.7835866999009022}]
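
Neither stacked output above sums to 1, which is the kind of result an element-wise geometric mean over the models' probabilities would give. The sketch below assumes that aggregate purely for illustration. Whether predict_stack uses it by default is not stated here, so check the Stack documentation linked above. The helper name and the sample numbers are made up.

```python
import math
from typing import Dict, List

Prediction = Dict[str, float]


def stack_geometric_mean(per_model: List[List[Prediction]]) -> List[Prediction]:
    """Combine predictions label by label with a geometric mean (assumed aggregate).

    per_model[m][i] is model m's probability dictionary for input string i.
    """
    stacked = []
    for per_string in zip(*per_model):  # one tuple of dicts per input string
        count = len(per_string)
        stacked.append({
            label: math.prod(p[label] for p in per_string) ** (1.0 / count)
            for label in per_string[0]
        })
    return stacked


# Made-up outputs of two models for a single input string.
model_a = [{'negative': 0.25, 'positive': 0.64}]
model_b = [{'negative': 1.0, 'positive': 0.36}]
combined = stack_geometric_mean([model_a, model_b])
print(combined)
```

A geometric mean pulls a label's score down sharply when any one model assigns it a low probability, so a label scores high only when the models broadly agree.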