Sentiment Analysis#

This tutorial is available as an IPython notebook at Malaya/example/sentiment.

This module was trained on both standard and local (including social media) language structures, so it is safe to use for both.

[1]:
import logging

logging.basicConfig(level=logging.INFO)
[2]:
%%time
import malaya
CPU times: user 3.21 s, sys: 3.58 s, total: 6.8 s
Wall time: 3.76 s
/home/husein/dev/malaya/malaya/tokenizer.py:208: FutureWarning: Possible nested set at position 3372
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/husein/dev/malaya/malaya/tokenizer.py:208: FutureWarning: Possible nested set at position 3890
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))

Labels supported#

Default labels for sentiment module.

[3]:
malaya.sentiment.label
[3]:
['negative', 'neutral', 'positive']

Example texts#

Copied and pasted from random tweets.

[6]:
string1 = 'Sis, students from overseas were brought back because they are not in their countries which is if something happens to them, its not the other countries’ responsibility. Student dalam malaysia ni dah dlm tggjawab kerajaan. Mana part yg tak faham?'
string2 = 'Harap kerajaan tak bukak serentak. Slowly release week by week. Focus on economy related industries dulu'
string3 = 'Idk if aku salah baca ke apa. Bayaran rm350 utk golongan umur 21 ke bawah shj ? Anyone? If 21 ke atas ok lah. If umur 21 ke bawah?  Are you serious? Siapa yg lebih byk komitmen? Aku hrp aku salah baca. Aku tk jumpa artikel tu'
string4 = 'Jabatan Penjara Malaysia diperuntukkan RM20 juta laksana program pembangunan Insan kepada banduan. Majikan yang menggaji bekas banduan, bekas penagih dadah diberi potongan cukai tambahan sehingga 2025.'
string5 = 'Dua Hari Nyaris Hatrick, Murai Batu Ceriwis Siap Meraikan Even Bekasi Bersatu!'
string6 = '@MasidiM Moga kerajaan sabah, tidak ikut pkp macam kerajaan pusat. Makin lama pkp, makin ramai hilang pekerjaan. Ti https://t.co/nSIABkkEDS'
string7 = 'Hopefully esok boleh ambil gambar dengan'

Load multinomial model#

def multinomial(**kwargs):
    """
    Load multinomial sentiment model.

    Returns
    -------
    result : malaya.model.ml.Bayes class
    """
[7]:
model = malaya.sentiment.multinomial()

Predict batch of strings#

def predict(self, strings: List[str]):
    """
    classify list of strings.

    Parameters
    ----------
    strings: List[str]

    Returns
    -------
    result: List[str]
    """
[9]:
model.predict([string1, string2, string3, string4, string5, string6, string7])
[9]:
['negative',
 'negative',
 'negative',
 'negative',
 'neutral',
 'negative',
 'positive']

Predict batch of strings with probability#

def predict_proba(self, strings: List[str]):
    """
    classify list of strings and return probability.

    Parameters
    ----------
    strings: List[str]

    Returns
    -------
    result: List[dict[str, float]]
    """
[10]:
model.predict_proba([string1, string2, string3, string4, string5, string6, string7])
[10]:
[{'negative': 0.5682437517478534,
  'neutral': 0.12373573801237056,
  'positive': 0.30802051023977634},
 {'negative': 0.490346226842691,
  'neutral': 0.20864503305657886,
  'positive': 0.30100874010073014},
 {'negative': 0.5569197801361142,
  'neutral': 0.1342498783611709,
  'positive': 0.3088303415027147},
 {'negative': 0.5165487021938855,
  'neutral': 0.13998199029917185,
  'positive': 0.34346930750694543},
 {'negative': 0.23311742560677587,
  'neutral': 0.4182488090323352,
  'positive': 0.3486337653608891},
 {'negative': 0.8494818936945382,
  'neutral': 0.060109943158198856,
  'positive': 0.0904081631472596},
 {'negative': 0.2922247908043552,
  'neutral': 0.3367232807540181,
  'positive': 0.3710519284416263}]
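
A minimal sketch of recovering the top label from a predict_proba result, using only plain Python:

probs = model.predict_proba([string1])[0]
max(probs, key = probs.get)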

List available Transformer models#

[4]:
malaya.sentiment.available_transformer()
INFO:malaya.sentiment:tested on test set at https://github.com/huseinzol05/malay-dataset/tree/master/sentiment/semisupervised-twitter-3class
[4]:
             Size (MB)  Quantized Size (MB)  macro precision  macro recall  macro f1-score
bert             425.6               111.00          0.93182       0.93442         0.93307
tiny-bert         57.4                15.40          0.93390       0.93141         0.93262
albert            48.6                12.80          0.91228       0.91929         0.91540
tiny-albert       22.4                 5.98          0.91442       0.91646         0.91521
xlnet            446.6               118.00          0.92390       0.92629         0.92444
alxlnet           46.8                13.30          0.91896       0.92589         0.92198
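
Based on the table, tiny-bert gives up less than 0.001 macro F1 against bert at roughly an eighth of the size, so it is a reasonable default when download size or memory matters. A hedged sketch using the same loader shown below:

small_model = malaya.sentiment.transformer(model = 'tiny-bert')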

Load Transformer model#

def transformer(model: str = 'bert', quantized: bool = False, **kwargs):
    """
    Load Transformer sentiment model.

    Parameters
    ----------
    model: str, optional (default='bert')
        Check available models at `malaya.sentiment.available_transformer()`.
    quantized: bool, optional (default=False)
        if True, will load the 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result: model
        List of model classes:

        * if `bert` in model, will return `malaya.model.bert.MulticlassBERT`.
        * if `xlnet` in model, will return `malaya.model.xlnet.MulticlassXLNET`.
        * if `fastformer` in model, will return `malaya.model.fastformer.MulticlassFastFormer`.
    """
[5]:
model = malaya.sentiment.transformer(model = 'xlnet')

Load Quantized model#

To load the 8-bit quantized model, simply pass quantized = True; the default is False.

We can expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it depends entirely on the machine.

[5]:
quantized_model = malaya.sentiment.transformer(model = 'xlnet', quantized = True)
WARNING:malaya_boilerplate.huggingface:Load quantized model will cause accuracy drop.
2022-10-12 10:55:07.109517: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO:malaya_boilerplate.frozen_graph:running home/husein/.cache/huggingface/hub/models--huseinzol05--sentiment-v2-xlnet-quantized/snapshots/f4fae31e82ef3fdb6be6923805ea3c2b53b22ff5 using device /device:GPU:0
2022-10-12 10:55:08.305226: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /device:GPU:0 with 889 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3090 Ti, pci bus id: 0000:01:00.0, compute capability: 8.6

Predict batch of strings#

def predict(self, strings: List[str]):
    """
    classify list of strings.

    Parameters
    ----------
    strings: List[str]

    Returns
    -------
    result: List[str]
    """
[15]:
%%time

model.predict([string1, string2, string3, string4, string5, string6, string7])
CPU times: user 2.84 s, sys: 416 ms, total: 3.25 s
Wall time: 612 ms
[15]:
['negative',
 'negative',
 'negative',
 'positive',
 'positive',
 'negative',
 'neutral']
[16]:
%%time

quantized_model.predict([string1, string2, string3, string4, string5, string6, string7])
CPU times: user 11.4 s, sys: 5.83 s, total: 17.2 s
Wall time: 14.5 s
[16]:
['negative',
 'negative',
 'negative',
 'positive',
 'positive',
 'negative',
 'neutral']
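
Wall time depends heavily on the machine, and the first call usually includes graph warm-up, so treat the timings above as indicative only. A rough sketch to time both models yourself, assuming the variables defined earlier in this notebook:

import time

strings = [string1, string2, string3, string4, string5, string6, string7]
for name, m in [('float32', model), ('quantized', quantized_model)]:
    m.predict(strings)  # warm-up call, excluded from timing
    start = time.time()
    m.predict(strings)
    print(name, time.time() - start)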

Predict batch of strings with probability#

def predict_proba(self, strings: List[str]):
    """
    classify list of strings and return probability.

    Parameters
    ----------
    strings: List[str]

    Returns
    -------
    result: List[dict[str, float]]
    """
[17]:
%%time

model.predict_proba([string1, string2, string3, string4, string5, string6, string7])
CPU times: user 2.81 s, sys: 488 ms, total: 3.29 s
Wall time: 594 ms
[17]:
[{'negative': 0.99842525, 'neutral': 0.00012793706, 'positive': 0.001446799},
 {'negative': 0.99937123, 'neutral': 0.00011691217, 'positive': 0.0005116592},
 {'negative': 0.9967385, 'neutral': 0.00035874697, 'positive': 0.0029026929},
 {'negative': 6.8217414e-05, 'neutral': 5.6692165e-06, 'positive': 0.99992657},
 {'negative': 0.00143564, 'neutral': 0.004074719, 'positive': 0.9944896},
 {'negative': 0.9997349, 'neutral': 2.1702668e-05, 'positive': 0.00024327672},
 {'negative': 0.00013910382, 'neutral': 0.99177235, 'positive': 0.008088313}]
[18]:
%%time

quantized_model.predict_proba([string1, string2, string3, string4, string5, string6, string7])
CPU times: user 2.81 s, sys: 608 ms, total: 3.42 s
Wall time: 593 ms
[18]:
[{'negative': 0.9999466, 'neutral': 5.5334813e-06, 'positive': 4.8349822e-05},
 {'negative': 0.99953467, 'neutral': 3.8757844e-05, 'positive': 0.00042694603},
 {'negative': 0.9838357, 'neutral': 0.0016114257, 'positive': 0.014552869},
 {'negative': 7.5439406e-05, 'neutral': 5.8909416e-05, 'positive': 0.9998656},
 {'negative': 0.0016770269, 'neutral': 0.0050235414, 'positive': 0.9932996},
 {'negative': 0.99939466, 'neutral': 3.1535008e-05, 'positive': 0.0005735934},
 {'negative': 0.00060900406, 'neutral': 0.986869, 'positive': 0.012522108}]
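
The quantized probabilities differ slightly from the float32 ones, but the predicted labels agree on this batch. A quick sketch to measure label agreement on your own data:

strings = [string1, string2, string3, string4, string5, string6, string7]
preds_float = model.predict(strings)
preds_quant = quantized_model.predict(strings)
sum(a == b for a, b in zip(preds_float, preds_quant)) / len(strings)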

Open sentiment visualization dashboard#

By default, when you call predict_words, it will open a browser with the visualization dashboard; you can disable this by passing visualization = False.

def predict_words(
    self,
    string: str,
    method: str = 'last',
    bins_size: float = 0.05,
    visualization: bool = True,
):
    """
    classify words.

    Parameters
    ----------
    string : str
    method : str, optional (default='last')
        Attention layer supported. Allowed values:

        * ``'last'`` - attention from last layer.
        * ``'first'`` - attention from first layer.
        * ``'mean'`` - average attentions from all layers.
    bins_size: float, optional (default=0.05)
        default bins size for word distribution histogram.
    visualization: bool, optional (default=True)
        If True, it will open the visualization dashboard.

    Returns
    -------
    dictionary: results
    """
[7]:
quantized_model.predict_words(string4, bins_size = 0.01)
2022-10-12 10:55:19.689328: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 98304000 exceeds 10% of free system memory.
2022-10-12 10:55:22.076725: I tensorflow/stream_executor/cuda/cuda_blas.cc:1760] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
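
If you only want the underlying scores without opening a browser, pass visualization = False and inspect the returned dictionary; the exact keys depend on the Malaya version, so treat this as a sketch:

result = quantized_model.predict_words(string4, visualization = False)
print(result.keys())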

Vectorize#

Let's say you want to visualize sentence / word level representations in a lower dimension; you can use model.vectorize.

def vectorize(self, strings: List[str], method: str = 'first'):
    """
    vectorize list of strings.

    Parameters
    ----------
    strings: List[str]
    method : str, optional (default='first')
        Vectorization layer supported. Allowed values:

        * ``'last'`` - vector from last sequence.
        * ``'first'`` - vector from first sequence.
        * ``'mean'`` - average vectors from all sequences.
        * ``'word'`` - average vectors based on tokens.

    Returns
    -------
    result: np.array
    """

Sentence level#

[5]:
r = quantized_model.vectorize([string1, string2, string3, string4], method = 'first')
[6]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# note: recent scikit-learn versions require perplexity < n_samples;
# with only 4 sentences, pass TSNE(perplexity = 3) if the default errors
tsne = TSNE().fit_transform(r)
tsne.shape
[6]:
(4, 2)
[7]:
plt.figure(figsize = (7, 7))
plt.scatter(tsne[:, 0], tsne[:, 1])
labels = [string1, string2, string3, string4]
for label, x, y in zip(
    labels, tsne[:, 0], tsne[:, 1]
):
    label = (
        '%s, %.3f' % (label[0], label[1])
        if isinstance(label, list)
        else label
    )
    plt.annotate(
        label,
        xy = (x, y),
        xytext = (0, 0),
        textcoords = 'offset points',
    )
[image: _images/load-sentiment_33_0.png, t-SNE scatter of the four sentence vectors]

Word level#

[8]:
r = quantized_model.vectorize([string1, string2, string3, string4], method = 'word')
[9]:
# split tokens and vectors: each row of `r` is a list of (token, vector) pairs
x, y = [], []
for row in r:
    x.extend([i[0] for i in row])  # tokens
    y.extend([i[1] for i in row])  # their vectors
[10]:
tsne = TSNE().fit_transform(y)
tsne.shape
[10]:
(129, 2)
[11]:
plt.figure(figsize = (7, 7))
plt.scatter(tsne[:, 0], tsne[:, 1])
labels = x
for label, x, y in zip(
    labels, tsne[:, 0], tsne[:, 1]
):
    label = (
        '%s, %.3f' % (label[0], label[1])
        if isinstance(label, list)
        else label
    )
    plt.annotate(
        label,
        xy = (x, y),
        xytext = (0, 0),
        textcoords = 'offset points',
    )
[image: _images/load-sentiment_38_0.png, t-SNE scatter of word-level vectors]

Pretty good, the model is able to cluster the top left as positive sentiment and the bottom right as negative sentiment.

Stacking models#

For more information, you can read https://malaya.readthedocs.io/en/latest/Stack.html

[14]:
multinomial = malaya.sentiment.multinomial()
alxlnet = malaya.sentiment.transformer(model = 'alxlnet')
[15]:
malaya.stack.predict_stack([multinomial, alxlnet, model],
                           [string1, string2, string3, string4, string5, string6, string7])
[15]:
[{'negative': 0.8274227410462278,
  'neutral': 0.0023870527199649325,
  'positive': 0.004245345202445141},
 {'negative': 0.7879056397481532,
  'neutral': 0.0008830732854325976,
  'positive': 0.005592372403475869},
 {'negative': 0.7093363019692934,
  'neutral': 0.0055568261277720325,
  'positive': 0.12551631691737208},
 {'negative': 0.0034061439340994006,
  'neutral': 0.001578504598740415,
  'positive': 0.7000850686374817},
 {'negative': 0.008312327088611552,
  'neutral': 0.010385559647356483,
  'positive': 0.7007443982567323},
 {'negative': 0.946942812698766,
  'neutral': 9.221052225822323e-05,
  'positive': 0.0005205126361222257},
 {'negative': 0.0008040273612030484,
  'neutral': 0.6952594227897851,
  'positive': 0.003661718247768115}]
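
Note the stacked probabilities do not sum to 1 because the per-model probabilities are combined rather than renormalized. One common aggregation is the geometric mean; the sketch below (a hypothetical helper, not necessarily Malaya's exact method, see the Stack documentation linked above) shows the idea:

import numpy as np

def gmean_stack(prob_dicts):
    # geometric mean of each label's probability across models;
    # the result intentionally does not sum to 1
    labels = prob_dicts[0].keys()
    return {
        label: float(np.exp(np.mean([np.log(d[label]) for d in prob_dicts])))
        for label in labels
    }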