Classification#

This tutorial is available as an IPython notebook at Malaya/example/zeroshot-classification.

This module trained on both standard and local (included social media) language structures, so it is save to use for both.

This interface deprecated, use HuggingFace interface instead.

[1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'
[2]:
import logging

logging.basicConfig(level=logging.INFO)
[3]:
%%time
import malaya
CPU times: user 3.16 s, sys: 3.57 s, total: 6.73 s
Wall time: 2.17 s
/home/husein/dev/malaya/malaya/tokenizer.py:208: FutureWarning: Possible nested set at position 3372
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/husein/dev/malaya/malaya/tokenizer.py:208: FutureWarning: Possible nested set at position 3890
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
[4]:
import warnings
warnings.filterwarnings('default')

what is zero-shot classification#

Commonly we supervised a machine learning on specific labels, negative / positive for sentiment, anger / happy / sadness for emotion and etc. The model cannot give an output if we want to know how much percentage of ‘jealous’ in emotion analysis model because supported labels are only {anger, happy, sadness}. Imagine, for example, trying to identify a text without ever having seen one ‘jealous’ label before, impossible. So, zero-shot trying to solve this problem.

zero-shot learning refers to the process by which a machine learns how to recognize objects (image, text, any features) without any labeled training data to help in the classification.

Yin et al. (2019) stated in his paper, any pretrained language model finetuned on text similarity actually can acted as an out-of-the-box zero-shot text classifier.

So, we are going to use transformer models from malaya.similarity.semantic.transformer with a little tweaks.

List available Transformer models#

[5]:
malaya.zero_shot.classification.available_transformer()
/home/husein/dev/malaya/malaya/zero_shot/classification.py:20: DeprecationWarning: `malaya.zero_shot.classification.available_transformer` is deprecated, use `malaya.zero_shot.classification.available_huggingface` instead
  warnings.warn(
INFO:malaya.similarity.semantic:tested on matched dev set translated MNLI, https://huggingface.co/datasets/mesolitica/translated-MNLI
[5]:
Size (MB) Quantized Size (MB) macro precision macro recall macro f1-score
bert 423.4 111.0 0.88315 0.88656 0.88405
tiny-bert 56.6 15.0 0.87210 0.87546 0.87292
albert 48.3 12.8 0.87164 0.87146 0.87155
tiny-albert 21.9 6.0 0.82234 0.82383 0.82295
xlnet 448.7 119.0 0.80866 0.76775 0.77112
alxlnet 49.0 13.9 0.88756 0.88700 0.88727

Load transformer model#

In this example, I am going to load alxlnet, feel free to use any available models above.

def transformer(model: str = 'bert', quantized: bool = False, **kwargs):
    """
    Load Transformer zero-shot model.

    Parameters
    ----------
    model: str, optional (default='bert')
        Check available models at `malaya.zero_shot.classification.available_transformer()`.
    quantized: bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result: model
        List of model classes:

        * if `bert` in model, will return `malaya.model.bert.ZeroshotBERT`.
        * if `xlnet` in model, will return `malaya.model.xlnet.ZeroshotXLNET`.
    """
[6]:
model = malaya.zero_shot.classification.transformer(model = 'alxlnet')
/home/husein/dev/malaya/malaya/zero_shot/classification.py:58: DeprecationWarning: `malaya.zero_shot.classification.transformer` is deprecated, use `malaya.zero_shot.classification.huggingface` instead
  warnings.warn(
2022-11-02 21:52:14.649973: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-02 21:52:14.654002: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-11-02 21:52:14.654018: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: husein-MS-7D31
2022-11-02 21:52:14.654025: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: husein-MS-7D31
2022-11-02 21:52:14.654096: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
2022-11-02 21:52:14.654115: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 470.141.3

Load Quantized model#

To load 8-bit quantized model, simply pass quantized = True, default is False.

We can expect slightly accuracy drop from quantized model, and not necessary faster than normal 32-bit float model, totally depends on machine.

[7]:
quantized_model = malaya.zero_shot.classification.transformer(model = 'alxlnet', quantized = True)
WARNING:malaya_boilerplate.huggingface:Load quantized model will cause accuracy drop.

predict batch#

def predict_proba(self, strings: List[str], labels: List[str]):
    """
    classify list of strings and return probability.

    Parameters
    ----------
    strings : List[str]
    labels : List[str]

    Returns
    -------
    list: list of float
    """

Because it is a zero-shot, we need to give labels for the model.

[8]:
# copy from twitter

string = 'gov macam bengong, kami nk pilihan raya, gov backdoor, sakai'
[9]:
model.predict_proba([string], labels = ['najib razak', 'mahathir', 'kerajaan', 'PRU', 'anarki'])
[9]:
[{'najib razak': 0.023044204,
  'mahathir': 0.03988692,
  'kerajaan': 0.038086187,
  'PRU': 0.94541705,
  'anarki': 0.022404708}]
[10]:
quantized_model.predict_proba([string], labels = ['najib razak', 'mahathir', 'kerajaan', 'PRU', 'anarki'])
[10]:
[{'najib razak': 0.0032942458,
  'mahathir': 0.1925324,
  'kerajaan': 0.11548952,
  'PRU': 0.25383464,
  'anarki': 0.08174919}]

Quite good.

[11]:
string = 'tolong order foodpanda jab, lapar'
[12]:
model.predict_proba([string], labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery'])
[12]:
[{'makan': 0.15020105,
  'makanan': 0.8327125,
  'novel': 0.0066710957,
  'buku': 0.0027667829,
  'kerajaan': 0.0004485665,
  'food delivery': 0.89098865}]

the model understood order foodpanda got close relationship with makan, makanan and food delivery.

[13]:
string = 'kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan'
[14]:
model.predict_proba([string], labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery',
                                       'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat'])
[14]:
[{'makan': 0.0055492218,
  'makanan': 0.0069967406,
  'novel': 0.007987897,
  'buku': 0.00091687334,
  'kerajaan': 0.95299816,
  'food delivery': 0.00995513,
  'kerajaan jahat': 0.24344432,
  'kerajaan prihatin': 0.98162764,
  'bantuan rakyat': 0.9598133}]

Vectorize#

Let say you want to visualize sentence / word level in lower dimension, you can use model.vectorize,

def vectorize(
    self, strings: List[str], labels: List[str], method: str = 'first'
):
    """
    vectorize a string.

    Parameters
    ----------
    strings: List[str]
    labels : List[str]
    method : str, optional (default='first')
        Vectorization layer supported. Allowed values:

        * ``'last'`` - vector from last sequence.
        * ``'first'`` - vector from first sequence.
        * ``'mean'`` - average vectors from all sequences.
        * ``'word'`` - average vectors based on tokens.


    Returns
    -------
    result: np.array
    """

Sentence level#

[4]:
texts = ['kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan',
        'gov macam bengong, kami nk pilihan raya, gov backdoor, sakai',
        'tolong order foodpanda jab, lapar',
        'Hapuskan vernacular school first, only then we can talk about UiTM']
labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery',
          'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat']
r = quantized_model.vectorize(texts, labels, method = 'first')

vectorize method from zeroshot classification model will returned 2 values, (combined, vector).

[5]:
r[0][:5]
[5]:
[('kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan',
  'makan'),
 ('kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan',
  'makanan'),
 ('kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan',
  'novel'),
 ('kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan',
  'buku'),
 ('kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan',
  'kerajaan')]
[6]:
r[1]
[6]:
array([[-0.00587193, -0.7214614 , -0.7524409 , ...,  0.31107777,
         1.022762  ,  0.28308758],
       [ 0.63863456,  0.12698255,  0.67567766, ...,  0.7627216 ,
         0.56795114, -0.37056473],
       [-0.90291303,  0.93581504,  0.05650915, ...,  0.5578094 ,
         1.1304276 ,  0.5470246 ],
       ...,
       [-2.1161728 , -1.4592253 ,  0.5284856 , ...,  0.28636536,
        -0.36558965, -0.8226106 ],
       [-2.2050292 , -0.14624506,  0.19812807, ...,  0.1307496 ,
        -0.20792441,  0.18430969],
       [-2.5969799 ,  0.4205628 ,  0.18376699, ...,  0.124988  ,
        -0.9915105 , -0.10085672]], dtype=float32)
[7]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

tsne = TSNE().fit_transform(r[1])
tsne.shape
[7]:
(36, 2)
[8]:
unique_labels = list(set([i[1] for i in r[0]]))
palette = plt.cm.get_cmap('hsv', len(unique_labels))
[9]:
plt.figure(figsize = (7, 7))

for label in unique_labels:
    indices = [i for i in range(len(r[0])) if r[0][i][1] == label]
    plt.scatter(tsne[indices, 0], tsne[indices, 1], cmap = palette(unique_labels.index(label)),
               label = label)

labels = [i[0] for i in r[0]]
for label, x, y in zip(
    labels, tsne[:, 0], tsne[:, 1]
):
    label = (
        '%s, %.3f' % (label[0], label[1])
        if isinstance(label, list)
        else label
    )
    plt.annotate(
        label,
        xy = (x, y),
        xytext = (0, 0),
        textcoords = 'offset points',
    )
plt.legend()
[9]:
<matplotlib.legend.Legend at 0x14c7588d0>
_images/load-zeroshot-classification_34_1.png

Word level#

[28]:
texts = ['kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan',
        'gov macam bengong, kami nk pilihan raya, gov backdoor, sakai',
        'tolong order foodpanda jab, lapar',
        'Hapuskan vernacular school first, only then we can talk about UiTM']
labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery',
          'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat']
r = quantized_model.vectorize(texts, labels, method = 'word')
[29]:
x, y, labels = [], [], []
for no, row in enumerate(r[1]):
    x.extend([i[0] for i in row])
    y.extend([i[1] for i in row])
    labels.extend([r[0][no][1]] * len(row))
[30]:
tsne = TSNE().fit_transform(y)
tsne.shape
[30]:
(315, 2)
[31]:
unique_labels = list(set(labels))
palette = plt.cm.get_cmap('hsv', len(unique_labels))
[32]:
plt.figure(figsize = (7, 7))

for label in unique_labels:
    indices = [i for i in range(len(labels)) if labels[i] == label]
    plt.scatter(tsne[indices, 0], tsne[indices, 1], cmap = palette(unique_labels.index(label)),
               label = label)

labels = x
for label, x, y in zip(
    labels, tsne[:, 0], tsne[:, 1]
):
    label = (
        '%s, %.3f' % (label[0], label[1])
        if isinstance(label, list)
        else label
    )
    plt.annotate(
        label,
        xy = (x, y),
        xytext = (0, 0),
        textcoords = 'offset points',
    )
plt.legend()
[32]:
<matplotlib.legend.Legend at 0x14b158a10>
_images/load-zeroshot-classification_40_1.png

Stacking models#

More information, you can read at https://malaya.readthedocs.io/en/latest/Stack.html

If you want to stack zero-shot classification models, you need to pass labels using keyword parameter,

malaya.stack.predict_stack([model1, model2], List[str], labels = List[str])

We will passed labels as **kwargs.

[10]:
alxlnet = malaya.zero_shot.classification.transformer(model = 'alxlnet')
albert = malaya.zero_shot.classification.transformer(model = 'albert')
tiny_bert = malaya.zero_shot.classification.transformer(model = 'tiny-bert')
WARNING:tensorflow:From /usr/local/lib/python3.7/site-packages/albert/tokenization.py:240: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

INFO:tensorflow:loading sentence piece model
[11]:
string = 'kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan'
labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery',
 'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat']
malaya.stack.predict_stack([alxlnet, albert, tiny_bert], [string],
                           labels = labels)
[11]:
[{'makan': 0.0044827852,
  'makanan': 0.0027062024,
  'novel': 0.0020867025,
  'buku': 0.013082165,
  'kerajaan': 0.8859287,
  'food delivery': 0.0028363755,
  'kerajaan jahat': 0.018133936,
  'kerajaan prihatin': 0.9922408,
  'bantuan rakyat': 0.909674}]
[ ]: