Classification#

This tutorial is available as an IPython notebook at Malaya/example/zeroshot-classification.

This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.

[1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'
[2]:
%%time
import malaya
CPU times: user 3.27 s, sys: 3.22 s, total: 6.5 s
Wall time: 2.64 s

What is zero-shot classification#

Usually we train a machine learning model in a supervised way on a fixed set of labels, for example negative / positive for sentiment, or anger / happy / sadness for emotion. Such a model cannot tell us how much ‘jealous’ a text expresses, because the only labels it supports are {anger, happy, sadness}. Asking it to recognise a label it has never seen, such as ‘jealous’, is impossible. Zero-shot classification tries to solve this problem.

Zero-shot learning refers to the process by which a machine learns to recognize objects (images, text, or any other features) without any labeled training data to help in the classification.

Yin et al. (2019) stated in their paper that any pretrained language model fine-tuned on text similarity can act as an out-of-the-box zero-shot text classifier.

So we are going to use transformer models from malaya.similarity.semantic.huggingface with a few tweaks.
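The idea can be sketched without any model: each candidate label is turned into a short hypothesis sentence by prepending a prefix, and a pair scorer rates how well the text supports that sentence. In the sketch below, `toy_score` is a hypothetical word-overlap stand-in for the fine-tuned transformer, and `build_hypotheses` and `zero_shot_scores` are illustrative names, not part of Malaya.

```python
from typing import Callable, Dict, List


def build_hypotheses(labels: List[str], prefix: str) -> List[str]:
    """Turn each label into a hypothesis sentence: '<prefix> <label>'."""
    return [f'{prefix} {label}' for label in labels]


def toy_score(text: str, hypothesis: str) -> float:
    """Hypothetical stand-in for a similarity model: the fraction of
    hypothesis words that also appear in the text."""
    text_words = set(text.lower().split())
    hyp_words = hypothesis.lower().split()
    return sum(word in text_words for word in hyp_words) / len(hyp_words)


def zero_shot_scores(
    text: str,
    labels: List[str],
    score_fn: Callable[[str, str], float],
    prefix: str = 'ayat ini berkaitan tentang',
) -> Dict[str, float]:
    """Score every (text, hypothesis) pair; no label-specific training is needed."""
    hypotheses = build_hypotheses(labels, prefix)
    return {label: score_fn(text, hyp) for label, hyp in zip(labels, hypotheses)}


scores = zero_shot_scores('saya suka makan nasi', ['makan', 'kerajaan'], toy_score)
```

Because the labels only enter through the hypothesis text, new labels can be added at prediction time without retraining; a real model simply replaces `toy_score`.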

List available HuggingFace models#

[3]:
malaya.zero_shot.classification.available_huggingface
[3]:
{'mesolitica/finetune-mnli-nanot5-small': {'Size (MB)': 148,
  'macro precision': 0.87125,
  'macro recall': 0.87131,
  'macro f1-score': 0.87127},
 'mesolitica/finetune-mnli-nanot5-base': {'Size (MB)': 892,
  'macro precision': 0.78903,
  'macro recall': 0.79064,
  'macro f1-score': 0.78918}}
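Since the listing is a plain dictionary, it can be queried with ordinary Python. A minimal sketch, with the metadata copied from the output above into a local variable named `available` (a name chosen here for illustration):

```python
# Metadata copied verbatim from the listing above.
available = {
    'mesolitica/finetune-mnli-nanot5-small': {
        'Size (MB)': 148,
        'macro precision': 0.87125,
        'macro recall': 0.87131,
        'macro f1-score': 0.87127,
    },
    'mesolitica/finetune-mnli-nanot5-base': {
        'Size (MB)': 892,
        'macro precision': 0.78903,
        'macro recall': 0.79064,
        'macro f1-score': 0.78918,
    },
}

# Pick the model with the highest macro F1, and the one with the smallest size.
best_f1 = max(available, key=lambda name: available[name]['macro f1-score'])
smallest = min(available, key=lambda name: available[name]['Size (MB)'])
```

In this listing both criteria select the same model, because the small variant reports the higher macro F1.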

Load HuggingFace model#

def huggingface(
    model: str = 'mesolitica/finetune-mnli-t5-small-standard-bahasa-cased',
    force_check: bool = True,
    **kwargs,
):
    """
    Load HuggingFace model to zeroshot text classification.

    Parameters
    ----------
    model: str, optional (default='mesolitica/finetune-mnli-t5-small-standard-bahasa-cased')
        Check available models at `malaya.zero_shot.classification.available_huggingface()`.
    force_check: bool, optional (default=True)
        Force check model one of malaya model.
        Set to False if you have your own huggingface model.

    Returns
    -------
    result: malaya.torch_model.huggingface.ZeroShotClassification
    """
[16]:
model = malaya.zero_shot.classification.huggingface()

Predict batch#

def predict_proba(
    self,
    strings: List[str],
    labels: List[str],
    prefix: str = 'ayat ini berkaitan tentang',
    multilabel: bool = True,
):
    """
    classify list of strings and return probability.

    Parameters
    ----------
    strings: List[str]
    labels: List[str]
    prefix: str, optional (default='ayat ini berkaitan tentang')
        prefix of labels to zero shot. Playing around with prefix can get better results.
    multilabel: bool, optional (default=True)
        probability of labels can be more than 1.0
    """

Because this is zero-shot classification, we need to supply the candidate labels to the model.

[5]:
# copied from Twitter

string = 'gov macam bengong, kami nk pilihan raya, gov backdoor, sakai'
[6]:
model.predict_proba([string], labels = ['najib razak', 'mahathir', 'kerajaan', 'PRU', 'anarki'])
[6]:
[{'najib razak': 0.089086466,
  'mahathir': 0.8503896,
  'kerajaan': 0.31621307,
  'PRU': 0.5521264,
  'anarki': 0.018142236}]
[7]:
string = 'tolong order foodpanda jab, lapar'
[8]:
model.predict_proba([string], labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery'])
[8]:
[{'makan': 0.9966216,
  'makanan': 0.9912846,
  'novel': 0.01200958,
  'buku': 0.0026836568,
  'kerajaan': 0.005800651,
  'food delivery': 0.94829154}]

The model understood that ordering from foodpanda is closely related to makan, makanan and food delivery.
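The returned value is a list with one dictionary per input string, so turning probabilities into decisions is a small post-processing step. A minimal sketch using the output above; the helper `labels_above` and the 0.5 threshold are assumptions for illustration, not part of Malaya:

```python
from typing import Dict, List

# Output of the foodpanda example above (first and only item of the list).
result = {
    'makan': 0.9966216,
    'makanan': 0.9912846,
    'novel': 0.01200958,
    'buku': 0.0026836568,
    'kerajaan': 0.005800651,
    'food delivery': 0.94829154,
}


def labels_above(probs: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return labels whose probability exceeds the threshold, highest first."""
    kept = [label for label, p in probs.items() if p > threshold]
    return sorted(kept, key=lambda label: probs[label], reverse=True)


selected = labels_above(result)
```

A suitable threshold depends on the labels and the prefix, so it is worth checking on a few examples before fixing it.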

[9]:
string = 'kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan'
[10]:
model.predict_proba([string], labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery',
                                       'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat'])
[10]:
[{'makan': 0.0023917605,
  'makanan': 0.002768525,
  'novel': 0.0035945452,
  'buku': 0.0028883144,
  'kerajaan': 0.9981665,
  'food delivery': 0.0029965744,
  'kerajaan jahat': 0.95778364,
  'kerajaan prihatin': 0.9981933,
  'bantuan rakyat': 0.99804246}]

Able to infer on mixed MS and EN#

[11]:
string = 'Hi guys! I noticed semalam & harini dah ramai yang dapat cookies ni kan. So harini i nak share some post mortem of our first batch:'
[12]:
model.predict_proba([string], labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery',
                                       'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat',
                                       'biskut', 'very helpful', 'sharing experiences',
                                       'sharing session'])
[12]:
[{'makan': 0.007691883,
  'makanan': 0.997271,
  'novel': 0.039510652,
  'buku': 0.03565315,
  'kerajaan': 0.0074525476,
  'food delivery': 0.9393526,
  'kerajaan jahat': 0.0053522647,
  'kerajaan prihatin': 0.011083162,
  'bantuan rakyat': 0.060150616,
  'biskut': 0.9302781,
  'very helpful': 0.07355973,
  'sharing experiences': 0.9778896,
  'sharing session': 0.014371477}]
[13]:
model.predict_proba([string], labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery',
                                       'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat',
                                       'biskut', 'very helpful', 'sharing experiences',
                                       'sharing session'],
                   prefix = 'teks ini berkaitan tentang')
[13]:
[{'makan': 0.0014243807,
  'makanan': 0.004838416,
  'novel': 0.0019961353,
  'buku': 0.003897282,
  'kerajaan': 0.004189471,
  'food delivery': 0.97480994,
  'kerajaan jahat': 0.0018161167,
  'kerajaan prihatin': 0.0054033417,
  'bantuan rakyat': 0.0054734466,
  'biskut': 0.018219633,
  'very helpful': 0.03659028,
  'sharing experiences': 0.98463523,
  'sharing session': 0.013350475}]
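The two runs above differ only in the prefix, so comparing them shows how sensitive each label is to the wording. A minimal sketch that lists the labels whose decision changes between the two outputs; the helper `flipped_labels` and the 0.5 threshold are assumptions for illustration:

```python
from typing import Dict, List

# Output with the default prefix 'ayat ini berkaitan tentang'.
default_prefix = {
    'makan': 0.007691883, 'makanan': 0.997271, 'novel': 0.039510652,
    'buku': 0.03565315, 'kerajaan': 0.0074525476, 'food delivery': 0.9393526,
    'kerajaan jahat': 0.0053522647, 'kerajaan prihatin': 0.011083162,
    'bantuan rakyat': 0.060150616, 'biskut': 0.9302781,
    'very helpful': 0.07355973, 'sharing experiences': 0.9778896,
    'sharing session': 0.014371477,
}

# Output with prefix = 'teks ini berkaitan tentang'.
teks_prefix = {
    'makan': 0.0014243807, 'makanan': 0.004838416, 'novel': 0.0019961353,
    'buku': 0.003897282, 'kerajaan': 0.004189471, 'food delivery': 0.97480994,
    'kerajaan jahat': 0.0018161167, 'kerajaan prihatin': 0.0054033417,
    'bantuan rakyat': 0.0054734466, 'biskut': 0.018219633,
    'very helpful': 0.03659028, 'sharing experiences': 0.98463523,
    'sharing session': 0.013350475,
}


def flipped_labels(
    a: Dict[str, float], b: Dict[str, float], threshold: float = 0.5
) -> List[str]:
    """Labels that are above the threshold under one prefix but not the other."""
    return [label for label in a if (a[label] > threshold) != (b[label] > threshold)]


changed = flipped_labels(default_prefix, teks_prefix)
```

Here `food delivery` and `sharing experiences` stay high under both prefixes, while `makanan` and `biskut` drop, which illustrates why it pays to try more than one prefix.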

Multiclass but not multilabel#

To make the probabilities sum to 1.0 across all labels, set multilabel=False.

[14]:
string = 'kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan'

model.predict_proba([string], labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery',
                                       'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat',
                                       'biskut', 'very helpful', 'sharing experiences',
                                       'sharing session'], multilabel = False)
[14]:
[{'makan': 0.00062935066,
  'makanan': 0.00067746383,
  'novel': 0.0007715335,
  'buku': 0.0006922778,
  'kerajaan': 0.2833456,
  'food delivery': 0.0007045073,
  'kerajaan jahat': 0.05875754,
  'kerajaan prihatin': 0.28552753,
  'bantuan rakyat': 0.27457199,
  'biskut': 0.0007160352,
  'very helpful': 0.09099287,
  'sharing experiences': 0.0012673552,
  'sharing session': 0.0013456849}]
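The difference between the two modes can be illustrated with a generic sketch: score each label independently, or normalize the scores across labels. The code below uses a sigmoid and a softmax over hypothetical raw scores; whether Malaya normalizes in exactly this way is an assumption, and the function names are illustrative only.

```python
import math
from typing import Dict


def independent_probs(logits: Dict[str, float]) -> Dict[str, float]:
    """Multilabel style: squash each label's score on its own with a sigmoid."""
    return {label: 1.0 / (1.0 + math.exp(-z)) for label, z in logits.items()}


def normalized_probs(logits: Dict[str, float]) -> Dict[str, float]:
    """Single-label style: softmax across labels so the values sum to 1.0."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(z - peak) for label, z in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Hypothetical raw scores, not produced by any Malaya model.
logits = {'kerajaan': 4.0, 'bantuan rakyat': 3.5, 'novel': -5.0}
multi = independent_probs(logits)    # several labels can be close to 1.0
single = normalized_probs(logits)    # labels compete for a total of 1.0
```

With normalization, several equally plausible labels split the probability mass between them, which matches the pattern in the output above where the top labels each receive roughly 0.28.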

Stacking models#

For more information, see https://malaya.readthedocs.io/en/latest/Stack.html

If you want to stack zero-shot classification models, you need to pass the labels as a keyword argument:

malaya.stack.predict_stack([model1, model2], List[str], labels = List[str])

The labels are passed as **kwargs.

[15]:
string = 'kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan'
labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery',
 'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat', 'comel', 'kerajaan syg sgt kepada rakyat']
malaya.stack.predict_stack([model, model, model], [string],
                           labels = labels)
[15]:
[{'makan': 0.0023917593,
  'makanan': 0.002768525,
  'novel': 0.0035945452,
  'buku': 0.0028883128,
  'kerajaan': 0.9981665,
  'food delivery': 0.0029965725,
  'kerajaan jahat': 0.95778376,
  'kerajaan prihatin': 0.9981934,
  'bantuan rakyat': 0.9980425,
  'comel': 0.0031943405,
  'kerajaan syg sgt kepada rakyat': 0.99586475}]
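Stacking the same model three times returns nearly the same numbers as a single call, which is what any averaging of identical predictions would give. A minimal sketch of one such aggregation, assuming a geometric mean over per-model probabilities; this choice is an assumption here, and the Stack page linked above documents the actual behaviour. `geometric_mean_stack` is an illustrative name, not a Malaya function.

```python
import math
from typing import Dict, List


def geometric_mean_stack(predictions: List[Dict[str, float]]) -> Dict[str, float]:
    """Combine per-model label probabilities with a geometric mean.

    Assumes every probability is strictly positive and that all
    dictionaries share the same labels.
    """
    n = len(predictions)
    return {
        label: math.exp(sum(math.log(p[label]) for p in predictions) / n)
        for label in predictions[0]
    }


# Two values taken from the single-model output earlier in this page.
one_model = {'kerajaan': 0.9981665, 'novel': 0.0035945452}
stacked = geometric_mean_stack([one_model, one_model, one_model])

# Hypothetical disagreement between two models on one label.
mixed = geometric_mean_stack([{'kerajaan': 0.9}, {'kerajaan': 0.4}])
```

Stacking becomes useful when the models differ: a geometric mean pulls the combined score toward the lower value, so a label scores high only when the models agree.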