Classification HuggingFace#

This tutorial is available as an IPython notebook at Malaya/example/zeroshot-classification-huggingface.

This module was trained on both standard and local (including social media) language structures, so it is safe to use for both.

import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'
import logging

import malaya
CPU times: user 3.1 s, sys: 3.47 s, total: 6.57 s
Wall time: 2.28 s

What is zero-shot classification#

Usually we train a supervised machine learning model on a fixed set of labels, for example negative / positive for sentiment, or anger / happy / sadness for emotion. Such a model cannot tell us how much ‘jealous’ there is in a text, because its supported labels are only {anger, happy, sadness}. It has never seen a single example labelled ‘jealous’, so identifying one is impossible. Zero-shot classification tries to solve this problem.

Zero-shot learning refers to the process by which a machine learns to recognize objects (images, text, or any other features) without any labeled training data to help in the classification.

Yin et al. (2019) stated in their paper that any pretrained language model finetuned on text similarity can act as an out-of-the-box zero-shot text classifier.

So, we are going to use transformer models from malaya.similarity.semantic.huggingface with a few tweaks.
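
The idea can be sketched without any model at all. In the minimal sketch below, `zero_shot_scores` and `toy_similarity` are hypothetical names, and the word-overlap scorer is only a stand-in for a finetuned transformer. The point is that the labels are supplied at prediction time, not at training time.

```python
from typing import Callable, Dict, List


def toy_similarity(text: str, label: str) -> float:
    """Stand-in scorer (NOT the Malaya model): Jaccard overlap of lowercase tokens."""
    a, b = set(text.lower().split()), set(label.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def zero_shot_scores(
    text: str,
    labels: List[str],
    score_fn: Callable[[str, str], float],
) -> Dict[str, float]:
    """Score the text against every candidate label with a pairwise scorer."""
    return {label: score_fn(text, label) for label in labels}


scores = zero_shot_scores(
    'kerajaan bagi duit bantuan',
    ['kerajaan', 'makanan', 'bantuan rakyat'],
    toy_similarity,
)
best_label = max(scores, key=scores.get)
print(scores, best_label)
```

Swapping `toy_similarity` for a real pairwise model changes the quality of the scores, but not the shape of the procedure.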

List available HuggingFace models#

malaya.zero_shot.classification.available_huggingface()

INFO:malaya.similarity.semantic:tested on matched dev set translated MNLI,
Model                                                         Size (MB)  macro precision  macro recall  macro f1-score
mesolitica/finetune-mnli-t5-super-tiny-standard-bahasa-cased       50.7          0.74562       0.74574         0.74501
mesolitica/finetune-mnli-t5-tiny-standard-bahasa-cased            139.0          0.76584       0.76565         0.76542
mesolitica/finetune-mnli-t5-small-standard-bahasa-cased           242.0          0.78067       0.78063         0.78010
mesolitica/finetune-mnli-t5-base-standard-bahasa-cased            892.0          0.78903       0.79064         0.78918

Load HuggingFace model#

def huggingface(model: str = 'mesolitica/finetune-mnli-t5-small-standard-bahasa-cased', **kwargs):
    """
    Load HuggingFace model to zeroshot text classification.

    Parameters
    ----------
    model: str, optional (default='mesolitica/finetune-mnli-t5-small-standard-bahasa-cased')
        Check available models at `malaya.zero_shot.classification.available_huggingface()`.

    Returns
    -------
    result: malaya.torch_model.huggingface.ZeroShotClassification
    """

model = malaya.zero_shot.classification.huggingface()

Predict batch#

def predict_proba(
    strings: List[str],
    labels: List[str],
    prefix: str = 'ayat ini berkaitan tentang',
    multilabel: bool = True,
):
    """
    Classify a list of strings and return the probability of each label.

    Parameters
    ----------
    strings: List[str]
    labels: List[str]
    prefix: str, optional (default='ayat ini berkaitan tentang')
        Prefix of the labels for zero-shot. Playing around with the prefix can give better results.
    multilabel: bool, optional (default=True)
        If True, the label probabilities can sum to more than 1.0.
    """
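
To make the role of `prefix` concrete, here is a minimal sketch of how each label could be turned into a sentence and paired with each input string. It assumes the prefix and the label are simply joined with a space. `build_pairs` is a hypothetical helper, not part of Malaya, and the library's actual input template may differ.

```python
from typing import List, Tuple


def build_pairs(
    strings: List[str],
    labels: List[str],
    prefix: str = 'ayat ini berkaitan tentang',
) -> List[Tuple[str, str]]:
    """Pair every input string with one prefixed sentence per label (assumed template)."""
    hypotheses = [f'{prefix} {label}' for label in labels]
    return [(s, h) for s in strings for h in hypotheses]


pairs = build_pairs(['tolong order foodpanda jab, lapar'], ['makan', 'buku'])
for text, hypothesis in pairs:
    print(repr(text), '->', repr(hypothesis))
```

Under this assumption, a batch of n strings and m labels produces n × m pairs to score, so the work grows with both the batch size and the number of labels.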

Because this is zero-shot classification, we need to give the candidate labels to the model.

# copied from Twitter

string = 'gov macam bengong, kami nk pilihan raya, gov backdoor, sakai'
model.predict_proba([string], labels = ['najib razak', 'mahathir', 'kerajaan', 'PRU', 'anarki'])
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[{'najib razak': 0.6651765,
  'mahathir': 0.987833,
  'kerajaan': 0.9912515,
  'PRU': 0.9841426,
  'anarki': 0.45587578}]

string = 'tolong order foodpanda jab, lapar'
model.predict_proba([string], labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery'])
[{'makan': 0.9698464,
  'makanan': 0.9735605,
  'novel': 0.19823082,
  'buku': 0.00313239,
  'kerajaan': 0.12976034,
  'food delivery': 0.99331254}]

The model understood that 'order foodpanda' is closely related to 'makan', 'makanan' and 'food delivery'.
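
In practice you usually want a list of labels, not raw probabilities. A minimal sketch of one way to do that is shown below, using the output copied from the call above. `select_labels` is a hypothetical helper and the 0.5 threshold is an arbitrary assumption that you should tune for your own data.

```python
from typing import Dict, List

# Output copied from the predict_proba call above.
result = {
    'makan': 0.9698464,
    'makanan': 0.9735605,
    'novel': 0.19823082,
    'buku': 0.00313239,
    'kerajaan': 0.12976034,
    'food delivery': 0.99331254,
}


def select_labels(probs: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return labels whose probability exceeds the threshold, highest first."""
    kept = [label for label, p in probs.items() if p > threshold]
    return sorted(kept, key=probs.get, reverse=True)


print(select_labels(result))
```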

string = 'kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan'
model.predict_proba([string], labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery',
                                       'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat'])
[{'makan': 0.0004689095,
  'makanan': 0.0026079589,
  'novel': 0.29850212,
  'buku': 0.025044106,
  'kerajaan': 0.76523817,
  'food delivery': 0.0044676424,
  'kerajaan jahat': 0.0023713536,
  'kerajaan prihatin': 0.9468328,
  'bantuan rakyat': 0.9923975}]

Able to infer on mixed MS and EN#

string = 'Hi guys! I noticed semalam & harini dah ramai yang dapat cookies ni kan. So harini i nak share some post mortem of our first batch:'
model.predict_proba([string], labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery',
                                       'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat',
                                       'biskut', 'very helpful', 'sharing experiences',
                                       'sharing session'])
[{'makan': 0.17769064,
  'makanan': 0.94145703,
  'novel': 0.51651853,
  'buku': 0.21957111,
  'kerajaan': 0.11726684,
  'food delivery': 0.903062,
  'kerajaan jahat': 0.33357194,
  'kerajaan prihatin': 0.14763993,
  'bantuan rakyat': 0.5784646,
  'biskut': 0.8355128,
  'very helpful': 0.39513826,
  'sharing experiences': 0.64116335,
  'sharing session': 0.675511}]

model.predict_proba([string], labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery',
                                       'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat',
                                       'biskut', 'very helpful', 'sharing experiences',
                                       'sharing session'],
                   prefix = 'teks ini berkaitan tentang')
[{'makan': 0.23804268,
  'makanan': 0.94474393,
  'novel': 0.8238379,
  'buku': 0.3343829,
  'kerajaan': 0.092507444,
  'food delivery': 0.94236046,
  'kerajaan jahat': 0.15810412,
  'kerajaan prihatin': 0.13604635,
  'bantuan rakyat': 0.55307525,
  'biskut': 0.92333925,
  'very helpful': 0.39841577,
  'sharing experiences': 0.7563246,
  'sharing session': 0.86674726}]
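
To see what changing the prefix actually did, it helps to look at the per-label difference between the two runs. The sketch below uses a subset of the two outputs above. `prefix_shift` is a hypothetical helper written only for this comparison.

```python
from typing import Dict

# A subset of the two outputs above, copied for illustration.
with_ayat = {'novel': 0.51651853, 'biskut': 0.8355128,
             'sharing session': 0.675511, 'kerajaan': 0.11726684}
with_teks = {'novel': 0.8238379, 'biskut': 0.92333925,
             'sharing session': 0.86674726, 'kerajaan': 0.092507444}


def prefix_shift(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, float]:
    """Per-label change in probability when switching prefix, rounded to 3 decimals."""
    return {label: round(after[label] - before[label], 3) for label in before}


shift = prefix_shift(with_ayat, with_teks)
for label, delta in sorted(shift.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{label:<16} {delta:+.3f}')
```

In this example the largest increase is for 'novel', a label you would probably expect to stay low, so a new prefix is worth checking against labels that should score low as well as those that should score high.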

Multiclass but not multilabel#

If you want the probabilities to sum to 1.0 across the labels, set multilabel=False.

string = 'kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan'

model.predict_proba([string], labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery',
                                       'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat',
                                       'biskut', 'very helpful', 'sharing experiences',
                                       'sharing session'], multilabel = False)
[{'makan': 0.0013036507,
  'makanan': 0.0012489067,
  'novel': 0.007235752,
  'buku': 0.0022450346,
  'kerajaan': 0.070251726,
  'food delivery': 0.0042558503,
  'kerajaan jahat': 0.0022728115,
  'kerajaan prihatin': 0.20736308,
  'bantuan rakyat': 0.57145786,
  'biskut': 0.0020565772,
  'very helpful': 0.11333891,
  'sharing experiences': 0.007458821,
  'sharing session': 0.00951114}]
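
The difference between the two modes can be illustrated with plain Python. This is a minimal sketch that assumes the model produces one raw score per label. The function names and the scores are made up, and Malaya's internal computation may differ.

```python
import math
from typing import List


def independent_probs(logits: List[float]) -> List[float]:
    """Sigmoid per label: each probability is independent, so the sum is unconstrained."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def softmax_probs(logits: List[float]) -> List[float]:
    """Softmax across labels: the labels compete and the probabilities sum to 1.0."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, -3.0]  # made-up scores for three labels
print(sum(independent_probs(logits)))  # greater than 1.0 for these scores
print(sum(softmax_probs(logits)))      # 1.0 up to floating-point rounding
```

Use the multilabel behaviour when several labels can apply to the same text, and the multiclass behaviour when exactly one label should win.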

Stacking models#

More information, you can read at

If you want to stack zero-shot classification models, you need to pass the labels as a keyword argument:

malaya.stack.predict_stack([model1, model2], List[str], labels = List[str])

The labels are passed through as **kwargs.

string = 'kerajaan sebenarnya sangat prihatin dengan rakyat, bagi duit bantuan'
labels = ['makan', 'makanan', 'novel', 'buku', 'kerajaan', 'food delivery',
 'kerajaan jahat', 'kerajaan prihatin', 'bantuan rakyat', 'comel', 'kerajaan syg sgt kepada rakyat']
malaya.stack.predict_stack([model, model, model], [string],
                           labels = labels)
[{'makan': 0.00046890916,
  'makanan': 0.0026079628,
  'novel': 0.29850233,
  'buku': 0.02504399,
  'kerajaan': 0.7652382,
  'food delivery': 0.004467653,
  'kerajaan jahat': 0.0023713524,
  'kerajaan prihatin': 0.9468329,
  'bantuan rakyat': 0.99239755,
  'comel': 0.00077307917,
  'kerajaan syg sgt kepada rakyat': 0.9818335}]
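
To give an intuition for what stacking does with these dictionaries, here is a minimal sketch that combines the outputs of several models label by label. It assumes a geometric mean as the aggregation. `stack_gmean` is a hypothetical helper and the two model outputs are invented, so check the stacking documentation for the aggregation `predict_stack` really uses.

```python
import math
from typing import Dict, List


def stack_gmean(results: List[Dict[str, float]]) -> Dict[str, float]:
    """Combine per-model label probabilities with a geometric mean (assumed aggregation)."""
    labels = results[0].keys()
    return {
        label: math.prod(r[label] for r in results) ** (1.0 / len(results))
        for label in labels
    }


# Hypothetical outputs from two different models for the same string.
model_a = {'kerajaan prihatin': 0.9, 'kerajaan jahat': 0.04}
model_b = {'kerajaan prihatin': 0.4, 'kerajaan jahat': 0.01}

stacked = stack_gmean([model_a, model_b])
print(stacked)
```

Under this aggregation, stacking the same model three times leaves the probabilities essentially unchanged, which is consistent with the output above matching the single-model result. Stacking is more useful with different models.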