Entity Recognition HuggingFace#

This tutorial is available as an IPython notebook at Malaya/example/zeroshot-ner-huggingface.

This module was trained only on standard language structure, so it is not safe to use on local (colloquial) language structure.

[1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'
[2]:
import logging

logging.basicConfig(level=logging.INFO)
[3]:
%%time
import malaya
CPU times: user 2.98 s, sys: 2.65 s, total: 5.63 s
Wall time: 2.55 s

What is zero-shot entity recognition#

Typically we train a machine learning model with supervision on a fixed set of labels, for example PERSON and NORP. Such a model cannot tell us how many 'POLITICIAN' entities a text contains, because the only labels it supports are {PERSON, NORP}. Asking it to identify 'POLITICIAN' spans without it ever having seen a single 'POLITICIAN' label is impossible. Zero-shot learning tries to solve this problem.

Zero-shot learning refers to the process by which a machine learns to recognize objects (images, text, any features) without any labeled training data to help with the classification.

Following the approach of prompting GPT-J for entity extraction, we finetuned T5 models, https://playground.helloforefront.com/models/free-gpt-j-playground
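To make the idea concrete, the sketch below shows how a zero-shot request could be phrased as plain text, with one prompt per candidate tag, so that a new tag needs no retraining. The template string and the function name `build_prompts` are assumptions made for illustration only; the actual prompt format used by the finetuned T5 models is not shown in this tutorial.

```python
from typing import List

# Hypothetical template, NOT the format used by the Malaya models.
TEMPLATE = 'Extract entities of type "{tag}" from the text: {text}'


def build_prompts(text: str, tags: List[str]) -> List[str]:
    """Return one text prompt per candidate tag."""
    return [TEMPLATE.format(tag=tag, text=text) for tag in tags]


prompts = build_prompts('Khalid Yunus bertanding di Jempol', ['person', 'politician'])
for prompt in prompts:
    print(prompt)
```

Because the tag is just part of the input text, any label can be supplied at inference time, including labels in another language.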

List available HuggingFace models#

[4]:
malaya.zero_shot.entity.available_huggingface()
INFO:malaya.zero_shot.entity:tested on test set, https://huggingface.co/datasets/mesolitica/zeroshot-NER
[4]:
Model                                                            Size (MB)  WER  CER  exactly-match
mesolitica/finetune-zeroshot-ner-t5-tiny-standard-bahasa-cased   139        0    0    0
mesolitica/finetune-zeroshot-ner-t5-small-standard-bahasa-cased  242        0    0    0
mesolitica/finetune-zeroshot-ner-t5-base-standard-bahasa-cased   892        0    0    0
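The listed sizes can guide model selection. As a minimal sketch, the snippet below picks the largest listed model that fits a size budget. The sizes are copied from the table above; the helper name `largest_model_within` and the 500 MB budget are illustrative assumptions, not part of Malaya.

```python
from typing import Dict, Optional

# Sizes in MB, copied from the table above.
MODEL_SIZES_MB: Dict[str, int] = {
    'mesolitica/finetune-zeroshot-ner-t5-tiny-standard-bahasa-cased': 139,
    'mesolitica/finetune-zeroshot-ner-t5-small-standard-bahasa-cased': 242,
    'mesolitica/finetune-zeroshot-ner-t5-base-standard-bahasa-cased': 892,
}


def largest_model_within(budget_mb: int) -> Optional[str]:
    """Return the largest listed model whose size fits the budget, or None."""
    candidates = {
        name: size for name, size in MODEL_SIZES_MB.items() if size <= budget_mb
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


chosen = largest_model_within(500)  # hypothetical 500 MB budget
print(chosen)
```

The returned name can then be passed as the `model` argument when loading the model.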

Load HuggingFace model#

def huggingface(model: str = 'mesolitica/finetune-zeroshot-ner-t5-small-standard-bahasa-cased', **kwargs):
    """
    Load a HuggingFace model for zero-shot NER.

    Parameters
    ----------
    model: str, optional (default='mesolitica/finetune-zeroshot-ner-t5-small-standard-bahasa-cased')
        Check available models at `malaya.zero_shot.entity.available_huggingface()`.

    Returns
    -------
    result: malaya.torch_model.huggingface.ZeroShotNER
    """
[5]:
model = malaya.zero_shot.entity.huggingface()

predict#

def predict(
    self,
    string: str,
    tags: List[str],
    minimum_length: int = 2,
    **kwargs,
):
    """
    Classify entities in a string.

    Parameters
    ----------
    string: str
        We assume the input string has been properly tokenized.
    tags: List[str]
    minimum_length: int, optional (default=2)
        minimum length of string for an entity.
    **kwargs: keyword arguments passed to the HuggingFace `generate` method.
        Read more at https://huggingface.co/docs/transformers/main_classes/text_generation

    Returns
    -------
    result: Dict[str, List[str]]
    """
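To illustrate what a `minimum_length` filter might look like, here is a minimal sketch that drops entity strings shorter than a threshold. It assumes the length is counted in characters, which the docstring does not state, so the library's real behavior may differ; `filter_short` is a hypothetical helper, not a Malaya function.

```python
from typing import Dict, List


def filter_short(
    result: Dict[str, List[str]], minimum_length: int = 2
) -> Dict[str, List[str]]:
    """Drop entity strings shorter than `minimum_length` characters.

    Assumption: length is measured in characters, not tokens.
    """
    return {
        tag: [entity for entity in entities if len(entity) >= minimum_length]
        for tag, entities in result.items()
    }


example = {'person': ['A', 'Khalid Yunus'], 'lokasi': []}
filtered = filter_short(example)
print(filtered)
```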
[6]:
import re

# Minimal cleaning: replace newlines with spaces and collapse repeated spaces.
def cleaning(string):
    string = string.replace('\n', ' ')
    string = re.sub(r'[ ]+', ' ', string).strip()
    return string

x = """
Bekas Ahli Parlimen Jempol empat penggal Khalid Yunus akan membuat penampilan semula pada pilihan raya umum ke-15 (PRU15).

Khalid, 79, yang juga timbalan presiden Parti Bumiputera Perkasa Malaysia (Putra) akan bertanding atas tiket Pejuang menentang Shamshulkahar Mohd Deli daripada Barisan Nasional (BN), calon Perikatan Nasional (PN) Noraffendy Salleh dan calon Pakatan Harapan (PH) Norwani Ahmat.
"""

text = cleaning(x)
tags = ['nama seseorang', 'organisasi', 'kopi', 'parti politik', 'masa', 'food and beverages',
       'age', 'politician', 'ahli politik', 'orang politik', 'parti politik malaysia']

model.predict(
    text,
    tags,
    do_sample=True,
    max_length=50,
    top_k=50,
    top_p=0.95,
)
[6]:
{'nama seseorang': ['Khalid Yunus',
  'Noraffendy Salleh',
  'Khalid,',
  'Norwani Ahmat.',
  'Shamshulkahar Mohd Deli',
  'Ahli Parlimen Jempol'],
 'organisasi': ['Pakatan Harapan (PH)',
  'Pejuang',
  'Perikatan Nasional (PN)',
  'Parti Bumiputera Perkasa Malaysia (Putra)',
  'Barisan Nasional (BN),'],
 'kopi': [],
 'parti politik': [],
 'masa': [],
 'food and beverages': [],
 'age': [],
 'politician': [],
 'ahli politik': [],
 'orang politik': ['Khalid Yunus',
  'Noraffendy Salleh',
  'Khalid,',
  'Norwani Ahmat.',
  'Shamshulkahar Mohd Deli'],
 'parti politik malaysia': []}
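Some entities above keep trailing punctuation ('Khalid,', 'Norwani Ahmat.', 'Barisan Nasional (BN),') because the input was not tokenized before prediction. One possible remedy, sketched below, is to strip trailing punctuation from the predicted strings afterwards. The helper `strip_trailing_punctuation` is hypothetical and not part of Malaya, and it would also trim legitimate trailing periods such as abbreviations.

```python
from typing import Dict, List


def strip_trailing_punctuation(result: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Remove trailing punctuation from each entity and drop duplicates."""
    cleaned = {}
    for tag, entities in result.items():
        unique = []
        for entity in entities:
            entity = entity.rstrip('.,;:!? ')
            # Keep first occurrence only, preserving the original order.
            if entity and entity not in unique:
                unique.append(entity)
        cleaned[tag] = unique
    return cleaned


# A subset of the output shown above.
result = {
    'nama seseorang': ['Khalid Yunus', 'Khalid,', 'Norwani Ahmat.'],
    'organisasi': ['Barisan Nasional (BN),', 'Pejuang'],
    'kopi': [],
}
cleaned = strip_trailing_punctuation(result)
print(cleaned)
```

Tokenizing the input before calling `predict`, as the docstring assumes, would avoid the problem at the source.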
[7]:
text = 'saya nak secawan kopi dengan satu krim dan tiga gula , dan saya sekarang berada di penang airport'
tags = ['person', 'organisasi', 'kopi', 'nama parti politik', 'lokasi', 'syarikat', 'masa',
       'quantity', 'kuantiti', 'coffee', 'airport', 'politician', 'lapangan terbang']

model.predict(
    text,
    tags,
    do_sample=True,
    max_length=50,
    top_k=50,
    top_p=0.95,
)
[7]:
{'person': [],
 'organisasi': [],
 'kopi': ['tiga gula', 'satu krim'],
 'nama parti politik': [],
 'lokasi': ['penang airport'],
 'syarikat': [],
 'masa': [],
 'quantity': ['tiga gula', 'satu krim'],
 'kuantiti': [],
 'coffee': ['tiga gula', 'satu krim'],
 'airport': [],
 'politician': [],
 'lapangan terbang': []}
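The same span can be returned under several tags, as with 'tiga gula' above. A minimal sketch for inspecting this is to invert the result so that each entity maps to the tags that matched it. The helper `tags_per_entity` is an illustrative assumption, not a Malaya function; the dictionary is the output shown above.

```python
from collections import defaultdict
from typing import Dict, List


def tags_per_entity(result: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each predicted entity to the list of tags that returned it."""
    inverted = defaultdict(list)
    for tag, entities in result.items():
        for entity in entities:
            inverted[entity].append(tag)
    return dict(inverted)


# Output of the prediction above.
result = {
    'person': [],
    'organisasi': [],
    'kopi': ['tiga gula', 'satu krim'],
    'nama parti politik': [],
    'lokasi': ['penang airport'],
    'syarikat': [],
    'masa': [],
    'quantity': ['tiga gula', 'satu krim'],
    'kuantiti': [],
    'coffee': ['tiga gula', 'satu krim'],
    'airport': [],
    'politician': [],
    'lapangan terbang': [],
}
inverted = tags_per_entity(result)
print(inverted)
```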
[8]:
text = 'sya nak 1 teh o ais, dan saya sekarang berada di penang airport sambil minum starbucks'
tags = ['person', 'organisasi', 'kopi', 'nama parti politik', 'lokasi', 'syarikat', 'masa',
       'quantity', 'kuantiti', 'coffee', 'airport', 'drink', 'fnb', 'makanan dan minuman']

model.predict(
    text,
    tags,
    do_sample=True,
    max_length=50,
    top_k=50,
    top_p=0.95,
)
[8]:
{'person': [],
 'organisasi': [],
 'kopi': [],
 'nama parti politik': [],
 'lokasi': ['penang airport'],
 'syarikat': [],
 'masa': [],
 'quantity': [],
 'kuantiti': [],
 'coffee': ['1 teh o ais'],
 'airport': [],
 'drink': [],
 'fnb': ['1 teh o ais'],
 'makanan dan minuman': ['1 teh o ais']}

Inference with mixed Malay (MS) and English (EN)#

[9]:
text = 'The Square is located by the historic Harvard Yard where you find buildings dating back as far as the early 18th century .'
tags = ['year', 'person', 'orang', 'org', 'people', 'tahun', 'lokasi', 'tempat', 'masa', 'time']

model.predict(
    text,
    tags,
    do_sample=True,
    max_length=50,
    top_k=50,
    top_p=0.95,
)
[9]:
{'year': [],
 'person': [],
 'orang': [],
 'org': ['Harvard Yard'],
 'people': [],
 'tahun': [],
 'lokasi': ['Square'],
 'tempat': [],
 'masa': [],
 'time': []}
[10]:
x = """
PENAMPANG:  Datuk Seri Anwar Ibrahim is obsessed with becoming  prime minister, said Parti  Warisan president Datuk Seri Mohd Shafie Apdal.

He said this could be seen when Pakatan Harapan chairman Anwar  was released from jail following the coalition's win in the 14th general election.
"""
text = cleaning(x)
tags = ['year', 'person', 'orang', 'org', 'people', 'tahun', 'lokasi', 'tempat', 'masa', 'time', 'position',
       'title']

model.predict(
    text,
    tags,
    do_sample=True,
    max_length=50,
    top_k=50,
    top_p=0.95,
)
[10]:
{'year': [],
 'person': ['Anwar',
  'Datuk Seri Mohd Shafie Apdal.',
  'Datuk Seri Anwar Ibrahim'],
 'orang': ['Anwar',
  'Datuk Seri Mohd Shafie Apdal.',
  'Datuk Seri Anwar Ibrahim'],
 'org': ['Pakatan Harapan', 'Parti Warisan'],
 'people': ['Anwar',
  'Datuk Seri Mohd Shafie Apdal.',
  'Datuk Seri Anwar Ibrahim'],
 'tahun': [],
 'lokasi': [],
 'tempat': [],
 'masa': [],
 'time': [],
 'position': [],
 'title': []}
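In the output above, the synonymous tags 'person', 'orang' and 'people' return the same entities. When tags in both languages are supplied on purpose, one way to consolidate the result is to merge each group of synonyms under a single canonical label, as sketched below. The grouping in `TAG_GROUPS` and the helper `merge_tags` are assumptions for illustration, not part of Malaya.

```python
from typing import Dict, List

# Hypothetical grouping of synonymous tags under one canonical label.
TAG_GROUPS: Dict[str, List[str]] = {
    'person': ['person', 'orang', 'people'],
    'organization': ['org'],
}


def merge_tags(
    result: Dict[str, List[str]], groups: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Union the entities of all alias tags, keeping first-seen order."""
    merged = {}
    for canonical, aliases in groups.items():
        entities = []
        for alias in aliases:
            for entity in result.get(alias, []):
                if entity not in entities:
                    entities.append(entity)
        merged[canonical] = entities
    return merged


# A subset of the output shown above.
people = ['Anwar', 'Datuk Seri Mohd Shafie Apdal.', 'Datuk Seri Anwar Ibrahim']
result = {
    'person': list(people),
    'orang': list(people),
    'org': ['Pakatan Harapan', 'Parti Warisan'],
    'people': list(people),
    'tahun': [],
}
merged = merge_tags(result, TAG_GROUPS)
print(merged)
```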