Entities Recognition

This tutorial is available as an IPython notebook at Malaya/example/entities.

This module only trained on standard language structure, so it is not save to use it for local language structure.

[1]:
%%time
import malaya
CPU times: user 4.79 s, sys: 950 ms, total: 5.74 s
Wall time: 6.79 s

Describe supported entities

[2]:
import pandas as pd
pd.set_option('display.max_colwidth', -1)
malaya.entity.describe()
[2]:
Tag Description
0 OTHER other
1 law law, regulation, related law documents, documents, etc
2 location location, place
3 organization organization, company, government, facilities, etc
4 person person, group of people, believes, unique arts (eg; food, drink), etc
5 quantity numbers, quantity
6 time date, day, time, etc
7 event unique event happened, etc

Describe supported Ontonotes 5 entities

[3]:
malaya.entity.describe_ontonotes5()
[3]:
Tag Description
0 OTHER other
1 ADDRESS Address of physical location.
2 PERSON People, including fictional.
3 NORP Nationalities or religious or political groups.
4 FAC Buildings, airports, highways, bridges, etc.
5 ORG Companies, agencies, institutions, etc.
6 GPE Countries, cities, states.
7 LOC Non-GPE locations, mountain ranges, bodies of water.
8 PRODUCT Objects, vehicles, foods, etc. (Not services.)
9 EVENT Named hurricanes, battles, wars, sports events, etc.
10 WORK_OF_ART Titles of books, songs, etc.
11 LAW Named documents made into laws.
12 LANGUAGE Any named language.
13 DATE Absolute or relative dates or periods.
14 TIME Times smaller than a day.
15 PERCENT Percentage, including "%".
16 MONEY Monetary values, including unit.
17 QUANTITY Measurements, as of weight or distance.
18 ORDINAL "first", "second", etc.
19 CARDINAL Numerals that do not fall under another type.

List available Transformer NER models

[4]:
malaya.entity.available_transformer()
INFO:root:tested on 20% test set.
[4]:
Size (MB) Quantized Size (MB) macro precision macro recall macro f1-score
bert 425.4 111.00 0.99291 0.97864 0.98537
tiny-bert 57.7 15.40 0.98151 0.94754 0.96134
albert 48.6 12.80 0.98026 0.95332 0.96492
tiny-albert 22.4 5.98 0.96100 0.90363 0.92374
xlnet 446.6 118.00 0.99344 0.98154 0.98725
alxlnet 46.8 13.30 0.99215 0.97575 0.98337

Make sure you can check accuracy chart from here first before select a model, https://malaya.readthedocs.io/en/latest/models-accuracy.html#Entities-Recognition

List available Transformer NER Ontonotes 5 models

[5]:
malaya.entity.available_transformer_ontonotes5()
INFO:root:tested on 20% test set.
[5]:
Size (MB) Quantized Size (MB) macro precision macro recall macro f1-score
bert 425.4 111.00 0.94460 0.93244 0.93822
tiny-bert 57.7 15.40 0.91908 0.91635 0.91704
albert 48.6 12.80 0.93010 0.92341 0.92636
tiny-albert 22.4 5.98 0.90298 0.88251 0.89145
xlnet 446.6 118.00 0.93814 0.95021 0.94388
alxlnet 46.8 13.30 0.93244 0.92942 0.93047

Make sure you can check accuracy chart from here first before select a model, https://malaya.readthedocs.io/en/latest/models-accuracy.html#Entities-Recognition-Ontonotes5

[36]:
string = 'KUALA LUMPUR: Sempena sambutan Aidilfitri minggu depan, Perdana Menteri Tun Dr Mahathir Mohamad dan Menteri Pengangkutan Anthony Loke Siew Fook menitipkan pesanan khas kepada orang ramai yang mahu pulang ke kampung halaman masing-masing. Dalam video pendek terbitan Jabatan Keselamatan Jalan Raya (JKJR) itu, Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar  sekiranya mengantuk ketika memandu.'
string1 = 'memperkenalkan Husein, dia sangat comel, berumur 25 tahun, bangsa melayu, agama islam, tinggal di cyberjaya malaysia, bercakap bahasa melayu, semua membaca buku undang-undang kewangan, dengar laju Siti Nurhaliza - Seluruh Cinta sambil makan ayam goreng KFC'

Load Transformer model

def transformer(model: str = 'xlnet', quantized: bool = False, **kwargs):
    """
    Load Transformer Entity Tagging model, transfer learning Transformer + CRF.

    Parameters
    ----------
    model : str, optional (default='bert')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result : malaya.supervised.tag.transformer function
    """
[7]:
model = malaya.entity.transformer(model = 'alxlnet')
INFO:root:running entity/alxlnet using device /device:CPU:0

Load Quantized model

To load 8-bit quantized model, simply pass quantized = True, default is False.

We can expect slightly accuracy drop from quantized model, and not necessary faster than normal 32-bit float model, totally depends on machine.

[8]:
quantized_model = malaya.entity.transformer(model = 'alxlnet', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
INFO:root:running entity/alxlnet-quantized using device /device:CPU:0

Predict

def predict(self, string: str):
    """
    Tag a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: Tuple[str, str]
    """
[9]:
model.predict(string)
[9]:
[('KUALA', 'location'),
 ('LUMPUR', 'location'),
 (':', 'OTHER'),
 ('Sempena', 'OTHER'),
 ('sambutan', 'OTHER'),
 ('Aidilfitri', 'event'),
 ('minggu', 'time'),
 ('depan', 'time'),
 (',', 'OTHER'),
 ('Perdana', 'person'),
 ('Menteri', 'person'),
 ('Tun', 'person'),
 ('Dr', 'person'),
 ('Mahathir', 'person'),
 ('Mohamad', 'person'),
 ('dan', 'OTHER'),
 ('Menteri', 'organization'),
 ('Pengangkutan', 'organization'),
 ('Anthony', 'person'),
 ('Loke', 'person'),
 ('Siew', 'person'),
 ('Fook', 'person'),
 ('menitipkan', 'OTHER'),
 ('pesanan', 'OTHER'),
 ('khas', 'OTHER'),
 ('kepada', 'OTHER'),
 ('orang', 'OTHER'),
 ('ramai', 'OTHER'),
 ('yang', 'OTHER'),
 ('mahu', 'OTHER'),
 ('pulang', 'OTHER'),
 ('ke', 'OTHER'),
 ('kampung', 'OTHER'),
 ('halaman', 'location'),
 ('masing-masing', 'OTHER'),
 ('.', 'OTHER'),
 ('Dalam', 'OTHER'),
 ('video', 'OTHER'),
 ('pendek', 'OTHER'),
 ('terbitan', 'OTHER'),
 ('Jabatan', 'organization'),
 ('Keselamatan', 'organization'),
 ('Jalan', 'organization'),
 ('Raya', 'organization'),
 ('(', 'organization'),
 ('JKJR', 'organization'),
 (')', 'organization'),
 ('itu', 'OTHER'),
 (',', 'OTHER'),
 ('Dr', 'person'),
 ('Mahathir', 'person'),
 ('menasihati', 'OTHER'),
 ('mereka', 'OTHER'),
 ('supaya', 'OTHER'),
 ('berhenti', 'OTHER'),
 ('berehat', 'OTHER'),
 ('dan', 'OTHER'),
 ('tidur', 'OTHER'),
 ('sebentar', 'OTHER'),
 ('sekiranya', 'OTHER'),
 ('mengantuk', 'OTHER'),
 ('ketika', 'OTHER'),
 ('memandu', 'OTHER'),
 ('.', 'OTHER')]
[37]:
model.predict(string1)
[37]:
[('memperkenalkan', 'OTHER'),
 ('Husein', 'person'),
 (',', 'OTHER'),
 ('dia', 'OTHER'),
 ('sangat', 'OTHER'),
 ('comel', 'OTHER'),
 (',', 'OTHER'),
 ('berumur', 'OTHER'),
 ('25', 'OTHER'),
 ('tahun', 'OTHER'),
 (',', 'OTHER'),
 ('bangsa', 'OTHER'),
 ('melayu', 'person'),
 (',', 'OTHER'),
 ('agama', 'OTHER'),
 ('islam', 'person'),
 (',', 'OTHER'),
 ('tinggal', 'OTHER'),
 ('di', 'OTHER'),
 ('cyberjaya', 'location'),
 ('malaysia', 'location'),
 (',', 'OTHER'),
 ('bercakap', 'OTHER'),
 ('bahasa', 'OTHER'),
 ('melayu', 'person'),
 (',', 'OTHER'),
 ('semua', 'OTHER'),
 ('membaca', 'OTHER'),
 ('buku', 'OTHER'),
 ('undang-undang', 'OTHER'),
 ('kewangan', 'OTHER'),
 (',', 'OTHER'),
 ('dengar', 'OTHER'),
 ('laju', 'OTHER'),
 ('Siti', 'person'),
 ('Nurhaliza', 'person'),
 ('-', 'OTHER'),
 ('Seluruh', 'OTHER'),
 ('Cinta', 'OTHER'),
 ('sambil', 'OTHER'),
 ('makan', 'OTHER'),
 ('ayam', 'OTHER'),
 ('goreng', 'OTHER'),
 ('KFC', 'location')]
[11]:
quantized_model.predict(string)
[11]:
[('KUALA', 'location'),
 ('LUMPUR', 'location'),
 (':', 'OTHER'),
 ('Sempena', 'OTHER'),
 ('sambutan', 'OTHER'),
 ('Aidilfitri', 'event'),
 ('minggu', 'time'),
 ('depan', 'time'),
 (',', 'OTHER'),
 ('Perdana', 'person'),
 ('Menteri', 'person'),
 ('Tun', 'person'),
 ('Dr', 'person'),
 ('Mahathir', 'person'),
 ('Mohamad', 'person'),
 ('dan', 'OTHER'),
 ('Menteri', 'person'),
 ('Pengangkutan', 'person'),
 ('Anthony', 'person'),
 ('Loke', 'person'),
 ('Siew', 'person'),
 ('Fook', 'person'),
 ('menitipkan', 'OTHER'),
 ('pesanan', 'OTHER'),
 ('khas', 'OTHER'),
 ('kepada', 'OTHER'),
 ('orang', 'OTHER'),
 ('ramai', 'OTHER'),
 ('yang', 'OTHER'),
 ('mahu', 'OTHER'),
 ('pulang', 'OTHER'),
 ('ke', 'OTHER'),
 ('kampung', 'OTHER'),
 ('halaman', 'OTHER'),
 ('masing-masing', 'OTHER'),
 ('.', 'OTHER'),
 ('Dalam', 'OTHER'),
 ('video', 'OTHER'),
 ('pendek', 'OTHER'),
 ('terbitan', 'OTHER'),
 ('Jabatan', 'organization'),
 ('Keselamatan', 'organization'),
 ('Jalan', 'organization'),
 ('Raya', 'organization'),
 ('(', 'organization'),
 ('JKJR', 'organization'),
 (')', 'organization'),
 ('itu', 'OTHER'),
 (',', 'OTHER'),
 ('Dr', 'person'),
 ('Mahathir', 'person'),
 ('menasihati', 'OTHER'),
 ('mereka', 'OTHER'),
 ('supaya', 'OTHER'),
 ('berhenti', 'OTHER'),
 ('berehat', 'OTHER'),
 ('dan', 'OTHER'),
 ('tidur', 'OTHER'),
 ('sebentar', 'OTHER'),
 ('sekiranya', 'OTHER'),
 ('mengantuk', 'OTHER'),
 ('ketika', 'OTHER'),
 ('memandu', 'OTHER'),
 ('.', 'OTHER')]
[38]:
quantized_model.predict(string1)
[38]:
[('memperkenalkan', 'OTHER'),
 ('Husein', 'person'),
 (',', 'OTHER'),
 ('dia', 'OTHER'),
 ('sangat', 'OTHER'),
 ('comel', 'OTHER'),
 (',', 'OTHER'),
 ('berumur', 'OTHER'),
 ('25', 'OTHER'),
 ('tahun', 'OTHER'),
 (',', 'OTHER'),
 ('bangsa', 'OTHER'),
 ('melayu', 'person'),
 (',', 'OTHER'),
 ('agama', 'OTHER'),
 ('islam', 'person'),
 (',', 'OTHER'),
 ('tinggal', 'OTHER'),
 ('di', 'OTHER'),
 ('cyberjaya', 'location'),
 ('malaysia', 'location'),
 (',', 'OTHER'),
 ('bercakap', 'OTHER'),
 ('bahasa', 'OTHER'),
 ('melayu', 'person'),
 (',', 'OTHER'),
 ('semua', 'OTHER'),
 ('membaca', 'OTHER'),
 ('buku', 'OTHER'),
 ('undang-undang', 'OTHER'),
 ('kewangan', 'OTHER'),
 (',', 'OTHER'),
 ('dengar', 'OTHER'),
 ('laju', 'OTHER'),
 ('Siti', 'person'),
 ('Nurhaliza', 'person'),
 ('-', 'OTHER'),
 ('Seluruh', 'OTHER'),
 ('Cinta', 'OTHER'),
 ('sambil', 'OTHER'),
 ('makan', 'OTHER'),
 ('ayam', 'OTHER'),
 ('goreng', 'OTHER'),
 ('KFC', 'organization')]

Group similar tags

def analyze(self, string: str):
        """
        Analyze a string.

        Parameters
        ----------
        string : str

        Returns
        -------
        result: {'words': List[str], 'tags': [{'text': 'text', 'type': 'location', 'score': 1.0, 'beginOffset': 0, 'endOffset': 1}]}
        """
[13]:
model.analyze(string)
[13]:
[{'text': ['KUALA', 'LUMPUR'],
  'type': 'location',
  'score': 1.0,
  'beginOffset': 0,
  'endOffset': 2},
 {'text': [':', 'Sempena', 'sambutan'],
  'type': 'OTHER',
  'score': 1.0,
  'beginOffset': 2,
  'endOffset': 5},
 {'text': ['Aidilfitri'],
  'type': 'event',
  'score': 1.0,
  'beginOffset': 5,
  'endOffset': 6},
 {'text': ['minggu'],
  'type': 'OTHER',
  'score': 1.0,
  'beginOffset': 6,
  'endOffset': 7},
 {'text': ['depan'],
  'type': 'time',
  'score': 1.0,
  'beginOffset': 7,
  'endOffset': 8},
 {'text': [','],
  'type': 'OTHER',
  'score': 1.0,
  'beginOffset': 8,
  'endOffset': 9},
 {'text': ['Perdana', 'Menteri', 'Tun', 'Dr', 'Mahathir', 'Mohamad'],
  'type': 'person',
  'score': 1.0,
  'beginOffset': 9,
  'endOffset': 15},
 {'text': ['dan'],
  'type': 'OTHER',
  'score': 1.0,
  'beginOffset': 15,
  'endOffset': 16},
 {'text': ['Menteri', 'Pengangkutan'],
  'type': 'organization',
  'score': 1.0,
  'beginOffset': 16,
  'endOffset': 18},
 {'text': ['Anthony', 'Loke', 'Siew', 'Fook'],
  'type': 'person',
  'score': 1.0,
  'beginOffset': 18,
  'endOffset': 22},
 {'text': ['menitipkan',
   'pesanan',
   'khas',
   'kepada',
   'orang',
   'ramai',
   'yang',
   'mahu',
   'pulang',
   'ke',
   'kampung',
   'halaman',
   'masing-masing',
   '.',
   'Dalam',
   'video',
   'pendek',
   'terbitan'],
  'type': 'OTHER',
  'score': 1.0,
  'beginOffset': 22,
  'endOffset': 40},
 {'text': ['Jabatan', 'Keselamatan', 'Jalan', 'Raya', '(', 'JKJR', ')'],
  'type': 'organization',
  'score': 1.0,
  'beginOffset': 40,
  'endOffset': 47},
 {'text': ['itu', ','],
  'type': 'OTHER',
  'score': 1.0,
  'beginOffset': 47,
  'endOffset': 49},
 {'text': ['Dr', 'Mahathir'],
  'type': 'person',
  'score': 1.0,
  'beginOffset': 49,
  'endOffset': 51},
 {'text': ['menasihati',
   'mereka',
   'supaya',
   'berhenti',
   'berehat',
   'dan',
   'tidur',
   'sebentar',
   'sekiranya',
   'mengantuk',
   'ketika',
   'memandu',
   '.'],
  'type': 'OTHER',
  'score': 1.0,
  'beginOffset': 51,
  'endOffset': 64}]
[39]:
model.analyze(string1)
[39]:
[{'text': ['memperkenalkan'],
  'type': 'OTHER',
  'score': 1.0,
  'beginOffset': 0,
  'endOffset': 1},
 {'text': ['Husein'],
  'type': 'person',
  'score': 1.0,
  'beginOffset': 1,
  'endOffset': 2},
 {'text': [',',
   'dia',
   'sangat',
   'comel',
   ',',
   'berumur',
   '25',
   'tahun',
   ',',
   'bangsa'],
  'type': 'OTHER',
  'score': 1.0,
  'beginOffset': 2,
  'endOffset': 12},
 {'text': ['melayu'],
  'type': 'person',
  'score': 1.0,
  'beginOffset': 12,
  'endOffset': 13},
 {'text': [',', 'agama'],
  'type': 'OTHER',
  'score': 1.0,
  'beginOffset': 13,
  'endOffset': 15},
 {'text': ['islam'],
  'type': 'person',
  'score': 1.0,
  'beginOffset': 15,
  'endOffset': 16},
 {'text': [',', 'tinggal', 'di'],
  'type': 'OTHER',
  'score': 1.0,
  'beginOffset': 16,
  'endOffset': 19},
 {'text': ['cyberjaya', 'malaysia'],
  'type': 'location',
  'score': 1.0,
  'beginOffset': 19,
  'endOffset': 21},
 {'text': [',', 'bercakap', 'bahasa'],
  'type': 'OTHER',
  'score': 1.0,
  'beginOffset': 21,
  'endOffset': 24},
 {'text': ['melayu'],
  'type': 'person',
  'score': 1.0,
  'beginOffset': 24,
  'endOffset': 25},
 {'text': [',',
   'semua',
   'membaca',
   'buku',
   'undang-undang',
   'kewangan',
   ',',
   'dengar',
   'laju'],
  'type': 'OTHER',
  'score': 1.0,
  'beginOffset': 25,
  'endOffset': 34},
 {'text': ['Siti', 'Nurhaliza'],
  'type': 'person',
  'score': 1.0,
  'beginOffset': 34,
  'endOffset': 36},
 {'text': ['-', 'Seluruh', 'Cinta', 'sambil', 'makan', 'ayam', 'goreng'],
  'type': 'OTHER',
  'score': 1.0,
  'beginOffset': 36,
  'endOffset': 43},
 {'text': ['KFC'],
  'type': 'organization',
  'score': 1.0,
  'beginOffset': 43,
  'endOffset': 44}]

Vectorize

Let say you want to visualize word level in lower dimension, you can use model.vectorize,

def vectorize(self, string: str):
    """
    vectorize a string.

    Parameters
    ----------
    string: List[str]

    Returns
    -------
    result: np.array
    """
[15]:
strings = [string,
          'Husein baca buku Perlembagaan yang berharga 3k ringgit dekat kfc sungai petani minggu lepas, 2 ptg 2 oktober 2019 , suhu 32 celcius, sambil makan ayam goreng dan milo o ais',
          'contact Husein at husein.zol05@gmail.com',
          'tolong tempahkan meja makan makan nasi dagang dan jus apple, milo tarik esok dekat Restoran Sebulek']
[16]:
r = [quantized_model.vectorize(string) for string in strings]
[17]:
x, y = [], []
for row in r:
    x.extend([i[0] for i in row])
    y.extend([i[1] for i in row])
[18]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

tsne = TSNE().fit_transform(y)
tsne.shape
[18]:
(124, 2)
[19]:
plt.figure(figsize = (7, 7))
plt.scatter(tsne[:, 0], tsne[:, 1])
labels = x
for label, x, y in zip(
    labels, tsne[:, 0], tsne[:, 1]
):
    label = (
        '%s, %.3f' % (label[0], label[1])
        if isinstance(label, list)
        else label
    )
    plt.annotate(
        label,
        xy = (x, y),
        xytext = (0, 0),
        textcoords = 'offset points',
    )
_images/load-entities_32_0.png

Pretty good, the model able to know cluster similar entities.

Load Transformer Ontonotes 5 model

def transformer_ontonotes5(
    model: str = 'xlnet', quantized: bool = False, **kwargs
):
    """
    Load Transformer Entity Tagging model trained on Ontonotes 5 Bahasa, transfer learning Transformer + CRF.

    Parameters
    ----------
    model : str, optional (default='bert')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result : malaya.supervised.tag.transformer function
    """
[20]:
albert = malaya.entity.transformer_ontonotes5(model = 'albert')
INFO:root:running entity-ontonotes5/albert using device /device:CPU:0
[21]:
alxlnet = malaya.entity.transformer_ontonotes5(model = 'alxlnet')
INFO:root:running entity-ontonotes5/alxlnet using device /device:CPU:0

Load Quantized model

To load 8-bit quantized model, simply pass quantized = True, default is False.

We can expect slightly accuracy drop from quantized model, and not necessary faster than normal 32-bit float model, totally depends on machine.

[22]:
quantized_albert = malaya.entity.transformer_ontonotes5(model = 'albert', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
INFO:root:running entity-ontonotes5/albert-quantized using device /device:CPU:0
[23]:
quantized_alxlnet = malaya.entity.transformer_ontonotes5(model = 'alxlnet', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
INFO:root:running entity-ontonotes5/alxlnet-quantized using device /device:CPU:0

Predict

def predict(self, string: str):
    """
    Tag a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: Tuple[str, str]
    """
[24]:
albert.predict(string)
[24]:
[('KUALA', 'GPE'),
 ('LUMPUR', 'GPE'),
 (':', 'OTHER'),
 ('Sempena', 'OTHER'),
 ('sambutan', 'OTHER'),
 ('Aidilfitri', 'DATE'),
 ('minggu', 'OTHER'),
 ('depan', 'OTHER'),
 (',', 'OTHER'),
 ('Perdana', 'OTHER'),
 ('Menteri', 'OTHER'),
 ('Tun', 'PERSON'),
 ('Dr', 'PERSON'),
 ('Mahathir', 'PERSON'),
 ('Mohamad', 'PERSON'),
 ('dan', 'OTHER'),
 ('Menteri', 'OTHER'),
 ('Pengangkutan', 'OTHER'),
 ('Anthony', 'PERSON'),
 ('Loke', 'PERSON'),
 ('Siew', 'PERSON'),
 ('Fook', 'PERSON'),
 ('menitipkan', 'OTHER'),
 ('pesanan', 'OTHER'),
 ('khas', 'OTHER'),
 ('kepada', 'OTHER'),
 ('orang', 'OTHER'),
 ('ramai', 'OTHER'),
 ('yang', 'OTHER'),
 ('mahu', 'OTHER'),
 ('pulang', 'OTHER'),
 ('ke', 'OTHER'),
 ('kampung', 'OTHER'),
 ('halaman', 'OTHER'),
 ('masing-masing', 'OTHER'),
 ('.', 'OTHER'),
 ('Dalam', 'OTHER'),
 ('video', 'OTHER'),
 ('pendek', 'OTHER'),
 ('terbitan', 'OTHER'),
 ('Jabatan', 'ORG'),
 ('Keselamatan', 'ORG'),
 ('Jalan', 'ORG'),
 ('Raya', 'ORG'),
 ('(', 'ORG'),
 ('JKJR', 'ORG'),
 (')', 'ORG'),
 ('itu', 'OTHER'),
 (',', 'OTHER'),
 ('Dr', 'PERSON'),
 ('Mahathir', 'PERSON'),
 ('menasihati', 'OTHER'),
 ('mereka', 'OTHER'),
 ('supaya', 'OTHER'),
 ('berhenti', 'OTHER'),
 ('berehat', 'OTHER'),
 ('dan', 'OTHER'),
 ('tidur', 'OTHER'),
 ('sebentar', 'OTHER'),
 ('sekiranya', 'OTHER'),
 ('mengantuk', 'OTHER'),
 ('ketika', 'OTHER'),
 ('memandu', 'OTHER'),
 ('.', 'OTHER')]
[25]:
alxlnet.predict(string)
[25]:
[('KUALA', 'EVENT'),
 ('LUMPUR', 'EVENT'),
 (':', 'OTHER'),
 ('Sempena', 'OTHER'),
 ('sambutan', 'DATE'),
 ('Aidilfitri', 'DATE'),
 ('minggu', 'DATE'),
 ('depan', 'DATE'),
 (',', 'OTHER'),
 ('Perdana', 'OTHER'),
 ('Menteri', 'OTHER'),
 ('Tun', 'PERSON'),
 ('Dr', 'PERSON'),
 ('Mahathir', 'PERSON'),
 ('Mohamad', 'PERSON'),
 ('dan', 'OTHER'),
 ('Menteri', 'OTHER'),
 ('Pengangkutan', 'OTHER'),
 ('Anthony', 'PERSON'),
 ('Loke', 'PERSON'),
 ('Siew', 'PERSON'),
 ('Fook', 'PERSON'),
 ('menitipkan', 'OTHER'),
 ('pesanan', 'OTHER'),
 ('khas', 'OTHER'),
 ('kepada', 'OTHER'),
 ('orang', 'OTHER'),
 ('ramai', 'OTHER'),
 ('yang', 'OTHER'),
 ('mahu', 'OTHER'),
 ('pulang', 'OTHER'),
 ('ke', 'OTHER'),
 ('kampung', 'OTHER'),
 ('halaman', 'OTHER'),
 ('masing-masing', 'OTHER'),
 ('.', 'OTHER'),
 ('Dalam', 'OTHER'),
 ('video', 'OTHER'),
 ('pendek', 'OTHER'),
 ('terbitan', 'OTHER'),
 ('Jabatan', 'ORG'),
 ('Keselamatan', 'ORG'),
 ('Jalan', 'ORG'),
 ('Raya', 'ORG'),
 ('(', 'ORG'),
 ('JKJR', 'ORG'),
 (')', 'ORG'),
 ('itu', 'OTHER'),
 (',', 'OTHER'),
 ('Dr', 'OTHER'),
 ('Mahathir', 'PERSON'),
 ('menasihati', 'OTHER'),
 ('mereka', 'OTHER'),
 ('supaya', 'OTHER'),
 ('berhenti', 'OTHER'),
 ('berehat', 'OTHER'),
 ('dan', 'OTHER'),
 ('tidur', 'OTHER'),
 ('sebentar', 'OTHER'),
 ('sekiranya', 'OTHER'),
 ('mengantuk', 'OTHER'),
 ('ketika', 'OTHER'),
 ('memandu', 'OTHER'),
 ('.', 'OTHER')]
[40]:
albert.predict(string1)
[40]:
[('memperkenalkan', 'OTHER'),
 ('Husein', 'PERSON'),
 (',', 'OTHER'),
 ('dia', 'OTHER'),
 ('sangat', 'OTHER'),
 ('comel', 'OTHER'),
 (',', 'OTHER'),
 ('berumur', 'DATE'),
 ('25', 'DATE'),
 ('tahun', 'DATE'),
 (',', 'OTHER'),
 ('bangsa', 'OTHER'),
 ('melayu', 'OTHER'),
 (',', 'OTHER'),
 ('agama', 'OTHER'),
 ('islam', 'OTHER'),
 (',', 'OTHER'),
 ('tinggal', 'OTHER'),
 ('di', 'OTHER'),
 ('cyberjaya', 'GPE'),
 ('malaysia', 'GPE'),
 (',', 'OTHER'),
 ('bercakap', 'OTHER'),
 ('bahasa', 'OTHER'),
 ('melayu', 'OTHER'),
 (',', 'OTHER'),
 ('semua', 'OTHER'),
 ('membaca', 'OTHER'),
 ('buku', 'OTHER'),
 ('undang-undang', 'OTHER'),
 ('kewangan', 'OTHER'),
 (',', 'OTHER'),
 ('dengar', 'OTHER'),
 ('laju', 'OTHER'),
 ('Siti', 'WORK_OF_ART'),
 ('Nurhaliza', 'WORK_OF_ART'),
 ('-', 'WORK_OF_ART'),
 ('Seluruh', 'WORK_OF_ART'),
 ('Cinta', 'WORK_OF_ART'),
 ('sambil', 'OTHER'),
 ('makan', 'OTHER'),
 ('ayam', 'OTHER'),
 ('goreng', 'OTHER'),
 ('KFC', 'ORG')]
[41]:
alxlnet.predict(string1)
[41]:
[('memperkenalkan', 'OTHER'),
 ('Husein', 'PERSON'),
 (',', 'OTHER'),
 ('dia', 'OTHER'),
 ('sangat', 'OTHER'),
 ('comel', 'OTHER'),
 (',', 'OTHER'),
 ('berumur', 'OTHER'),
 ('25', 'DATE'),
 ('tahun', 'DATE'),
 (',', 'OTHER'),
 ('bangsa', 'OTHER'),
 ('melayu', 'OTHER'),
 (',', 'OTHER'),
 ('agama', 'OTHER'),
 ('islam', 'NORP'),
 (',', 'OTHER'),
 ('tinggal', 'OTHER'),
 ('di', 'OTHER'),
 ('cyberjaya', 'GPE'),
 ('malaysia', 'GPE'),
 (',', 'OTHER'),
 ('bercakap', 'OTHER'),
 ('bahasa', 'LANGUAGE'),
 ('melayu', 'LANGUAGE'),
 (',', 'OTHER'),
 ('semua', 'OTHER'),
 ('membaca', 'OTHER'),
 ('buku', 'OTHER'),
 ('undang-undang', 'OTHER'),
 ('kewangan', 'OTHER'),
 (',', 'OTHER'),
 ('dengar', 'OTHER'),
 ('laju', 'OTHER'),
 ('Siti', 'WORK_OF_ART'),
 ('Nurhaliza', 'WORK_OF_ART'),
 ('-', 'WORK_OF_ART'),
 ('Seluruh', 'WORK_OF_ART'),
 ('Cinta', 'WORK_OF_ART'),
 ('sambil', 'OTHER'),
 ('makan', 'OTHER'),
 ('ayam', 'OTHER'),
 ('goreng', 'OTHER'),
 ('KFC', 'OTHER')]
[28]:
quantized_albert.predict(string)
[28]:
[('KUALA', 'GPE'),
 ('LUMPUR', 'GPE'),
 (':', 'OTHER'),
 ('Sempena', 'OTHER'),
 ('sambutan', 'OTHER'),
 ('Aidilfitri', 'DATE'),
 ('minggu', 'OTHER'),
 ('depan', 'OTHER'),
 (',', 'OTHER'),
 ('Perdana', 'OTHER'),
 ('Menteri', 'OTHER'),
 ('Tun', 'PERSON'),
 ('Dr', 'PERSON'),
 ('Mahathir', 'PERSON'),
 ('Mohamad', 'PERSON'),
 ('dan', 'OTHER'),
 ('Menteri', 'OTHER'),
 ('Pengangkutan', 'OTHER'),
 ('Anthony', 'PERSON'),
 ('Loke', 'PERSON'),
 ('Siew', 'PERSON'),
 ('Fook', 'PERSON'),
 ('menitipkan', 'OTHER'),
 ('pesanan', 'OTHER'),
 ('khas', 'OTHER'),
 ('kepada', 'OTHER'),
 ('orang', 'OTHER'),
 ('ramai', 'OTHER'),
 ('yang', 'OTHER'),
 ('mahu', 'OTHER'),
 ('pulang', 'OTHER'),
 ('ke', 'OTHER'),
 ('kampung', 'OTHER'),
 ('halaman', 'OTHER'),
 ('masing-masing', 'OTHER'),
 ('.', 'OTHER'),
 ('Dalam', 'OTHER'),
 ('video', 'OTHER'),
 ('pendek', 'OTHER'),
 ('terbitan', 'OTHER'),
 ('Jabatan', 'ORG'),
 ('Keselamatan', 'ORG'),
 ('Jalan', 'ORG'),
 ('Raya', 'ORG'),
 ('(', 'ORG'),
 ('JKJR', 'ORG'),
 (')', 'ORG'),
 ('itu', 'OTHER'),
 (',', 'OTHER'),
 ('Dr', 'PERSON'),
 ('Mahathir', 'PERSON'),
 ('menasihati', 'OTHER'),
 ('mereka', 'OTHER'),
 ('supaya', 'OTHER'),
 ('berhenti', 'OTHER'),
 ('berehat', 'OTHER'),
 ('dan', 'OTHER'),
 ('tidur', 'OTHER'),
 ('sebentar', 'OTHER'),
 ('sekiranya', 'OTHER'),
 ('mengantuk', 'OTHER'),
 ('ketika', 'OTHER'),
 ('memandu', 'OTHER'),
 ('.', 'OTHER')]
[42]:
quantized_alxlnet.predict(string1)
[42]:
[('memperkenalkan', 'OTHER'),
 ('Husein', 'PERSON'),
 (',', 'OTHER'),
 ('dia', 'OTHER'),
 ('sangat', 'OTHER'),
 ('comel', 'OTHER'),
 (',', 'OTHER'),
 ('berumur', 'DATE'),
 ('25', 'DATE'),
 ('tahun', 'DATE'),
 (',', 'OTHER'),
 ('bangsa', 'OTHER'),
 ('melayu', 'OTHER'),
 (',', 'OTHER'),
 ('agama', 'OTHER'),
 ('islam', 'OTHER'),
 (',', 'OTHER'),
 ('tinggal', 'OTHER'),
 ('di', 'OTHER'),
 ('cyberjaya', 'GPE'),
 ('malaysia', 'GPE'),
 (',', 'OTHER'),
 ('bercakap', 'OTHER'),
 ('bahasa', 'OTHER'),
 ('melayu', 'OTHER'),
 (',', 'OTHER'),
 ('semua', 'OTHER'),
 ('membaca', 'OTHER'),
 ('buku', 'OTHER'),
 ('undang-undang', 'OTHER'),
 ('kewangan', 'OTHER'),
 (',', 'OTHER'),
 ('dengar', 'OTHER'),
 ('laju', 'OTHER'),
 ('Siti', 'WORK_OF_ART'),
 ('Nurhaliza', 'WORK_OF_ART'),
 ('-', 'X'),
 ('Seluruh', 'WORK_OF_ART'),
 ('Cinta', 'WORK_OF_ART'),
 ('sambil', 'OTHER'),
 ('makan', 'OTHER'),
 ('ayam', 'OTHER'),
 ('goreng', 'OTHER'),
 ('KFC', 'OTHER')]

Group similar tags

def analyze(self, string: str):
        """
        Analyze a string.

        Parameters
        ----------
        string : str

        Returns
        -------
        result: {'words': List[str], 'tags': [{'text': 'text', 'type': 'location', 'score': 1.0, 'beginOffset': 0, 'endOffset': 1}]}
        """
[30]:
alxlnet.analyze(string1)
[30]:
[{'text': ['memperkenalkan', 'Husein', ',', 'dia', 'sangat', 'comel', ','],
  'type': 'OTHER',
  'score': 1.0,
  'beginOffset': 0,
  'endOffset': 7},
 {'text': ['berumur', '25', 'tahun'],
  'type': 'DATE',
  'score': 1.0,
  'beginOffset': 7,
  'endOffset': 10},
 {'text': [',', 'bangsa', 'melayu', ',', 'agama'],
  'type': 'OTHER',
  'score': 1.0,
  'beginOffset': 10,
  'endOffset': 15},
 {'text': ['islam'],
  'type': 'NORP',
  'score': 1.0,
  'beginOffset': 15,
  'endOffset': 16},
 {'text': [',', 'tinggal', 'di'],
  'type': 'OTHER',
  'score': 1.0,
  'beginOffset': 16,
  'endOffset': 19},
 {'text': ['cyberjaya'],
  'type': 'GPE',
  'score': 1.0,
  'beginOffset': 19,
  'endOffset': 20},
 {'text': ['malaysia',
   ',',
   'bercakap',
   'bahasa',
   'melayu',
   ',',
   'semua',
   'membaca',
   'buku',
   'undang-undang',
   'kewangan',
   ',',
   'dengar',
   'laju'],
  'type': 'OTHER',
  'score': 1.0,
  'beginOffset': 20,
  'endOffset': 34},
 {'text': ['Justin', 'Bieber'],
  'type': 'ORG',
  'score': 1.0,
  'beginOffset': 34,
  'endOffset': 36},
 {'text': ['-', 'Baby'],
  'type': 'X',
  'score': 1.0,
  'beginOffset': 36,
  'endOffset': 38},
 {'text': ['sambil', 'makan', 'ayam', 'goreng'],
  'type': 'OTHER',
  'score': 1.0,
  'beginOffset': 38,
  'endOffset': 42},
 {'text': ['KFC'],
  'type': 'ORG',
  'score': 1.0,
  'beginOffset': 42,
  'endOffset': 43}]

Vectorize

Let say you want to visualize word level in lower dimension, you can use model.vectorize,

def vectorize(self, string: str):
    """
    vectorize a string.

    Parameters
    ----------
    string: List[str]

    Returns
    -------
    result: np.array
    """
[31]:
strings = [string, string1]
r = [quantized_model.vectorize(string) for string in strings]
[32]:
x, y = [], []
for row in r:
    x.extend([i[0] for i in row])
    y.extend([i[1] for i in row])
[33]:
tsne = TSNE().fit_transform(y)
tsne.shape
[33]:
(107, 2)
[48]:
plt.figure(figsize = (7, 7))
plt.scatter(tsne[:, 0], tsne[:, 1])
labels = x
for label, x, y in zip(
    labels, tsne[:, 0], tsne[:, 1]
):
    label = (
        '%s, %.3f' % (label[0], label[1])
        if isinstance(label, list)
        else label
    )
    plt.annotate(
        label,
        xy = (x, y),
        xytext = (0, 0),
        textcoords = 'offset points',
    )
_images/load-entities_53_0.png

Pretty good, the model able to know cluster similar entities.

Load general Malaya entity model

This model able to classify,

  1. date

  2. money

  3. temperature

  4. distance

  5. volume

  6. duration

  7. phone

  8. email

  9. url

  10. time

  11. datetime

  12. local and generic foods, can check available rules in malaya.texts._food

  13. local and generic drinks, can check available rules in malaya.texts._food

We can insert BERT or any deep learning model by passing malaya.entity.general_entity(model = model), as long the model has predict method and return [(string, label), (string, label)]. This is an optional.

[32]:
entity = malaya.entity.general_entity(model = model)
[33]:
entity.predict('Husein baca buku Perlembagaan yang berharga 3k ringgit dekat kfc sungai petani minggu lepas, 2 ptg 2 oktober 2019 , suhu 32 celcius, sambil makan ayam goreng dan milo o ais')
[33]:
{'PERSON': ['Husein'],
 'OTHER': ['baca buku Perlembagaan yang berharga',
  'ringgit dekat kfc sungai petani',
  ', suhu 32 celcius, sambil makan ayam goreng dan milo o ais'],
 'CARDINAL': ['3k'],
 'DATE': ['minggu lepas,', '2019'],
 'TIME': ['2 ptg'],
 'MONEY': ['2 oktober'],
 'date': {'2 oktober 2019': datetime.datetime(2019, 10, 2, 0, 0),
  'minggu lalu': datetime.datetime(2021, 2, 11, 13, 27, 58, 82807)},
 'money': {'3k ringgit': 'RM3000.0'},
 'temperature': ['32 celcius'],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'time': {'2 PM': datetime.datetime(2021, 2, 18, 14, 0)},
 'datetime': {'2 ptg 2 oktober 2019': datetime.datetime(2019, 10, 2, 14, 0)},
 'food': ['ayam goreng'],
 'drink': ['milo o ais'],
 'weight': []}
[34]:
entity.predict('contact Husein at husein.zol05@gmail.com')
[34]:
{'OTHER': ['contact Husein at husein.zol05@gmail.com'],
 'date': {},
 'money': {},
 'temperature': [],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': ['husein.zol05@gmail.com'],
 'url': [],
 'time': {},
 'datetime': {},
 'food': [],
 'drink': [],
 'weight': []}
[35]:
entity.predict('tolong tempahkan meja makan makan nasi dagang dan jus apple, milo tarik esok dekat Restoran Sebulek')
[35]:
{'OTHER': ['tolong tempahkan meja makan makan nasi dagang dan jus apple, milo tarik',
  'dekat'],
 'DATE': ['esok'],
 'ORG': ['Restoran Sebulek'],
 'date': {'esok': datetime.datetime(2021, 2, 19, 13, 27, 58, 505853)},
 'money': {},
 'temperature': [],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'time': {},
 'datetime': {},
 'food': ['nasi dagang'],
 'drink': ['milo tarik', 'jus apple'],
 'weight': []}

Voting stack model

[43]:
malaya.stack.voting_stack([albert, alxlnet, alxlnet], string1)
[43]:
[('memperkenalkan', 'OTHER'),
 ('Husein', 'PERSON'),
 (',', 'OTHER'),
 ('dia', 'OTHER'),
 ('sangat', 'OTHER'),
 ('comel', 'OTHER'),
 (',', 'OTHER'),
 ('berumur', 'DATE'),
 ('25', 'DATE'),
 ('tahun', 'DATE'),
 (',', 'OTHER'),
 ('bangsa', 'OTHER'),
 ('melayu', 'OTHER'),
 (',', 'OTHER'),
 ('agama', 'OTHER'),
 ('islam', 'OTHER'),
 (',', 'OTHER'),
 ('tinggal', 'OTHER'),
 ('di', 'OTHER'),
 ('cyberjaya', 'GPE'),
 ('malaysia', 'GPE'),
 (',', 'OTHER'),
 ('bercakap', 'OTHER'),
 ('bahasa', 'OTHER'),
 ('melayu', 'LANGUAGE'),
 (',', 'OTHER'),
 ('semua', 'OTHER'),
 ('membaca', 'OTHER'),
 ('buku', 'OTHER'),
 ('undang-undang', 'OTHER'),
 ('kewangan', 'OTHER'),
 (',', 'OTHER'),
 ('dengar', 'OTHER'),
 ('laju', 'OTHER'),
 ('Siti', 'WORK_OF_ART'),
 ('Nurhaliza', 'WORK_OF_ART'),
 ('-', 'WORK_OF_ART'),
 ('Seluruh', 'WORK_OF_ART'),
 ('Cinta', 'WORK_OF_ART'),
 ('sambil', 'OTHER'),
 ('makan', 'OTHER'),
 ('ayam', 'OTHER'),
 ('goreng', 'OTHER'),
 ('KFC', 'ORG')]