Entities Recognition#
This tutorial is available as an IPython notebook at Malaya/example/entities.
This module was trained only on standard language structure, so it is not safe to use on local language structure.
[1]:
import logging
logging.basicConfig(level=logging.INFO)
[2]:
%%time
import malaya
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
CPU times: user 5.75 s, sys: 1.08 s, total: 6.83 s
Wall time: 8.19 s
Describe supported entities#
[3]:
import pandas as pd
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated in recent pandas
malaya.entity.describe()
[3]:
Index | Tag | Description |
---|---|---|
0 | OTHER | other |
1 | law | law, regulation, related law documents, documents, etc |
2 | location | location, place |
3 | organization | organization, company, government, facilities, etc |
4 | person | person, group of people, believes, unique arts (eg; food, drink), etc |
5 | quantity | numbers, quantity |
6 | time | date, day, time, etc |
7 | event | unique event happened, etc |
Describe supported Ontonotes 5 entities#
[4]:
malaya.entity.describe_ontonotes5()
[4]:
Index | Tag | Description |
---|---|---|
0 | OTHER | other |
1 | ADDRESS | Address of physical location. |
2 | PERSON | People, including fictional. |
3 | NORP | Nationalities or religious or political groups. |
4 | FAC | Buildings, airports, highways, bridges, etc. |
5 | ORG | Companies, agencies, institutions, etc. |
6 | GPE | Countries, cities, states. |
7 | LOC | Non-GPE locations, mountain ranges, bodies of water. |
8 | PRODUCT | Objects, vehicles, foods, etc. (Not services.) |
9 | EVENT | Named hurricanes, battles, wars, sports events, etc. |
10 | WORK_OF_ART | Titles of books, songs, etc. |
11 | LAW | Named documents made into laws. |
12 | LANGUAGE | Any named language. |
13 | DATE | Absolute or relative dates or periods. |
14 | TIME | Times smaller than a day. |
15 | PERCENT | Percentage, including "%". |
16 | MONEY | Monetary values, including unit. |
17 | QUANTITY | Measurements, as of weight or distance. |
18 | ORDINAL | "first", "second", etc. |
19 | CARDINAL | Numerals that do not fall under another type. |
List available Transformer NER models#
[5]:
malaya.entity.available_transformer()
INFO:malaya.entity:test set at https://github.com/huseinzol05/Malay-Dataset/tree/master/tagging/entities
[5]:
Model | Size (MB) | Quantized Size (MB) | macro precision | macro recall | macro f1-score |
---|---|---|---|---|---|
bert | 425.4 | 111.00 | 0.99291 | 0.97864 | 0.98537 |
tiny-bert | 57.7 | 15.40 | 0.98151 | 0.94754 | 0.96134 |
albert | 48.6 | 12.80 | 0.98026 | 0.95332 | 0.96492 |
tiny-albert | 22.4 | 5.98 | 0.96100 | 0.90363 | 0.92374 |
xlnet | 446.6 | 118.00 | 0.99344 | 0.98154 | 0.98725 |
alxlnet | 46.8 | 13.30 | 0.99215 | 0.97575 | 0.98337 |
fastformer | 446.6 | 113.00 | 0.95031 | 0.94018 | 0.94498 |
tiny-fastformer | 77.3 | 19.70 | 0.93574 | 0.89979 | 0.91640 |
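The table above is a size versus accuracy trade-off, so one way to read it is to pick the best macro f1-score that fits a size budget. The sketch below copies the sizes and f1-scores from the table; the helper name `best_under_budget` and the 50 MB budget are illustrative assumptions, not part of Malaya.

```python
# (size in MB, macro f1-score), copied from the table above.
MODELS = {
    'bert': (425.4, 0.98537),
    'tiny-bert': (57.7, 0.96134),
    'albert': (48.6, 0.96492),
    'tiny-albert': (22.4, 0.92374),
    'xlnet': (446.6, 0.98725),
    'alxlnet': (46.8, 0.98337),
    'fastformer': (446.6, 0.94498),
    'tiny-fastformer': (77.3, 0.91640),
}


def best_under_budget(models, max_size_mb):
    """Return the model name with the highest f1 whose size fits, or None."""
    candidates = {
        name: f1 for name, (size, f1) in models.items() if size <= max_size_mb
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


print(best_under_budget(MODELS, 50))  # alxlnet
```

With a 50 MB budget this selects alxlnet, which is the model loaded in the cells that follow.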
List available Transformer NER Ontonotes 5 models#
[6]:
malaya.entity.available_transformer_ontonotes5()
INFO:malaya.entity:test set at https://github.com/huseinzol05/malay-dataset/tree/master/tagging/entities-OntoNotes5
[6]:
Model | Size (MB) | Quantized Size (MB) | macro precision | macro recall | macro f1-score |
---|---|---|---|---|---|
bert | 425.4 | 111.00 | 0.94460 | 0.93244 | 0.93822 |
tiny-bert | 57.7 | 15.40 | 0.91908 | 0.91635 | 0.91704 |
albert | 48.6 | 12.80 | 0.93010 | 0.92341 | 0.92636 |
tiny-albert | 22.4 | 5.98 | 0.90298 | 0.88251 | 0.89145 |
xlnet | 446.6 | 118.00 | 0.93814 | 0.95021 | 0.94388 |
alxlnet | 46.8 | 13.30 | 0.93244 | 0.92942 | 0.93047 |
fastformer | 446.6 | 113.00 | 0.77486 | 0.67007 | 0.69065 |
tiny-fastformer | 77.3 | 19.70 | 0.68351 | 0.60469 | 0.61678 |
[36]:
string = 'KUALA LUMPUR: Sempena sambutan Aidilfitri minggu depan, Perdana Menteri Tun Dr Mahathir Mohamad dan Menteri Pengangkutan Anthony Loke Siew Fook menitipkan pesanan khas kepada orang ramai yang mahu pulang ke kampung halaman masing-masing. Dalam video pendek terbitan Jabatan Keselamatan Jalan Raya (JKJR) itu, Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'
string1 = 'memperkenalkan Husein, dia sangat comel, berumur 25 tahun, bangsa melayu, agama islam, tinggal di cyberjaya malaysia, bercakap bahasa melayu, semua membaca buku undang-undang kewangan, dengar laju Siti Nurhaliza - Seluruh Cinta sambil makan ayam goreng KFC'
Load Transformer model#
def transformer(model: str = 'xlnet', quantized: bool = False, **kwargs):
    """
    Load Transformer Entity Tagging model trained on Malaya Entity, transfer learning Transformer + CRF.

    Parameters
    ----------
    model : str, optional (default='xlnet')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.
        * ``'fastformer'`` - FastFormer BASE parameters.
        * ``'tiny-fastformer'`` - FastFormer TINY parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result: model
        List of model classes:

        * if `bert` in model, will return `malaya.model.bert.TaggingBERT`.
        * if `xlnet` in model, will return `malaya.model.xlnet.TaggingXLNET`.
        * if `fastformer` in model, will return `malaya.model.fastformer.TaggingFastFormer`.
    """
[7]:
model = malaya.entity.transformer(model = 'alxlnet')
INFO:root:running entity/alxlnet using device /device:CPU:0
Load Quantized model#
To load an 8-bit quantized model, simply pass quantized = True; the default is False.
Expect a slight accuracy drop from the quantized model. It is also not necessarily faster than the normal 32-bit float model; that depends entirely on the machine.
[8]:
quantized_model = malaya.entity.transformer(model = 'alxlnet', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
INFO:root:running entity/alxlnet-quantized using device /device:CPU:0
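Because the speed of a quantized model depends on the machine, it is worth measuring rather than assuming. The helper below is a minimal sketch using only the standard library; `average_seconds` is a hypothetical name, and the commented calls at the end assume the `model`, `quantized_model` and `string` objects defined in this tutorial.

```python
import time


def average_seconds(fn, *args, repeat=5):
    """Call fn(*args) `repeat` times and return the mean wall-clock seconds."""
    fn(*args)  # warm-up call, excluded from the measurement
    start = time.perf_counter()
    for _ in range(repeat):
        fn(*args)
    return (time.perf_counter() - start) / repeat


# Intended usage with the objects from this tutorial (not executed here):
# average_seconds(model.predict, string)
# average_seconds(quantized_model.predict, string)
```

Comparing the two numbers on the target machine shows whether quantization actually helps there.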
Predict#
def predict(self, string: str):
    """
    Tag a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: List[Tuple[str, str]]
    """
[9]:
model.predict(string)
[9]:
[('KUALA', 'location'),
('LUMPUR', 'location'),
(':', 'OTHER'),
('Sempena', 'OTHER'),
('sambutan', 'OTHER'),
('Aidilfitri', 'event'),
('minggu', 'time'),
('depan', 'time'),
(',', 'OTHER'),
('Perdana', 'person'),
('Menteri', 'person'),
('Tun', 'person'),
('Dr', 'person'),
('Mahathir', 'person'),
('Mohamad', 'person'),
('dan', 'OTHER'),
('Menteri', 'organization'),
('Pengangkutan', 'organization'),
('Anthony', 'person'),
('Loke', 'person'),
('Siew', 'person'),
('Fook', 'person'),
('menitipkan', 'OTHER'),
('pesanan', 'OTHER'),
('khas', 'OTHER'),
('kepada', 'OTHER'),
('orang', 'OTHER'),
('ramai', 'OTHER'),
('yang', 'OTHER'),
('mahu', 'OTHER'),
('pulang', 'OTHER'),
('ke', 'OTHER'),
('kampung', 'OTHER'),
('halaman', 'location'),
('masing-masing', 'OTHER'),
('.', 'OTHER'),
('Dalam', 'OTHER'),
('video', 'OTHER'),
('pendek', 'OTHER'),
('terbitan', 'OTHER'),
('Jabatan', 'organization'),
('Keselamatan', 'organization'),
('Jalan', 'organization'),
('Raya', 'organization'),
('(', 'organization'),
('JKJR', 'organization'),
(')', 'organization'),
('itu', 'OTHER'),
(',', 'OTHER'),
('Dr', 'person'),
('Mahathir', 'person'),
('menasihati', 'OTHER'),
('mereka', 'OTHER'),
('supaya', 'OTHER'),
('berhenti', 'OTHER'),
('berehat', 'OTHER'),
('dan', 'OTHER'),
('tidur', 'OTHER'),
('sebentar', 'OTHER'),
('sekiranya', 'OTHER'),
('mengantuk', 'OTHER'),
('ketika', 'OTHER'),
('memandu', 'OTHER'),
('.', 'OTHER')]
[37]:
model.predict(string1)
[37]:
[('memperkenalkan', 'OTHER'),
('Husein', 'person'),
(',', 'OTHER'),
('dia', 'OTHER'),
('sangat', 'OTHER'),
('comel', 'OTHER'),
(',', 'OTHER'),
('berumur', 'OTHER'),
('25', 'OTHER'),
('tahun', 'OTHER'),
(',', 'OTHER'),
('bangsa', 'OTHER'),
('melayu', 'person'),
(',', 'OTHER'),
('agama', 'OTHER'),
('islam', 'person'),
(',', 'OTHER'),
('tinggal', 'OTHER'),
('di', 'OTHER'),
('cyberjaya', 'location'),
('malaysia', 'location'),
(',', 'OTHER'),
('bercakap', 'OTHER'),
('bahasa', 'OTHER'),
('melayu', 'person'),
(',', 'OTHER'),
('semua', 'OTHER'),
('membaca', 'OTHER'),
('buku', 'OTHER'),
('undang-undang', 'OTHER'),
('kewangan', 'OTHER'),
(',', 'OTHER'),
('dengar', 'OTHER'),
('laju', 'OTHER'),
('Siti', 'person'),
('Nurhaliza', 'person'),
('-', 'OTHER'),
('Seluruh', 'OTHER'),
('Cinta', 'OTHER'),
('sambil', 'OTHER'),
('makan', 'OTHER'),
('ayam', 'OTHER'),
('goreng', 'OTHER'),
('KFC', 'location')]
[11]:
quantized_model.predict(string)
[11]:
[('KUALA', 'location'),
('LUMPUR', 'location'),
(':', 'OTHER'),
('Sempena', 'OTHER'),
('sambutan', 'OTHER'),
('Aidilfitri', 'event'),
('minggu', 'time'),
('depan', 'time'),
(',', 'OTHER'),
('Perdana', 'person'),
('Menteri', 'person'),
('Tun', 'person'),
('Dr', 'person'),
('Mahathir', 'person'),
('Mohamad', 'person'),
('dan', 'OTHER'),
('Menteri', 'person'),
('Pengangkutan', 'person'),
('Anthony', 'person'),
('Loke', 'person'),
('Siew', 'person'),
('Fook', 'person'),
('menitipkan', 'OTHER'),
('pesanan', 'OTHER'),
('khas', 'OTHER'),
('kepada', 'OTHER'),
('orang', 'OTHER'),
('ramai', 'OTHER'),
('yang', 'OTHER'),
('mahu', 'OTHER'),
('pulang', 'OTHER'),
('ke', 'OTHER'),
('kampung', 'OTHER'),
('halaman', 'OTHER'),
('masing-masing', 'OTHER'),
('.', 'OTHER'),
('Dalam', 'OTHER'),
('video', 'OTHER'),
('pendek', 'OTHER'),
('terbitan', 'OTHER'),
('Jabatan', 'organization'),
('Keselamatan', 'organization'),
('Jalan', 'organization'),
('Raya', 'organization'),
('(', 'organization'),
('JKJR', 'organization'),
(')', 'organization'),
('itu', 'OTHER'),
(',', 'OTHER'),
('Dr', 'person'),
('Mahathir', 'person'),
('menasihati', 'OTHER'),
('mereka', 'OTHER'),
('supaya', 'OTHER'),
('berhenti', 'OTHER'),
('berehat', 'OTHER'),
('dan', 'OTHER'),
('tidur', 'OTHER'),
('sebentar', 'OTHER'),
('sekiranya', 'OTHER'),
('mengantuk', 'OTHER'),
('ketika', 'OTHER'),
('memandu', 'OTHER'),
('.', 'OTHER')]
[38]:
quantized_model.predict(string1)
[38]:
[('memperkenalkan', 'OTHER'),
('Husein', 'person'),
(',', 'OTHER'),
('dia', 'OTHER'),
('sangat', 'OTHER'),
('comel', 'OTHER'),
(',', 'OTHER'),
('berumur', 'OTHER'),
('25', 'OTHER'),
('tahun', 'OTHER'),
(',', 'OTHER'),
('bangsa', 'OTHER'),
('melayu', 'person'),
(',', 'OTHER'),
('agama', 'OTHER'),
('islam', 'person'),
(',', 'OTHER'),
('tinggal', 'OTHER'),
('di', 'OTHER'),
('cyberjaya', 'location'),
('malaysia', 'location'),
(',', 'OTHER'),
('bercakap', 'OTHER'),
('bahasa', 'OTHER'),
('melayu', 'person'),
(',', 'OTHER'),
('semua', 'OTHER'),
('membaca', 'OTHER'),
('buku', 'OTHER'),
('undang-undang', 'OTHER'),
('kewangan', 'OTHER'),
(',', 'OTHER'),
('dengar', 'OTHER'),
('laju', 'OTHER'),
('Siti', 'person'),
('Nurhaliza', 'person'),
('-', 'OTHER'),
('Seluruh', 'OTHER'),
('Cinta', 'OTHER'),
('sambil', 'OTHER'),
('makan', 'OTHER'),
('ayam', 'OTHER'),
('goreng', 'OTHER'),
('KFC', 'organization')]
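Since predict returns a plain list of (word, tag) pairs, ordinary Python is enough to post-process it. The sketch below collects the words under each tag and drops OTHER; `words_by_tag` is a hypothetical helper, and the sample is the first few pairs of the output shown above.

```python
from collections import defaultdict


def words_by_tag(tagged, ignore=('OTHER',)):
    """Map each tag to the list of words carrying it, skipping ignored tags."""
    grouped = defaultdict(list)
    for word, tag in tagged:
        if tag not in ignore:
            grouped[tag].append(word)
    return dict(grouped)


sample = [
    ('KUALA', 'location'), ('LUMPUR', 'location'), (':', 'OTHER'),
    ('Sempena', 'OTHER'), ('sambutan', 'OTHER'), ('Aidilfitri', 'event'),
]
print(words_by_tag(sample))
# {'location': ['KUALA', 'LUMPUR'], 'event': ['Aidilfitri']}
```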
Group similar tags#
def analyze(self, string: str):
    """
    Analyze a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: {'words': List[str], 'tags': [{'text': 'text', 'type': 'location', 'score': 1.0, 'beginOffset': 0, 'endOffset': 1}]}
    """
[13]:
model.analyze(string)
[13]:
[{'text': ['KUALA', 'LUMPUR'],
'type': 'location',
'score': 1.0,
'beginOffset': 0,
'endOffset': 2},
{'text': [':', 'Sempena', 'sambutan'],
'type': 'OTHER',
'score': 1.0,
'beginOffset': 2,
'endOffset': 5},
{'text': ['Aidilfitri'],
'type': 'event',
'score': 1.0,
'beginOffset': 5,
'endOffset': 6},
{'text': ['minggu'],
'type': 'OTHER',
'score': 1.0,
'beginOffset': 6,
'endOffset': 7},
{'text': ['depan'],
'type': 'time',
'score': 1.0,
'beginOffset': 7,
'endOffset': 8},
{'text': [','],
'type': 'OTHER',
'score': 1.0,
'beginOffset': 8,
'endOffset': 9},
{'text': ['Perdana', 'Menteri', 'Tun', 'Dr', 'Mahathir', 'Mohamad'],
'type': 'person',
'score': 1.0,
'beginOffset': 9,
'endOffset': 15},
{'text': ['dan'],
'type': 'OTHER',
'score': 1.0,
'beginOffset': 15,
'endOffset': 16},
{'text': ['Menteri', 'Pengangkutan'],
'type': 'organization',
'score': 1.0,
'beginOffset': 16,
'endOffset': 18},
{'text': ['Anthony', 'Loke', 'Siew', 'Fook'],
'type': 'person',
'score': 1.0,
'beginOffset': 18,
'endOffset': 22},
{'text': ['menitipkan',
'pesanan',
'khas',
'kepada',
'orang',
'ramai',
'yang',
'mahu',
'pulang',
'ke',
'kampung',
'halaman',
'masing-masing',
'.',
'Dalam',
'video',
'pendek',
'terbitan'],
'type': 'OTHER',
'score': 1.0,
'beginOffset': 22,
'endOffset': 40},
{'text': ['Jabatan', 'Keselamatan', 'Jalan', 'Raya', '(', 'JKJR', ')'],
'type': 'organization',
'score': 1.0,
'beginOffset': 40,
'endOffset': 47},
{'text': ['itu', ','],
'type': 'OTHER',
'score': 1.0,
'beginOffset': 47,
'endOffset': 49},
{'text': ['Dr', 'Mahathir'],
'type': 'person',
'score': 1.0,
'beginOffset': 49,
'endOffset': 51},
{'text': ['menasihati',
'mereka',
'supaya',
'berhenti',
'berehat',
'dan',
'tidur',
'sebentar',
'sekiranya',
'mengantuk',
'ketika',
'memandu',
'.'],
'type': 'OTHER',
'score': 1.0,
'beginOffset': 51,
'endOffset': 64}]
[39]:
model.analyze(string1)
[39]:
[{'text': ['memperkenalkan'],
'type': 'OTHER',
'score': 1.0,
'beginOffset': 0,
'endOffset': 1},
{'text': ['Husein'],
'type': 'person',
'score': 1.0,
'beginOffset': 1,
'endOffset': 2},
{'text': [',',
'dia',
'sangat',
'comel',
',',
'berumur',
'25',
'tahun',
',',
'bangsa'],
'type': 'OTHER',
'score': 1.0,
'beginOffset': 2,
'endOffset': 12},
{'text': ['melayu'],
'type': 'person',
'score': 1.0,
'beginOffset': 12,
'endOffset': 13},
{'text': [',', 'agama'],
'type': 'OTHER',
'score': 1.0,
'beginOffset': 13,
'endOffset': 15},
{'text': ['islam'],
'type': 'person',
'score': 1.0,
'beginOffset': 15,
'endOffset': 16},
{'text': [',', 'tinggal', 'di'],
'type': 'OTHER',
'score': 1.0,
'beginOffset': 16,
'endOffset': 19},
{'text': ['cyberjaya', 'malaysia'],
'type': 'location',
'score': 1.0,
'beginOffset': 19,
'endOffset': 21},
{'text': [',', 'bercakap', 'bahasa'],
'type': 'OTHER',
'score': 1.0,
'beginOffset': 21,
'endOffset': 24},
{'text': ['melayu'],
'type': 'person',
'score': 1.0,
'beginOffset': 24,
'endOffset': 25},
{'text': [',',
'semua',
'membaca',
'buku',
'undang-undang',
'kewangan',
',',
'dengar',
'laju'],
'type': 'OTHER',
'score': 1.0,
'beginOffset': 25,
'endOffset': 34},
{'text': ['Siti', 'Nurhaliza'],
'type': 'person',
'score': 1.0,
'beginOffset': 34,
'endOffset': 36},
{'text': ['-', 'Seluruh', 'Cinta', 'sambil', 'makan', 'ayam', 'goreng'],
'type': 'OTHER',
'score': 1.0,
'beginOffset': 36,
'endOffset': 43},
{'text': ['KFC'],
'type': 'organization',
'score': 1.0,
'beginOffset': 43,
'endOffset': 44}]
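The idea behind this grouping can be illustrated by merging runs of adjacent tokens that share a tag. The sketch below is an assumption about the general technique, not Malaya's implementation: `group_consecutive` is a hypothetical helper, it omits the score field, and it will merge two distinct neighbouring entities if they carry the same tag.

```python
from itertools import groupby


def group_consecutive(tagged):
    """Merge adjacent tokens sharing a tag into spans with token offsets."""
    spans, offset = [], 0
    for tag, run in groupby(tagged, key=lambda pair: pair[1]):
        words = [word for word, _ in run]
        spans.append({
            'text': words,
            'type': tag,
            'beginOffset': offset,
            'endOffset': offset + len(words),
        })
        offset += len(words)
    return spans


pairs = [
    ('KUALA', 'location'), ('LUMPUR', 'location'), (':', 'OTHER'),
    ('Sempena', 'OTHER'), ('sambutan', 'OTHER'), ('Aidilfitri', 'event'),
]
for span in group_consecutive(pairs):
    print(span)
```

The offsets here are token indices, with endOffset exclusive, which matches the convention visible in the output above.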
Vectorize#
Say you want to visualize word-level representations in a lower dimension. You can use model.vectorize:
def vectorize(self, string: str):
    """
    vectorize a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: np.array
    """
[15]:
strings = [string,
'Husein baca buku Perlembagaan yang berharga 3k ringgit dekat kfc sungai petani minggu lepas, 2 ptg 2 oktober 2019 , suhu 32 celcius, sambil makan ayam goreng dan milo o ais',
'contact Husein at husein.zol05@gmail.com',
'tolong tempahkan meja makan makan nasi dagang dan jus apple, milo tarik esok dekat Restoran Sebulek']
[16]:
r = [quantized_model.vectorize(string) for string in strings]
[17]:
x, y = [], []
for row in r:
    x.extend([i[0] for i in row])
    y.extend([i[1] for i in row])
[18]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
tsne = TSNE().fit_transform(y)
tsne.shape
[18]:
(124, 2)
[19]:
plt.figure(figsize = (7, 7))
plt.scatter(tsne[:, 0], tsne[:, 1])
labels = x
for label, x, y in zip(
    labels, tsne[:, 0], tsne[:, 1]
):
    label = (
        '%s, %.3f' % (label[0], label[1])
        if isinstance(label, list)
        else label
    )
    plt.annotate(
        label,
        xy = (x, y),
        xytext = (0, 0),
        textcoords = 'offset points',
    )

Pretty good, the model is able to cluster similar entities.
Load Transformer Ontonotes 5 model#
def transformer_ontonotes5(
    model: str = 'xlnet', quantized: bool = False, **kwargs
):
    """
    Load Transformer Entity Tagging model trained on Ontonotes 5 Bahasa, transfer learning Transformer + CRF.

    Parameters
    ----------
    model : str, optional (default='xlnet')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.
        * ``'fastformer'`` - FastFormer BASE parameters.
        * ``'tiny-fastformer'`` - FastFormer TINY parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result: model
        List of model classes:

        * if `bert` in model, will return `malaya.model.bert.TaggingBERT`.
        * if `xlnet` in model, will return `malaya.model.xlnet.TaggingXLNET`.
        * if `fastformer` in model, will return `malaya.model.fastformer.TaggingFastFormer`.
    """
[20]:
albert = malaya.entity.transformer_ontonotes5(model = 'albert')
INFO:root:running entity-ontonotes5/albert using device /device:CPU:0
[21]:
alxlnet = malaya.entity.transformer_ontonotes5(model = 'alxlnet')
INFO:root:running entity-ontonotes5/alxlnet using device /device:CPU:0
Load Quantized model#
To load an 8-bit quantized model, simply pass quantized = True; the default is False.
Expect a slight accuracy drop from the quantized model. It is also not necessarily faster than the normal 32-bit float model; that depends entirely on the machine.
[22]:
quantized_albert = malaya.entity.transformer_ontonotes5(model = 'albert', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
INFO:root:running entity-ontonotes5/albert-quantized using device /device:CPU:0
[23]:
quantized_alxlnet = malaya.entity.transformer_ontonotes5(model = 'alxlnet', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
INFO:root:running entity-ontonotes5/alxlnet-quantized using device /device:CPU:0
Predict#
def predict(self, string: str):
    """
    Tag a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: List[Tuple[str, str]]
    """
[24]:
albert.predict(string)
[24]:
[('KUALA', 'GPE'),
('LUMPUR', 'GPE'),
(':', 'OTHER'),
('Sempena', 'OTHER'),
('sambutan', 'OTHER'),
('Aidilfitri', 'DATE'),
('minggu', 'OTHER'),
('depan', 'OTHER'),
(',', 'OTHER'),
('Perdana', 'OTHER'),
('Menteri', 'OTHER'),
('Tun', 'PERSON'),
('Dr', 'PERSON'),
('Mahathir', 'PERSON'),
('Mohamad', 'PERSON'),
('dan', 'OTHER'),
('Menteri', 'OTHER'),
('Pengangkutan', 'OTHER'),
('Anthony', 'PERSON'),
('Loke', 'PERSON'),
('Siew', 'PERSON'),
('Fook', 'PERSON'),
('menitipkan', 'OTHER'),
('pesanan', 'OTHER'),
('khas', 'OTHER'),
('kepada', 'OTHER'),
('orang', 'OTHER'),
('ramai', 'OTHER'),
('yang', 'OTHER'),
('mahu', 'OTHER'),
('pulang', 'OTHER'),
('ke', 'OTHER'),
('kampung', 'OTHER'),
('halaman', 'OTHER'),
('masing-masing', 'OTHER'),
('.', 'OTHER'),
('Dalam', 'OTHER'),
('video', 'OTHER'),
('pendek', 'OTHER'),
('terbitan', 'OTHER'),
('Jabatan', 'ORG'),
('Keselamatan', 'ORG'),
('Jalan', 'ORG'),
('Raya', 'ORG'),
('(', 'ORG'),
('JKJR', 'ORG'),
(')', 'ORG'),
('itu', 'OTHER'),
(',', 'OTHER'),
('Dr', 'PERSON'),
('Mahathir', 'PERSON'),
('menasihati', 'OTHER'),
('mereka', 'OTHER'),
('supaya', 'OTHER'),
('berhenti', 'OTHER'),
('berehat', 'OTHER'),
('dan', 'OTHER'),
('tidur', 'OTHER'),
('sebentar', 'OTHER'),
('sekiranya', 'OTHER'),
('mengantuk', 'OTHER'),
('ketika', 'OTHER'),
('memandu', 'OTHER'),
('.', 'OTHER')]
[25]:
alxlnet.predict(string)
[25]:
[('KUALA', 'EVENT'),
('LUMPUR', 'EVENT'),
(':', 'OTHER'),
('Sempena', 'OTHER'),
('sambutan', 'DATE'),
('Aidilfitri', 'DATE'),
('minggu', 'DATE'),
('depan', 'DATE'),
(',', 'OTHER'),
('Perdana', 'OTHER'),
('Menteri', 'OTHER'),
('Tun', 'PERSON'),
('Dr', 'PERSON'),
('Mahathir', 'PERSON'),
('Mohamad', 'PERSON'),
('dan', 'OTHER'),
('Menteri', 'OTHER'),
('Pengangkutan', 'OTHER'),
('Anthony', 'PERSON'),
('Loke', 'PERSON'),
('Siew', 'PERSON'),
('Fook', 'PERSON'),
('menitipkan', 'OTHER'),
('pesanan', 'OTHER'),
('khas', 'OTHER'),
('kepada', 'OTHER'),
('orang', 'OTHER'),
('ramai', 'OTHER'),
('yang', 'OTHER'),
('mahu', 'OTHER'),
('pulang', 'OTHER'),
('ke', 'OTHER'),
('kampung', 'OTHER'),
('halaman', 'OTHER'),
('masing-masing', 'OTHER'),
('.', 'OTHER'),
('Dalam', 'OTHER'),
('video', 'OTHER'),
('pendek', 'OTHER'),
('terbitan', 'OTHER'),
('Jabatan', 'ORG'),
('Keselamatan', 'ORG'),
('Jalan', 'ORG'),
('Raya', 'ORG'),
('(', 'ORG'),
('JKJR', 'ORG'),
(')', 'ORG'),
('itu', 'OTHER'),
(',', 'OTHER'),
('Dr', 'OTHER'),
('Mahathir', 'PERSON'),
('menasihati', 'OTHER'),
('mereka', 'OTHER'),
('supaya', 'OTHER'),
('berhenti', 'OTHER'),
('berehat', 'OTHER'),
('dan', 'OTHER'),
('tidur', 'OTHER'),
('sebentar', 'OTHER'),
('sekiranya', 'OTHER'),
('mengantuk', 'OTHER'),
('ketika', 'OTHER'),
('memandu', 'OTHER'),
('.', 'OTHER')]
[40]:
albert.predict(string1)
[40]:
[('memperkenalkan', 'OTHER'),
('Husein', 'PERSON'),
(',', 'OTHER'),
('dia', 'OTHER'),
('sangat', 'OTHER'),
('comel', 'OTHER'),
(',', 'OTHER'),
('berumur', 'DATE'),
('25', 'DATE'),
('tahun', 'DATE'),
(',', 'OTHER'),
('bangsa', 'OTHER'),
('melayu', 'OTHER'),
(',', 'OTHER'),
('agama', 'OTHER'),
('islam', 'OTHER'),
(',', 'OTHER'),
('tinggal', 'OTHER'),
('di', 'OTHER'),
('cyberjaya', 'GPE'),
('malaysia', 'GPE'),
(',', 'OTHER'),
('bercakap', 'OTHER'),
('bahasa', 'OTHER'),
('melayu', 'OTHER'),
(',', 'OTHER'),
('semua', 'OTHER'),
('membaca', 'OTHER'),
('buku', 'OTHER'),
('undang-undang', 'OTHER'),
('kewangan', 'OTHER'),
(',', 'OTHER'),
('dengar', 'OTHER'),
('laju', 'OTHER'),
('Siti', 'WORK_OF_ART'),
('Nurhaliza', 'WORK_OF_ART'),
('-', 'WORK_OF_ART'),
('Seluruh', 'WORK_OF_ART'),
('Cinta', 'WORK_OF_ART'),
('sambil', 'OTHER'),
('makan', 'OTHER'),
('ayam', 'OTHER'),
('goreng', 'OTHER'),
('KFC', 'ORG')]
[41]:
alxlnet.predict(string1)
[41]:
[('memperkenalkan', 'OTHER'),
('Husein', 'PERSON'),
(',', 'OTHER'),
('dia', 'OTHER'),
('sangat', 'OTHER'),
('comel', 'OTHER'),
(',', 'OTHER'),
('berumur', 'OTHER'),
('25', 'DATE'),
('tahun', 'DATE'),
(',', 'OTHER'),
('bangsa', 'OTHER'),
('melayu', 'OTHER'),
(',', 'OTHER'),
('agama', 'OTHER'),
('islam', 'NORP'),
(',', 'OTHER'),
('tinggal', 'OTHER'),
('di', 'OTHER'),
('cyberjaya', 'GPE'),
('malaysia', 'GPE'),
(',', 'OTHER'),
('bercakap', 'OTHER'),
('bahasa', 'LANGUAGE'),
('melayu', 'LANGUAGE'),
(',', 'OTHER'),
('semua', 'OTHER'),
('membaca', 'OTHER'),
('buku', 'OTHER'),
('undang-undang', 'OTHER'),
('kewangan', 'OTHER'),
(',', 'OTHER'),
('dengar', 'OTHER'),
('laju', 'OTHER'),
('Siti', 'WORK_OF_ART'),
('Nurhaliza', 'WORK_OF_ART'),
('-', 'WORK_OF_ART'),
('Seluruh', 'WORK_OF_ART'),
('Cinta', 'WORK_OF_ART'),
('sambil', 'OTHER'),
('makan', 'OTHER'),
('ayam', 'OTHER'),
('goreng', 'OTHER'),
('KFC', 'OTHER')]
[28]:
quantized_albert.predict(string)
[28]:
[('KUALA', 'GPE'),
('LUMPUR', 'GPE'),
(':', 'OTHER'),
('Sempena', 'OTHER'),
('sambutan', 'OTHER'),
('Aidilfitri', 'DATE'),
('minggu', 'OTHER'),
('depan', 'OTHER'),
(',', 'OTHER'),
('Perdana', 'OTHER'),
('Menteri', 'OTHER'),
('Tun', 'PERSON'),
('Dr', 'PERSON'),
('Mahathir', 'PERSON'),
('Mohamad', 'PERSON'),
('dan', 'OTHER'),
('Menteri', 'OTHER'),
('Pengangkutan', 'OTHER'),
('Anthony', 'PERSON'),
('Loke', 'PERSON'),
('Siew', 'PERSON'),
('Fook', 'PERSON'),
('menitipkan', 'OTHER'),
('pesanan', 'OTHER'),
('khas', 'OTHER'),
('kepada', 'OTHER'),
('orang', 'OTHER'),
('ramai', 'OTHER'),
('yang', 'OTHER'),
('mahu', 'OTHER'),
('pulang', 'OTHER'),
('ke', 'OTHER'),
('kampung', 'OTHER'),
('halaman', 'OTHER'),
('masing-masing', 'OTHER'),
('.', 'OTHER'),
('Dalam', 'OTHER'),
('video', 'OTHER'),
('pendek', 'OTHER'),
('terbitan', 'OTHER'),
('Jabatan', 'ORG'),
('Keselamatan', 'ORG'),
('Jalan', 'ORG'),
('Raya', 'ORG'),
('(', 'ORG'),
('JKJR', 'ORG'),
(')', 'ORG'),
('itu', 'OTHER'),
(',', 'OTHER'),
('Dr', 'PERSON'),
('Mahathir', 'PERSON'),
('menasihati', 'OTHER'),
('mereka', 'OTHER'),
('supaya', 'OTHER'),
('berhenti', 'OTHER'),
('berehat', 'OTHER'),
('dan', 'OTHER'),
('tidur', 'OTHER'),
('sebentar', 'OTHER'),
('sekiranya', 'OTHER'),
('mengantuk', 'OTHER'),
('ketika', 'OTHER'),
('memandu', 'OTHER'),
('.', 'OTHER')]
[42]:
quantized_alxlnet.predict(string1)
[42]:
[('memperkenalkan', 'OTHER'),
('Husein', 'PERSON'),
(',', 'OTHER'),
('dia', 'OTHER'),
('sangat', 'OTHER'),
('comel', 'OTHER'),
(',', 'OTHER'),
('berumur', 'DATE'),
('25', 'DATE'),
('tahun', 'DATE'),
(',', 'OTHER'),
('bangsa', 'OTHER'),
('melayu', 'OTHER'),
(',', 'OTHER'),
('agama', 'OTHER'),
('islam', 'OTHER'),
(',', 'OTHER'),
('tinggal', 'OTHER'),
('di', 'OTHER'),
('cyberjaya', 'GPE'),
('malaysia', 'GPE'),
(',', 'OTHER'),
('bercakap', 'OTHER'),
('bahasa', 'OTHER'),
('melayu', 'OTHER'),
(',', 'OTHER'),
('semua', 'OTHER'),
('membaca', 'OTHER'),
('buku', 'OTHER'),
('undang-undang', 'OTHER'),
('kewangan', 'OTHER'),
(',', 'OTHER'),
('dengar', 'OTHER'),
('laju', 'OTHER'),
('Siti', 'WORK_OF_ART'),
('Nurhaliza', 'WORK_OF_ART'),
('-', 'X'),
('Seluruh', 'WORK_OF_ART'),
('Cinta', 'WORK_OF_ART'),
('sambil', 'OTHER'),
('makan', 'OTHER'),
('ayam', 'OTHER'),
('goreng', 'OTHER'),
('KFC', 'OTHER')]
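The outputs above show that different architectures can tag the same token differently. A small helper makes such differences easy to spot; `disagreements` is a hypothetical name, and the two samples are the first six pairs of the albert and alxlnet outputs for string shown above.

```python
def disagreements(first, second):
    """Return (index, word, tag_first, tag_second) where the tags differ."""
    if len(first) != len(second):
        raise ValueError('predictions must cover the same tokens')
    return [
        (i, word, tag_a, tag_b)
        for i, ((word, tag_a), (_, tag_b)) in enumerate(zip(first, second))
        if tag_a != tag_b
    ]


albert_head = [
    ('KUALA', 'GPE'), ('LUMPUR', 'GPE'), (':', 'OTHER'),
    ('Sempena', 'OTHER'), ('sambutan', 'OTHER'), ('Aidilfitri', 'DATE'),
]
alxlnet_head = [
    ('KUALA', 'EVENT'), ('LUMPUR', 'EVENT'), (':', 'OTHER'),
    ('Sempena', 'OTHER'), ('sambutan', 'DATE'), ('Aidilfitri', 'DATE'),
]
print(disagreements(albert_head, alxlnet_head))
# [(0, 'KUALA', 'GPE', 'EVENT'), (1, 'LUMPUR', 'GPE', 'EVENT'), (4, 'sambutan', 'OTHER', 'DATE')]
```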
Group similar tags#
def analyze(self, string: str):
    """
    Analyze a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: {'words': List[str], 'tags': [{'text': 'text', 'type': 'location', 'score': 1.0, 'beginOffset': 0, 'endOffset': 1}]}
    """
[30]:
alxlnet.analyze(string1)
[30]:
[{'text': ['memperkenalkan', 'Husein', ',', 'dia', 'sangat', 'comel', ','],
'type': 'OTHER',
'score': 1.0,
'beginOffset': 0,
'endOffset': 7},
{'text': ['berumur', '25', 'tahun'],
'type': 'DATE',
'score': 1.0,
'beginOffset': 7,
'endOffset': 10},
{'text': [',', 'bangsa', 'melayu', ',', 'agama'],
'type': 'OTHER',
'score': 1.0,
'beginOffset': 10,
'endOffset': 15},
{'text': ['islam'],
'type': 'NORP',
'score': 1.0,
'beginOffset': 15,
'endOffset': 16},
{'text': [',', 'tinggal', 'di'],
'type': 'OTHER',
'score': 1.0,
'beginOffset': 16,
'endOffset': 19},
{'text': ['cyberjaya'],
'type': 'GPE',
'score': 1.0,
'beginOffset': 19,
'endOffset': 20},
{'text': ['malaysia',
',',
'bercakap',
'bahasa',
'melayu',
',',
'semua',
'membaca',
'buku',
'undang-undang',
'kewangan',
',',
'dengar',
'laju'],
'type': 'OTHER',
'score': 1.0,
'beginOffset': 20,
'endOffset': 34},
{'text': ['Justin', 'Bieber'],
'type': 'ORG',
'score': 1.0,
'beginOffset': 34,
'endOffset': 36},
{'text': ['-', 'Baby'],
'type': 'X',
'score': 1.0,
'beginOffset': 36,
'endOffset': 38},
{'text': ['sambil', 'makan', 'ayam', 'goreng'],
'type': 'OTHER',
'score': 1.0,
'beginOffset': 38,
'endOffset': 42},
{'text': ['KFC'],
'type': 'ORG',
'score': 1.0,
'beginOffset': 42,
'endOffset': 43}]
Vectorize#
Say you want to visualize word-level representations in a lower dimension. You can use model.vectorize:
def vectorize(self, string: str):
    """
    vectorize a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: np.array
    """
[31]:
strings = [string, string1]
r = [quantized_model.vectorize(string) for string in strings]
[32]:
x, y = [], []
for row in r:
    x.extend([i[0] for i in row])
    y.extend([i[1] for i in row])
[33]:
tsne = TSNE().fit_transform(y)
tsne.shape
[33]:
(107, 2)
[48]:
plt.figure(figsize = (7, 7))
plt.scatter(tsne[:, 0], tsne[:, 1])
labels = x
for label, x, y in zip(
    labels, tsne[:, 0], tsne[:, 1]
):
    label = (
        '%s, %.3f' % (label[0], label[1])
        if isinstance(label, list)
        else label
    )
    plt.annotate(
        label,
        xy = (x, y),
        xytext = (0, 0),
        textcoords = 'offset points',
    )

Pretty good, the model is able to cluster similar entities.
Load general Malaya entity model#
This model is able to classify:
date
money
temperature
distance
volume
duration
phone
email
url
time
datetime
local and generic foods (see the available rules in malaya.texts._food)
local and generic drinks (see the available rules in malaya.texts._food)
Optionally, we can plug in BERT or any other deep learning model by passing malaya.entity.general_entity(model = model), as long as the model has a predict method that returns [(string, label), (string, label)].
[32]:
entity = malaya.entity.general_entity(model = model)
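The requirement stated above is only that the model has a predict method returning [(string, label), (string, label)]. The class below is a minimal sketch of an object satisfying that contract; `KeywordTagger` and its lookup table are hypothetical, and whether general_entity needs anything beyond this contract is not verified here.

```python
class KeywordTagger:
    """Hypothetical stand-in model: tags whitespace tokens from a lookup table."""

    def __init__(self, lookup):
        self.lookup = lookup

    def predict(self, string):
        # Return [(token, label), ...]; unknown tokens fall back to 'OTHER'.
        return [
            (token, self.lookup.get(token, 'OTHER'))
            for token in string.split()
        ]


tagger = KeywordTagger({'Husein': 'PERSON', 'esok': 'DATE'})
print(tagger.predict('contact Husein esok'))
# [('contact', 'OTHER'), ('Husein', 'PERSON'), ('esok', 'DATE')]
```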
[33]:
entity.predict('Husein baca buku Perlembagaan yang berharga 3k ringgit dekat kfc sungai petani minggu lepas, 2 ptg 2 oktober 2019 , suhu 32 celcius, sambil makan ayam goreng dan milo o ais')
[33]:
{'PERSON': ['Husein'],
'OTHER': ['baca buku Perlembagaan yang berharga',
'ringgit dekat kfc sungai petani',
', suhu 32 celcius, sambil makan ayam goreng dan milo o ais'],
'CARDINAL': ['3k'],
'DATE': ['minggu lepas,', '2019'],
'TIME': ['2 ptg'],
'MONEY': ['2 oktober'],
'date': {'2 oktober 2019': datetime.datetime(2019, 10, 2, 0, 0),
'minggu lalu': datetime.datetime(2021, 2, 11, 13, 27, 58, 82807)},
'money': {'3k ringgit': 'RM3000.0'},
'temperature': ['32 celcius'],
'distance': [],
'volume': [],
'duration': [],
'phone': [],
'email': [],
'url': [],
'time': {'2 PM': datetime.datetime(2021, 2, 18, 14, 0)},
'datetime': {'2 ptg 2 oktober 2019': datetime.datetime(2019, 10, 2, 14, 0)},
'food': ['ayam goreng'],
'drink': ['milo o ais'],
'weight': []}
[34]:
entity.predict('contact Husein at husein.zol05@gmail.com')
[34]:
{'OTHER': ['contact Husein at husein.zol05@gmail.com'],
'date': {},
'money': {},
'temperature': [],
'distance': [],
'volume': [],
'duration': [],
'phone': [],
'email': ['husein.zol05@gmail.com'],
'url': [],
'time': {},
'datetime': {},
'food': [],
'drink': [],
'weight': []}
[35]:
entity.predict('tolong tempahkan meja makan makan nasi dagang dan jus apple, milo tarik esok dekat Restoran Sebulek')
[35]:
{'OTHER': ['tolong tempahkan meja makan makan nasi dagang dan jus apple, milo tarik',
'dekat'],
'DATE': ['esok'],
'ORG': ['Restoran Sebulek'],
'date': {'esok': datetime.datetime(2021, 2, 19, 13, 27, 58, 505853)},
'money': {},
'temperature': [],
'distance': [],
'volume': [],
'duration': [],
'phone': [],
'email': [],
'url': [],
'time': {},
'datetime': {},
'food': ['nasi dagang'],
'drink': ['milo tarik', 'jus apple'],
'weight': []}
Voting stack model#
[43]:
malaya.stack.voting_stack([albert, alxlnet, alxlnet], string1)
[43]:
[('memperkenalkan', 'OTHER'),
('Husein', 'PERSON'),
(',', 'OTHER'),
('dia', 'OTHER'),
('sangat', 'OTHER'),
('comel', 'OTHER'),
(',', 'OTHER'),
('berumur', 'DATE'),
('25', 'DATE'),
('tahun', 'DATE'),
(',', 'OTHER'),
('bangsa', 'OTHER'),
('melayu', 'OTHER'),
(',', 'OTHER'),
('agama', 'OTHER'),
('islam', 'OTHER'),
(',', 'OTHER'),
('tinggal', 'OTHER'),
('di', 'OTHER'),
('cyberjaya', 'GPE'),
('malaysia', 'GPE'),
(',', 'OTHER'),
('bercakap', 'OTHER'),
('bahasa', 'OTHER'),
('melayu', 'LANGUAGE'),
(',', 'OTHER'),
('semua', 'OTHER'),
('membaca', 'OTHER'),
('buku', 'OTHER'),
('undang-undang', 'OTHER'),
('kewangan', 'OTHER'),
(',', 'OTHER'),
('dengar', 'OTHER'),
('laju', 'OTHER'),
('Siti', 'WORK_OF_ART'),
('Nurhaliza', 'WORK_OF_ART'),
('-', 'WORK_OF_ART'),
('Seluruh', 'WORK_OF_ART'),
('Cinta', 'WORK_OF_ART'),
('sambil', 'OTHER'),
('makan', 'OTHER'),
('ayam', 'OTHER'),
('goreng', 'OTHER'),
('KFC', 'ORG')]
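The general idea of voting across models can be sketched as a per-token majority vote over several predict outputs. This illustrates the technique only and is not necessarily how malaya.stack.voting_stack resolves votes; `majority_vote` is a hypothetical helper, and on a tie it keeps the tag that appears first among the given predictions.

```python
from collections import Counter


def majority_vote(predictions):
    """Pick the most common tag per token across [(word, tag), ...] lists."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError('all predictions must have the same number of tokens')
    voted = []
    for token_votes in zip(*predictions):
        word = token_votes[0][0]
        tags = Counter(tag for _, tag in token_votes)
        voted.append((word, tags.most_common(1)[0][0]))
    return voted


a = [('Husein', 'PERSON'), ('KFC', 'ORG')]
b = [('Husein', 'PERSON'), ('KFC', 'OTHER')]
c = [('Husein', 'PERSON'), ('KFC', 'OTHER')]
print(majority_vote([a, b, c]))
# [('Husein', 'PERSON'), ('KFC', 'OTHER')]
```

Passing the same model twice, as in the call above, gives that model two votes and lets it outvote a single dissenting model under this scheme.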