Part-of-Speech Recognition
This tutorial is available as an IPython notebook at Malaya/example/part-of-speech.
This module was trained only on standard language structure, so it is not safe to use it on local (colloquial) language structure.
[1]:
%%time
import malaya
CPU times: user 5.24 s, sys: 1.15 s, total: 6.39 s
Wall time: 7.91 s
Describe supported POS
[2]:
malaya.pos.describe()
[2]:
 | Tag | Description |
---|---|---|
0 | ADJ | Adjective, kata sifat |
1 | ADP | Adposition |
2 | ADV | Adverb, kata keterangan |
3 | ADX | Auxiliary verb, kata kerja tambahan |
4 | CCONJ | Coordinating conjuction, kata hubung |
5 | DET | Determiner, kata penentu |
6 | NOUN | Noun, kata nama |
7 | NUM | Number, nombor |
8 | PART | Particle |
9 | PRON | Pronoun, kata ganti |
10 | PROPN | Proper noun, kata ganti nama khas |
11 | SCONJ | Subordinating conjunction |
12 | SYM | Symbol |
13 | VERB | Verb, kata kerja |
14 | X | Other |
List available Transformer POS models
[3]:
malaya.pos.available_transformer()
INFO:root:tested on 20% test set.
[3]:
 | Size (MB) | Quantized Size (MB) | Accuracy |
---|---|---|---|
bert | 426.4 | 111.00 | 0.952 |
tiny-bert | 57.7 | 15.40 | 0.953 |
albert | 48.7 | 12.80 | 0.951 |
tiny-albert | 22.4 | 5.98 | 0.933 |
xlnet | 446.6 | 118.00 | 0.954 |
alxlnet | 46.8 | 13.30 | 0.951 |
Check the accuracy chart before selecting a model: https://malaya.readthedocs.io/en/latest/Accuracy.html#pos-recognition
You might want to use Tiny-Albert: at 22.4 MB it is the smallest model, and its accuracy (0.933) stays within about two percentage points of the larger models.
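The size and accuracy trade-off in the table can be explored with a few lines of plain Python. This is only a sketch: the numbers are copied from the table above, and `pick_model` is a hypothetical helper for illustration, not part of Malaya.

```python
# Figures copied from the table above:
# (size in MB, quantized size in MB, accuracy).
MODELS = {
    'bert': (426.4, 111.00, 0.952),
    'tiny-bert': (57.7, 15.40, 0.953),
    'albert': (48.7, 12.80, 0.951),
    'tiny-albert': (22.4, 5.98, 0.933),
    'xlnet': (446.6, 118.00, 0.954),
    'alxlnet': (46.8, 13.30, 0.951),
}


def pick_model(max_size_mb, quantized=False):
    """Hypothetical helper: most accurate model that fits the size budget."""
    column = 1 if quantized else 0
    candidates = [
        (stats[2], -stats[column], name)
        for name, stats in MODELS.items()
        if stats[column] <= max_size_mb
    ]
    if not candidates:
        return None
    # Highest accuracy wins; ties go to the smaller model.
    return max(candidates)[2]


print(pick_model(25))    # budget of 25 MB, float models
print(pick_model(100))   # budget of 100 MB, float models
print(pick_model(14, quantized=True))
```

The returned name can then be passed as the `model` argument of `malaya.pos.transformer`.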
[4]:
string = 'KUALA LUMPUR: Sempena sambutan Aidilfitri minggu depan, Perdana Menteri Tun Dr Mahathir Mohamad dan Menteri Pengangkutan Anthony Loke Siew Fook menitipkan pesanan khas kepada orang ramai yang mahu pulang ke kampung halaman masing-masing. Dalam video pendek terbitan Jabatan Keselamatan Jalan Raya (JKJR) itu, Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'
Load ALBERT model
[6]:
model = malaya.pos.transformer(model = 'albert')
INFO:tensorflow:loading sentence piece model
Load Quantized model
To load the 8-bit quantized model, simply pass quantized = True; the default is False.
Expect a slight accuracy drop from the quantized model. It is also not necessarily faster than the normal 32-bit float model; that depends entirely on the machine.
[6]:
quantized_model = malaya.pos.transformer(model = 'albert', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
INFO:tensorflow:loading sentence piece model
INFO:tensorflow:loading sentence piece model
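To see why an 8-bit model can lose a little accuracy, here is a minimal sketch of generic affine quantization in plain Python. It is an illustration only: the scheme Malaya actually uses is not described here, the helper names are hypothetical, and the weights are made-up values.

```python
def quantize(values, bits=8):
    """Map floats to integer codes in [0, 2**bits - 1] with an affine scale."""
    lo, hi = min(values), max(values)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant input
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate floats from the integer codes."""
    return [c * scale + lo for c in codes]


weights = [-0.82, -0.31, 0.0, 0.07, 0.45, 1.2]  # made-up example weights
codes, scale, lo = quantize(weights)
restored = dequantize(codes, scale, lo)

# Each restored value is off by at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes)
print(max_error, scale / 2)
```

The rounding error per weight is small, but it is not zero, which is consistent with the warning printed when the quantized model is loaded.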
[7]:
model.predict(string)
[7]:
[('Kuala', 'PROPN'),
('Lumpur:', 'PROPN'),
('Sempena', 'ADP'),
('sambutan', 'NOUN'),
('Aidilfitri', 'NOUN'),
('minggu', 'NOUN'),
('depan,', 'ADJ'),
('Perdana', 'PROPN'),
('Menteri', 'PROPN'),
('Tun', 'PROPN'),
('Dr', 'PROPN'),
('Mahathir', 'PROPN'),
('Mohamad', 'PROPN'),
('dan', 'CCONJ'),
('Menteri', 'PROPN'),
('Pengangkutan', 'PROPN'),
('Anthony', 'PROPN'),
('Loke', 'PROPN'),
('Siew', 'PROPN'),
('Fook', 'PROPN'),
('menitipkan', 'VERB'),
('pesanan', 'NOUN'),
('khas', 'ADJ'),
('kepada', 'ADP'),
('orang', 'NOUN'),
('ramai', 'ADJ'),
('yang', 'PRON'),
('mahu', 'ADV'),
('pulang', 'VERB'),
('ke', 'ADP'),
('kampung', 'NOUN'),
('halaman', 'NOUN'),
('masing-masing.', 'DET'),
('Dalam', 'ADP'),
('video', 'NOUN'),
('pendek', 'ADJ'),
('terbitan', 'NOUN'),
('Jabatan', 'PROPN'),
('Keselamatan', 'PROPN'),
('Jalan', 'PROPN'),
('Raya', 'PROPN'),
('(JKJR)', 'PUNCT'),
('itu,', 'DET'),
('Dr', 'PROPN'),
('Mahathir', 'PROPN'),
('menasihati', 'VERB'),
('mereka', 'PRON'),
('supaya', 'SCONJ'),
('berhenti', 'VERB'),
('berehat', 'VERB'),
('dan', 'CCONJ'),
('tidur', 'VERB'),
('sebentar', 'NOUN'),
('sekiranya', 'SCONJ'),
('mengantuk', 'ADJ'),
('ketika', 'SCONJ'),
('memandu.', 'VERB')]
[7]:
quantized_model.predict(string)
[7]:
[('KUALA', 'PROPN'),
('LUMPUR:', 'PROPN'),
('Sempena', 'ADP'),
('sambutan', 'NOUN'),
('Aidilfitri', 'NOUN'),
('minggu', 'NOUN'),
('depan,', 'ADJ'),
('Perdana', 'PROPN'),
('Menteri', 'PROPN'),
('Tun', 'PROPN'),
('Dr', 'PROPN'),
('Mahathir', 'PROPN'),
('Mohamad', 'PROPN'),
('dan', 'CCONJ'),
('Menteri', 'PROPN'),
('Pengangkutan', 'PROPN'),
('Anthony', 'PROPN'),
('Loke', 'PROPN'),
('Siew', 'PROPN'),
('Fook', 'PROPN'),
('menitipkan', 'VERB'),
('pesanan', 'NOUN'),
('khas', 'ADJ'),
('kepada', 'ADP'),
('orang', 'NOUN'),
('ramai', 'ADJ'),
('yang', 'PRON'),
('mahu', 'ADV'),
('pulang', 'VERB'),
('ke', 'ADP'),
('kampung', 'NOUN'),
('halaman', 'NOUN'),
('masing-masing.', 'DET'),
('Dalam', 'ADP'),
('video', 'NOUN'),
('pendek', 'ADJ'),
('terbitan', 'NOUN'),
('Jabatan', 'PROPN'),
('Keselamatan', 'PROPN'),
('Jalan', 'PROPN'),
('Raya', 'PROPN'),
('(JKJR)', 'PUNCT'),
('itu,', 'DET'),
('Dr', 'PROPN'),
('Mahathir', 'PROPN'),
('menasihati', 'VERB'),
('mereka', 'PRON'),
('supaya', 'SCONJ'),
('berhenti', 'VERB'),
('berehat', 'VERB'),
('dan', 'CCONJ'),
('tidur', 'VERB'),
('sebentar', 'NOUN'),
('sekiranya', 'SCONJ'),
('mengantuk', 'ADJ'),
('ketika', 'SCONJ'),
('memandu.', 'VERB')]
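Because `predict` returns a list of `(word, tag)` tuples, the two outputs can be compared with ordinary Python. The sketch below uses only the first seven pairs of each output above; `tag_agreement` is a hypothetical helper, not a Malaya function.

```python
from collections import Counter

# First seven pairs of each output above.
full = [
    ('Kuala', 'PROPN'), ('Lumpur:', 'PROPN'), ('Sempena', 'ADP'),
    ('sambutan', 'NOUN'), ('Aidilfitri', 'NOUN'), ('minggu', 'NOUN'),
    ('depan,', 'ADJ'),
]
quantized = [
    ('KUALA', 'PROPN'), ('LUMPUR:', 'PROPN'), ('Sempena', 'ADP'),
    ('sambutan', 'NOUN'), ('Aidilfitri', 'NOUN'), ('minggu', 'NOUN'),
    ('depan,', 'ADJ'),
]


def tag_agreement(a, b):
    """Fraction of positions where both outputs carry the same tag."""
    if len(a) != len(b):
        raise ValueError('outputs must cover the same number of tokens')
    same = sum(tag_a == tag_b for (_, tag_a), (_, tag_b) in zip(a, b))
    return same / len(a)


agreement = tag_agreement(full, quantized)
tag_counts = Counter(tag for _, tag in full)
print(agreement)
print(tag_counts)
```

For this excerpt the tags are identical; the two outputs differ only in the casing of the first two words.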
[8]:
model.analyze(string)
[8]:
{'words': ['Kuala',
'Lumpur:',
'Sempena',
'sambutan',
'Aidilfitri',
'minggu',
'depan,',
'Perdana',
'Menteri',
'Tun',
'Dr',
'Mahathir',
'Mohamad',
'dan',
'Menteri',
'Pengangkutan',
'Anthony',
'Loke',
'Siew',
'Fook',
'menitipkan',
'pesanan',
'khas',
'kepada',
'orang',
'ramai',
'yang',
'mahu',
'pulang',
'ke',
'kampung',
'halaman',
'masing-masing.',
'Dalam',
'video',
'pendek',
'terbitan',
'Jabatan',
'Keselamatan',
'Jalan',
'Raya',
'(JKJR)',
'itu,',
'Dr',
'Mahathir',
'menasihati',
'mereka',
'supaya',
'berhenti',
'berehat',
'dan',
'tidur',
'sebentar',
'sekiranya',
'mengantuk',
'ketika',
'memandu.'],
'tags': [{'text': 'Kuala Lumpur:',
'type': 'PROPN',
'score': 1.0,
'beginOffset': 0,
'endOffset': 1},
{'text': 'Sempena',
'type': 'ADP',
'score': 1.0,
'beginOffset': 2,
'endOffset': 2},
{'text': 'sambutan Aidilfitri minggu',
'type': 'NOUN',
'score': 1.0,
'beginOffset': 3,
'endOffset': 5},
{'text': 'depan,',
'type': 'ADJ',
'score': 1.0,
'beginOffset': 6,
'endOffset': 6},
{'text': 'Perdana Menteri Tun Dr Mahathir Mohamad',
'type': 'PROPN',
'score': 1.0,
'beginOffset': 7,
'endOffset': 12},
{'text': 'dan',
'type': 'CCONJ',
'score': 1.0,
'beginOffset': 13,
'endOffset': 13},
{'text': 'Menteri Pengangkutan Anthony Loke Siew Fook',
'type': 'PROPN',
'score': 1.0,
'beginOffset': 14,
'endOffset': 19},
{'text': 'menitipkan',
'type': 'VERB',
'score': 1.0,
'beginOffset': 20,
'endOffset': 20},
{'text': 'pesanan',
'type': 'NOUN',
'score': 1.0,
'beginOffset': 21,
'endOffset': 21},
{'text': 'khas',
'type': 'ADJ',
'score': 1.0,
'beginOffset': 22,
'endOffset': 22},
{'text': 'kepada',
'type': 'ADP',
'score': 1.0,
'beginOffset': 23,
'endOffset': 23},
{'text': 'orang',
'type': 'NOUN',
'score': 1.0,
'beginOffset': 24,
'endOffset': 24},
{'text': 'ramai',
'type': 'ADJ',
'score': 1.0,
'beginOffset': 25,
'endOffset': 25},
{'text': 'yang',
'type': 'PRON',
'score': 1.0,
'beginOffset': 26,
'endOffset': 26},
{'text': 'mahu',
'type': 'ADV',
'score': 1.0,
'beginOffset': 27,
'endOffset': 27},
{'text': 'pulang',
'type': 'VERB',
'score': 1.0,
'beginOffset': 28,
'endOffset': 28},
{'text': 'ke',
'type': 'ADP',
'score': 1.0,
'beginOffset': 29,
'endOffset': 29},
{'text': 'kampung halaman',
'type': 'NOUN',
'score': 1.0,
'beginOffset': 30,
'endOffset': 31},
{'text': 'masing-masing.',
'type': 'DET',
'score': 1.0,
'beginOffset': 32,
'endOffset': 32},
{'text': 'Dalam',
'type': 'ADP',
'score': 1.0,
'beginOffset': 33,
'endOffset': 33},
{'text': 'video',
'type': 'NOUN',
'score': 1.0,
'beginOffset': 34,
'endOffset': 34},
{'text': 'pendek',
'type': 'ADJ',
'score': 1.0,
'beginOffset': 35,
'endOffset': 35},
{'text': 'terbitan',
'type': 'NOUN',
'score': 1.0,
'beginOffset': 36,
'endOffset': 36},
{'text': 'Jabatan Keselamatan Jalan Raya',
'type': 'PROPN',
'score': 1.0,
'beginOffset': 37,
'endOffset': 40},
{'text': '(JKJR)',
'type': 'PUNCT',
'score': 1.0,
'beginOffset': 41,
'endOffset': 41},
{'text': 'itu,',
'type': 'DET',
'score': 1.0,
'beginOffset': 42,
'endOffset': 42},
{'text': 'Dr Mahathir',
'type': 'PROPN',
'score': 1.0,
'beginOffset': 43,
'endOffset': 44},
{'text': 'menasihati',
'type': 'VERB',
'score': 1.0,
'beginOffset': 45,
'endOffset': 45},
{'text': 'mereka',
'type': 'PRON',
'score': 1.0,
'beginOffset': 46,
'endOffset': 46},
{'text': 'supaya',
'type': 'SCONJ',
'score': 1.0,
'beginOffset': 47,
'endOffset': 47},
{'text': 'berhenti berehat',
'type': 'VERB',
'score': 1.0,
'beginOffset': 48,
'endOffset': 49},
{'text': 'dan',
'type': 'CCONJ',
'score': 1.0,
'beginOffset': 50,
'endOffset': 50},
{'text': 'tidur',
'type': 'VERB',
'score': 1.0,
'beginOffset': 51,
'endOffset': 51},
{'text': 'sebentar',
'type': 'NOUN',
'score': 1.0,
'beginOffset': 52,
'endOffset': 52},
{'text': 'sekiranya',
'type': 'SCONJ',
'score': 1.0,
'beginOffset': 53,
'endOffset': 53},
{'text': 'mengantuk',
'type': 'ADJ',
'score': 1.0,
'beginOffset': 54,
'endOffset': 54},
{'text': 'ketika',
'type': 'SCONJ',
'score': 1.0,
'beginOffset': 55,
'endOffset': 55}]}
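In the output above, each entry in `tags` covers a run of consecutive words with the same tag, and `beginOffset` and `endOffset` are inclusive indices into `words`. The sketch below reproduces that grouping with `itertools.groupby`. It is an assumption about how the spans relate to the per-word tags, not Malaya's implementation, and it omits the `score` field.

```python
from itertools import groupby


def group_tags(pairs):
    """Merge consecutive (word, tag) pairs that share a tag into spans."""
    spans, start = [], 0
    for tag, run in groupby(pairs, key=lambda pair: pair[1]):
        words = [word for word, _ in run]
        spans.append({
            'text': ' '.join(words),
            'type': tag,
            'beginOffset': start,                  # index of the first word
            'endOffset': start + len(words) - 1,   # index of the last word
        })
        start += len(words)
    return spans


# First seven pairs of the predict output shown earlier.
pairs = [
    ('Kuala', 'PROPN'), ('Lumpur:', 'PROPN'), ('Sempena', 'ADP'),
    ('sambutan', 'NOUN'), ('Aidilfitri', 'NOUN'), ('minggu', 'NOUN'),
    ('depan,', 'ADJ'),
]
spans = group_tags(pairs)
for span in spans:
    print(span)
```

A span's text can be recovered from the word list with `words[beginOffset:endOffset + 1]`.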
Vectorize
Let's say you want to visualize word-level vectors in a lower dimension. You can use model.vectorize:
def vectorize(self, string: str):
    """
    vectorize a string.

    Parameters
    ----------
    string: str

    Returns
    -------
    result: np.array
    """
[8]:
strings = [string,
'Husein baca buku Perlembagaan yang berharga 3k ringgit dekat kfc sungai petani minggu lepas, 2 ptg 2 oktober 2019 , suhu 32 celcius, sambil makan ayam goreng dan milo o ais',
'contact Husein at husein.zol05@gmail.com',
'tolong tempahkan meja makan makan nasi dagang dan jus apple, milo tarik esok dekat Restoran Sebulek']
[9]:
r = [quantized_model.vectorize(string) for string in strings]
[10]:
x, y = [], []
for row in r:
    x.extend([i[0] for i in row])
    y.extend([i[1] for i in row])
[11]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
tsne = TSNE().fit_transform(y)
tsne.shape
[11]:
(108, 2)
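The 108 rows correspond to one vector per word across the four inputs. As a sanity check, the sketch below counts whitespace-separated tokens in the same strings. That Malaya splits on whitespace here is an observation from the outputs on this page, not a documented guarantee.

```python
# The four inputs used above, copied verbatim.
strings = [
    'KUALA LUMPUR: Sempena sambutan Aidilfitri minggu depan, Perdana Menteri Tun Dr Mahathir Mohamad dan Menteri Pengangkutan Anthony Loke Siew Fook menitipkan pesanan khas kepada orang ramai yang mahu pulang ke kampung halaman masing-masing. Dalam video pendek terbitan Jabatan Keselamatan Jalan Raya (JKJR) itu, Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.',
    'Husein baca buku Perlembagaan yang berharga 3k ringgit dekat kfc sungai petani minggu lepas, 2 ptg 2 oktober 2019 , suhu 32 celcius, sambil makan ayam goreng dan milo o ais',
    'contact Husein at husein.zol05@gmail.com',
    'tolong tempahkan meja makan makan nasi dagang dan jus apple, milo tarik esok dekat Restoran Sebulek',
]

# Count whitespace-separated tokens per input and in total.
counts = [len(s.split()) for s in strings]
total = sum(counts)
print(counts, total)
```

The total matches the first dimension of `tsne.shape`, so each point in the plot is one word.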
[12]:
plt.figure(figsize = (7, 7))
plt.scatter(tsne[:, 0], tsne[:, 1])
labels = x
# px, py are the 2-D coordinates; separate names avoid overwriting x and y
for label, px, py in zip(
    labels, tsne[:, 0], tsne[:, 1]
):
    label = (
        '%s, %.3f' % (label[0], label[1])
        if isinstance(label, list)
        else label
    )
    plt.annotate(
        label,
        xy = (px, py),
        xytext = (0, 0),
        textcoords = 'offset points',
    )

Pretty good: the model is able to cluster words with similar parts of speech.
Voting stack model
[ ]:
alxlnet = malaya.pos.transformer(model = 'alxlnet')
malaya.stack.voting_stack([model, alxlnet, alxlnet], string)
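The idea behind a voting stack can be sketched as a per-word majority vote over the models' predictions. This is an assumption about the general technique, not a description of what `malaya.stack.voting_stack` does internally; `majority_vote` and the sample predictions are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Pick the most common tag per word across several models.

    predictions: one [(word, tag), ...] list per model, all over the same words.
    """
    voted = []
    for column in zip(*predictions):
        word = column[0][0]
        tags = [tag for _, tag in column]
        # On a tie, most_common keeps the tag that was seen first.
        voted.append((word, Counter(tags).most_common(1)[0][0]))
    return voted


# Made-up predictions from three models for two words.
model_a = [('mahu', 'ADV'), ('pulang', 'VERB')]
model_b = [('mahu', 'VERB'), ('pulang', 'VERB')]
model_c = [('mahu', 'VERB'), ('pulang', 'VERB')]

result = majority_vote([model_a, model_b, model_c])
print(result)
```

Under this scheme, passing the same model twice, as the call above does with `alxlnet`, gives that model two votes.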