Part-of-Speech Recognition#
This tutorial is available as an IPython notebook at Malaya/example/part-of-speech.
This module was trained only on standard language structure, so it is not safe to use it on local (colloquial) language structure.
[1]:
import logging
logging.basicConfig(level=logging.INFO)
[2]:
%%time
import malaya
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
CPU times: user 5.38 s, sys: 899 ms, total: 6.28 s
Wall time: 6.42 s
Describe supported POS#
[3]:
malaya.pos.describe()
[3]:
Index | Tag | Description |
---|---|---|
0 | ADJ | Adjective, kata sifat |
1 | ADP | Adposition |
2 | ADV | Adverb, kata keterangan |
3 | ADX | Auxiliary verb, kata kerja tambahan |
4 | CCONJ | Coordinating conjuction, kata hubung |
5 | DET | Determiner, kata penentu |
6 | NOUN | Noun, kata nama |
7 | NUM | Number, nombor |
8 | PART | Particle |
9 | PRON | Pronoun, kata ganti |
10 | PROPN | Proper noun, kata ganti nama khas |
11 | SCONJ | Subordinating conjunction |
12 | SYM | Symbol |
13 | VERB | Verb, kata kerja |
14 | X | Other |
List available Transformer POS models#
[4]:
malaya.pos.available_transformer()
INFO:malaya.pos:trained on 80% dataset, tested on another 20% test set, dataset at https://github.com/huseinzol05/Malay-Dataset/tree/master/tagging/part-of-speech
[4]:
Model | Size (MB) | Quantized Size (MB) | macro precision | macro recall | macro f1-score |
---|---|---|---|---|---|
bert | 426.4 | 111.00 | 0.93280 | 0.93129 | 0.93181 |
tiny-bert | 57.7 | 15.40 | 0.92810 | 0.92649 | 0.92704 |
albert | 48.7 | 12.80 | 0.93199 | 0.91948 | 0.92547 |
tiny-albert | 22.4 | 5.98 | 0.90579 | 0.89501 | 0.90002 |
xlnet | 446.6 | 118.00 | 0.93303 | 0.93222 | 0.93236 |
alxlnet | 46.8 | 13.30 | 0.92732 | 0.93046 | 0.92819 |
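The table above can be used to pick a model for a given size budget. The following is a minimal sketch, not part of Malaya: the `MODELS` dictionary and the `best_model_under` helper are hypothetical names, and the figures are copied by hand from the table. Note that the f1 scores are reported for the full-size models, so ranking quantized models by them assumes the accuracy drop from quantization is small.

```python
# Figures copied from the table above: (size MB, quantized size MB, macro f1).
# The macro f1 values are those of the non-quantized models.
MODELS = {
    'bert': (426.4, 111.00, 0.93181),
    'tiny-bert': (57.7, 15.40, 0.92704),
    'albert': (48.7, 12.80, 0.92547),
    'tiny-albert': (22.4, 5.98, 0.90002),
    'xlnet': (446.6, 118.00, 0.93236),
    'alxlnet': (46.8, 13.30, 0.92819),
}


def best_model_under(budget_mb, quantized=False):
    """Return the highest-f1 model whose size fits the budget, or None."""
    size_index = 1 if quantized else 0
    candidates = [
        (stats[2], name)
        for name, stats in MODELS.items()
        if stats[size_index] <= budget_mb
    ]
    return max(candidates)[1] if candidates else None


print(best_model_under(60))                  # full-size models up to 60 MB
print(best_model_under(20, quantized=True))  # quantized models up to 20 MB
```

The returned name can then be passed as the `model` argument when loading a model.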
[4]:
string = 'KUALA LUMPUR: Sempena sambutan Aidilfitri minggu depan, Perdana Menteri Tun Dr Mahathir Mohamad dan Menteri Pengangkutan Anthony Loke Siew Fook menitipkan pesanan khas kepada orang ramai yang mahu pulang ke kampung halaman masing-masing. Dalam video pendek terbitan Jabatan Keselamatan Jalan Raya (JKJR) itu, Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'
Load Transformer model#
def transformer(model: str = 'xlnet', quantized: bool = False, **kwargs):
    """
    Load Transformer POS Tagging model, transfer learning Transformer + CRF.

    Parameters
    ----------
    model : str, optional (default='xlnet')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.
    quantized : bool, optional (default=False)
        If True, will load the 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result : malaya.supervised.tag.transformer function
    """
[5]:
model = malaya.pos.transformer(model = 'albert')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:112: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:114: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/albert/tokenization.py:240: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.
INFO:tensorflow:loading sentence piece model
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:107: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.
Load Quantized model#
To load an 8-bit quantized model, simply pass quantized = True; the default is False.
Expect a slight accuracy drop from the quantized model. It is also not necessarily faster than the normal 32-bit float model; that depends entirely on the machine.
[6]:
quantized_model = malaya.pos.transformer(model = 'albert', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
INFO:tensorflow:loading sentence piece model
Predict#
def predict(self, string: str):
    """
    Tag a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: List[Tuple[str, str]]
    """
[7]:
model.predict(string)
[7]:
[('KUALA', 'PROPN'),
('LUMPUR:', 'PROPN'),
('Sempena', 'ADP'),
('sambutan', 'NOUN'),
('Aidilfitri', 'NOUN'),
('minggu', 'NOUN'),
('depan,', 'ADJ'),
('Perdana', 'PROPN'),
('Menteri', 'PROPN'),
('Tun', 'PROPN'),
('Dr', 'PROPN'),
('Mahathir', 'PROPN'),
('Mohamad', 'PROPN'),
('dan', 'CCONJ'),
('Menteri', 'PROPN'),
('Pengangkutan', 'PROPN'),
('Anthony', 'PROPN'),
('Loke', 'PROPN'),
('Siew', 'PROPN'),
('Fook', 'PROPN'),
('menitipkan', 'VERB'),
('pesanan', 'NOUN'),
('khas', 'ADJ'),
('kepada', 'ADP'),
('orang', 'NOUN'),
('ramai', 'ADJ'),
('yang', 'PRON'),
('mahu', 'ADV'),
('pulang', 'VERB'),
('ke', 'ADP'),
('kampung', 'NOUN'),
('halaman', 'NOUN'),
('masing-masing.', 'DET'),
('Dalam', 'ADP'),
('video', 'NOUN'),
('pendek', 'ADJ'),
('terbitan', 'NOUN'),
('Jabatan', 'PROPN'),
('Keselamatan', 'PROPN'),
('Jalan', 'PROPN'),
('Raya', 'PROPN'),
('(JKJR)', 'PUNCT'),
('itu,', 'DET'),
('Dr', 'PROPN'),
('Mahathir', 'PROPN'),
('menasihati', 'VERB'),
('mereka', 'PRON'),
('supaya', 'SCONJ'),
('berhenti', 'VERB'),
('berehat', 'VERB'),
('dan', 'CCONJ'),
('tidur', 'VERB'),
('sebentar', 'NOUN'),
('sekiranya', 'SCONJ'),
('mengantuk', 'ADJ'),
('ketika', 'SCONJ'),
('memandu.', 'VERB')]
[8]:
quantized_model.predict(string)
[8]:
[('KUALA', 'PROPN'),
('LUMPUR:', 'PROPN'),
('Sempena', 'ADP'),
('sambutan', 'NOUN'),
('Aidilfitri', 'NOUN'),
('minggu', 'NOUN'),
('depan,', 'ADJ'),
('Perdana', 'PROPN'),
('Menteri', 'PROPN'),
('Tun', 'PROPN'),
('Dr', 'PROPN'),
('Mahathir', 'PROPN'),
('Mohamad', 'PROPN'),
('dan', 'CCONJ'),
('Menteri', 'PROPN'),
('Pengangkutan', 'PROPN'),
('Anthony', 'PROPN'),
('Loke', 'PROPN'),
('Siew', 'PROPN'),
('Fook', 'PROPN'),
('menitipkan', 'VERB'),
('pesanan', 'NOUN'),
('khas', 'ADJ'),
('kepada', 'ADP'),
('orang', 'NOUN'),
('ramai', 'ADJ'),
('yang', 'PRON'),
('mahu', 'ADV'),
('pulang', 'VERB'),
('ke', 'ADP'),
('kampung', 'NOUN'),
('halaman', 'NOUN'),
('masing-masing.', 'DET'),
('Dalam', 'ADP'),
('video', 'NOUN'),
('pendek', 'ADJ'),
('terbitan', 'NOUN'),
('Jabatan', 'PROPN'),
('Keselamatan', 'PROPN'),
('Jalan', 'PROPN'),
('Raya', 'PROPN'),
('(JKJR)', 'PUNCT'),
('itu,', 'DET'),
('Dr', 'PROPN'),
('Mahathir', 'PROPN'),
('menasihati', 'VERB'),
('mereka', 'PRON'),
('supaya', 'SCONJ'),
('berhenti', 'VERB'),
('berehat', 'VERB'),
('dan', 'CCONJ'),
('tidur', 'VERB'),
('sebentar', 'NOUN'),
('sekiranya', 'SCONJ'),
('mengantuk', 'ADJ'),
('ketika', 'SCONJ'),
('memandu.', 'VERB')]
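Because a quantized model may lose some accuracy, it can be useful to compare its tags with those of the full model token by token. The helper below is a minimal sketch and not part of Malaya; `tag_disagreements` is a hypothetical name, and the second sample list contains a deliberately altered tag purely for illustration.

```python
def tag_disagreements(first, second):
    """Return (index, word, first_tag, second_tag) wherever two taggings differ."""
    if len(first) != len(second):
        raise ValueError('taggings must cover the same tokens')
    return [
        (i, word, tag_a, tag_b)
        for i, ((word, tag_a), (_, tag_b)) in enumerate(zip(first, second))
        if tag_a != tag_b
    ]


# First three tokens from the outputs above.
full = [('KUALA', 'PROPN'), ('LUMPUR:', 'PROPN'), ('Sempena', 'ADP')]
# Illustrative only: 'Sempena' is changed by hand to show a disagreement.
altered = [('KUALA', 'PROPN'), ('LUMPUR:', 'PROPN'), ('Sempena', 'NOUN')]

print(tag_disagreements(full, full))
print(tag_disagreements(full, altered))
```

For the two outputs shown above, which are identical, calling `tag_disagreements(model.predict(string), quantized_model.predict(string))` would return an empty list.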
Group similar tags#
def analyze(self, string: str):
    """
    Analyze a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: {'words': List[str], 'tags': [{'text': 'text', 'type': 'location', 'score': 1.0, 'beginOffset': 0, 'endOffset': 1}]}
    """
[9]:
model.analyze(string)
[9]:
{'words': ['KUALA',
'LUMPUR:',
'Sempena',
'sambutan',
'Aidilfitri',
'minggu',
'depan,',
'Perdana',
'Menteri',
'Tun',
'Dr',
'Mahathir',
'Mohamad',
'dan',
'Menteri',
'Pengangkutan',
'Anthony',
'Loke',
'Siew',
'Fook',
'menitipkan',
'pesanan',
'khas',
'kepada',
'orang',
'ramai',
'yang',
'mahu',
'pulang',
'ke',
'kampung',
'halaman',
'masing-masing.',
'Dalam',
'video',
'pendek',
'terbitan',
'Jabatan',
'Keselamatan',
'Jalan',
'Raya',
'(JKJR)',
'itu,',
'Dr',
'Mahathir',
'menasihati',
'mereka',
'supaya',
'berhenti',
'berehat',
'dan',
'tidur',
'sebentar',
'sekiranya',
'mengantuk',
'ketika',
'memandu.'],
'tags': [{'text': 'KUALA LUMPUR:',
'type': 'PROPN',
'score': 1.0,
'beginOffset': 0,
'endOffset': 1},
{'text': 'Sempena',
'type': 'ADP',
'score': 1.0,
'beginOffset': 2,
'endOffset': 2},
{'text': 'sambutan Aidilfitri minggu',
'type': 'NOUN',
'score': 1.0,
'beginOffset': 3,
'endOffset': 5},
{'text': 'depan,',
'type': 'ADJ',
'score': 1.0,
'beginOffset': 6,
'endOffset': 6},
{'text': 'Perdana Menteri Tun Dr Mahathir Mohamad',
'type': 'PROPN',
'score': 1.0,
'beginOffset': 7,
'endOffset': 12},
{'text': 'dan',
'type': 'CCONJ',
'score': 1.0,
'beginOffset': 13,
'endOffset': 13},
{'text': 'Menteri Pengangkutan Anthony Loke Siew Fook',
'type': 'PROPN',
'score': 1.0,
'beginOffset': 14,
'endOffset': 19},
{'text': 'menitipkan',
'type': 'VERB',
'score': 1.0,
'beginOffset': 20,
'endOffset': 20},
{'text': 'pesanan',
'type': 'NOUN',
'score': 1.0,
'beginOffset': 21,
'endOffset': 21},
{'text': 'khas',
'type': 'ADJ',
'score': 1.0,
'beginOffset': 22,
'endOffset': 22},
{'text': 'kepada',
'type': 'ADP',
'score': 1.0,
'beginOffset': 23,
'endOffset': 23},
{'text': 'orang',
'type': 'NOUN',
'score': 1.0,
'beginOffset': 24,
'endOffset': 24},
{'text': 'ramai',
'type': 'ADJ',
'score': 1.0,
'beginOffset': 25,
'endOffset': 25},
{'text': 'yang',
'type': 'PRON',
'score': 1.0,
'beginOffset': 26,
'endOffset': 26},
{'text': 'mahu',
'type': 'ADV',
'score': 1.0,
'beginOffset': 27,
'endOffset': 27},
{'text': 'pulang',
'type': 'VERB',
'score': 1.0,
'beginOffset': 28,
'endOffset': 28},
{'text': 'ke',
'type': 'ADP',
'score': 1.0,
'beginOffset': 29,
'endOffset': 29},
{'text': 'kampung halaman',
'type': 'NOUN',
'score': 1.0,
'beginOffset': 30,
'endOffset': 31},
{'text': 'masing-masing.',
'type': 'DET',
'score': 1.0,
'beginOffset': 32,
'endOffset': 32},
{'text': 'Dalam',
'type': 'ADP',
'score': 1.0,
'beginOffset': 33,
'endOffset': 33},
{'text': 'video',
'type': 'NOUN',
'score': 1.0,
'beginOffset': 34,
'endOffset': 34},
{'text': 'pendek',
'type': 'ADJ',
'score': 1.0,
'beginOffset': 35,
'endOffset': 35},
{'text': 'terbitan',
'type': 'NOUN',
'score': 1.0,
'beginOffset': 36,
'endOffset': 36},
{'text': 'Jabatan Keselamatan Jalan Raya',
'type': 'PROPN',
'score': 1.0,
'beginOffset': 37,
'endOffset': 40},
{'text': '(JKJR)',
'type': 'PUNCT',
'score': 1.0,
'beginOffset': 41,
'endOffset': 41},
{'text': 'itu,',
'type': 'DET',
'score': 1.0,
'beginOffset': 42,
'endOffset': 42},
{'text': 'Dr Mahathir',
'type': 'PROPN',
'score': 1.0,
'beginOffset': 43,
'endOffset': 44},
{'text': 'menasihati',
'type': 'VERB',
'score': 1.0,
'beginOffset': 45,
'endOffset': 45},
{'text': 'mereka',
'type': 'PRON',
'score': 1.0,
'beginOffset': 46,
'endOffset': 46},
{'text': 'supaya',
'type': 'SCONJ',
'score': 1.0,
'beginOffset': 47,
'endOffset': 47},
{'text': 'berhenti berehat',
'type': 'VERB',
'score': 1.0,
'beginOffset': 48,
'endOffset': 49},
{'text': 'dan',
'type': 'CCONJ',
'score': 1.0,
'beginOffset': 50,
'endOffset': 50},
{'text': 'tidur',
'type': 'VERB',
'score': 1.0,
'beginOffset': 51,
'endOffset': 51},
{'text': 'sebentar',
'type': 'NOUN',
'score': 1.0,
'beginOffset': 52,
'endOffset': 52},
{'text': 'sekiranya',
'type': 'SCONJ',
'score': 1.0,
'beginOffset': 53,
'endOffset': 53},
{'text': 'mengantuk',
'type': 'ADJ',
'score': 1.0,
'beginOffset': 54,
'endOffset': 54},
{'text': 'ketika',
'type': 'SCONJ',
'score': 1.0,
'beginOffset': 55,
'endOffset': 55}]}
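In the output above, consecutive tokens with the same tag are merged into one entry, and beginOffset and endOffset are inclusive token indices. The sketch below reproduces that grouping with `itertools.groupby`. It is an illustration under that assumption, not Malaya's actual implementation; `group_tags` is a hypothetical name, and the `score` field is omitted.

```python
from itertools import groupby


def group_tags(tagged):
    """Merge runs of consecutive tokens sharing a tag into one entry."""
    groups = []
    index = 0  # token index of the first word in the current run
    for tag, run in groupby(tagged, key=lambda pair: pair[1]):
        words = [word for word, _ in run]
        groups.append({
            'text': ' '.join(words),
            'type': tag,
            'beginOffset': index,
            'endOffset': index + len(words) - 1,
        })
        index += len(words)
    return groups


# First six tokens of the predict output shown earlier.
sample = [
    ('KUALA', 'PROPN'), ('LUMPUR:', 'PROPN'), ('Sempena', 'ADP'),
    ('sambutan', 'NOUN'), ('Aidilfitri', 'NOUN'), ('minggu', 'NOUN'),
]
for group in group_tags(sample):
    print(group)
```

On this sample the sketch yields the same text, type, and offsets as the first three entries of the 'tags' list above.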
Vectorize#
Suppose you want to visualize word-level vectors in a lower dimension; you can use model.vectorize:
def vectorize(self, string: str):
    """
    Vectorize a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: np.array
    """
[10]:
strings = [string,
'Husein baca buku Perlembagaan yang berharga 3k ringgit dekat kfc sungai petani minggu lepas, 2 ptg 2 oktober 2019 , suhu 32 celcius, sambil makan ayam goreng dan milo o ais',
'contact Husein at husein.zol05@gmail.com',
'tolong tempahkan meja makan makan nasi dagang dan jus apple, milo tarik esok dekat Restoran Sebulek']
[11]:
r = [quantized_model.vectorize(string) for string in strings]
[12]:
x, y = [], []
for row in r:
    x.extend([i[0] for i in row])
    y.extend([i[1] for i in row])
[13]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
tsne = TSNE().fit_transform(y)
tsne.shape
[13]:
(108, 2)
[14]:
plt.figure(figsize = (7, 7))
plt.scatter(tsne[:, 0], tsne[:, 1])
labels = x
# Annotate each point with its word; px and py avoid shadowing x and y.
for label, px, py in zip(labels, tsne[:, 0], tsne[:, 1]):
    label = (
        '%s, %.3f' % (label[0], label[1])
        if isinstance(label, list)
        else label
    )
    plt.annotate(
        label,
        xy = (px, py),
        xytext = (0, 0),
        textcoords = 'offset points',
    )

Pretty good: the model is able to cluster words with similar parts of speech.
Voting stack model#
[16]:
alxlnet = malaya.pos.transformer(model = 'alxlnet')
malaya.stack.voting_stack([model, alxlnet, alxlnet], string)
[16]:
[('KUALA', 'PROPN'),
('LUMPUR:', 'PROPN'),
('Sempena', 'ADP'),
('sambutan', 'NOUN'),
('Aidilfitri', 'PROPN'),
('minggu', 'NOUN'),
('depan,', 'ADJ'),
('Perdana', 'PROPN'),
('Menteri', 'PROPN'),
('Tun', 'PROPN'),
('Dr', 'PROPN'),
('Mahathir', 'PROPN'),
('Mohamad', 'PROPN'),
('dan', 'CCONJ'),
('Menteri', 'PROPN'),
('Pengangkutan', 'PROPN'),
('Anthony', 'PROPN'),
('Loke', 'PROPN'),
('Siew', 'PROPN'),
('Fook', 'PROPN'),
('menitipkan', 'VERB'),
('pesanan', 'NOUN'),
('khas', 'ADJ'),
('kepada', 'ADP'),
('orang', 'NOUN'),
('ramai', 'ADJ'),
('yang', 'PRON'),
('mahu', 'ADV'),
('pulang', 'VERB'),
('ke', 'ADP'),
('kampung', 'NOUN'),
('halaman', 'NOUN'),
('masing-masing.', 'ADV'),
('Dalam', 'ADP'),
('video', 'NOUN'),
('pendek', 'ADJ'),
('terbitan', 'NOUN'),
('Jabatan', 'NOUN'),
('Keselamatan', 'PROPN'),
('Jalan', 'PROPN'),
('Raya', 'PROPN'),
('(JKJR)', 'PUNCT'),
('itu,', 'DET'),
('Dr', 'PROPN'),
('Mahathir', 'PROPN'),
('menasihati', 'VERB'),
('mereka', 'PRON'),
('supaya', 'SCONJ'),
('berhenti', 'VERB'),
('berehat', 'VERB'),
('dan', 'CCONJ'),
('tidur', 'VERB'),
('sebentar', 'ADV'),
('sekiranya', 'SCONJ'),
('mengantuk', 'ADJ'),
('ketika', 'SCONJ'),
('memandu.', 'VERB')]
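The idea behind a voting stack can be sketched as a per-token majority vote over the tags from each model. The code below is a simplified illustration and not necessarily how malaya.stack.voting_stack is implemented; `majority_vote` is a hypothetical name, and the alxlnet tags in the sample are assumed for the sake of the example. Since alxlnet is passed twice in the call above, it holds two of the three votes wherever it disagrees with albert.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-token tags from several models by majority vote.

    `predictions` is a list of taggings, each a list of (word, tag) pairs
    over the same tokens. Ties go to the tag seen first.
    """
    voted = []
    for token_predictions in zip(*predictions):
        word = token_predictions[0][0]
        tags = [tag for _, tag in token_predictions]
        voted.append((word, Counter(tags).most_common(1)[0][0]))
    return voted


# albert tags are taken from the predict output earlier on this page.
albert_tags = [('sambutan', 'NOUN'), ('Aidilfitri', 'NOUN')]
# Assumed alxlnet tags, for illustration only.
alxlnet_tags = [('sambutan', 'NOUN'), ('Aidilfitri', 'PROPN')]

print(majority_vote([albert_tags, alxlnet_tags, alxlnet_tags]))
```

With these sample inputs, 'Aidilfitri' receives one NOUN vote and two PROPN votes, so the combined tag is PROPN.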