Part-of-Speech Recognition

This tutorial is available as an IPython notebook at Malaya/example/part-of-speech.

This module only trained on standard language structure, so it is not save to use it for local language structure.

[1]:
%%time
import malaya
CPU times: user 4.09 s, sys: 556 ms, total: 4.65 s
Wall time: 3.75 s

Describe supported POS

[2]:
malaya.pos.describe()
[2]:
Tag Description
0 ADJ Adjective, kata sifat
1 ADP Adposition
2 ADV Adverb, kata keterangan
3 ADX Auxiliary verb, kata kerja tambahan
4 CCONJ Coordinating conjuction, kata hubung
5 DET Determiner, kata penentu
6 NOUN Noun, kata nama
7 NUM Number, nombor
8 PART Particle
9 PRON Pronoun, kata ganti
10 PROPN Proper noun, kata ganti nama khas
11 SCONJ Subordinating conjunction
12 SYM Symbol
13 VERB Verb, kata kerja
14 X Other

List available Transformer POS models

[3]:
malaya.pos.available_transformer()
INFO:root:tested on 20% test set.
[3]:
Size (MB) Quantized Size (MB) macro precision macro recall macro f1-score
bert 426.4 111.00 0.93280 0.93129 0.93181
tiny-bert 57.7 15.40 0.92810 0.92649 0.92704
albert 48.7 12.80 0.93199 0.91948 0.92547
tiny-albert 22.4 5.98 0.90579 0.89501 0.90002
xlnet 446.6 118.00 0.93303 0.93222 0.93236
alxlnet 46.8 13.30 0.92732 0.93046 0.92819

Make sure you can check accuracy chart from here first before select a model, https://malaya.readthedocs.io/en/latest/Accuracy.html#pos-recognition

You might want to use Tiny-Albert, a very small size, 22.4MB, but the accuracy is still on the top notch.

[4]:
string = 'KUALA LUMPUR: Sempena sambutan Aidilfitri minggu depan, Perdana Menteri Tun Dr Mahathir Mohamad dan Menteri Pengangkutan Anthony Loke Siew Fook menitipkan pesanan khas kepada orang ramai yang mahu pulang ke kampung halaman masing-masing. Dalam video pendek terbitan Jabatan Keselamatan Jalan Raya (JKJR) itu, Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar  sekiranya mengantuk ketika memandu.'

Load Transformer model

def transformer(model: str = 'xlnet', quantized: bool = False, **kwargs):
    """
    Load Transformer POS Tagging model, transfer learning Transformer + CRF.

    Parameters
    ----------
    model : str, optional (default='bert')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result : malaya.supervised.tag.transformer function
    """
[5]:
model = malaya.pos.transformer(model = 'albert')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:112: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:112: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:114: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:114: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/albert/tokenization.py:240: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/albert/tokenization.py:240: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

INFO:tensorflow:loading sentence piece model
INFO:tensorflow:loading sentence piece model
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:107: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:107: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

Load Quantized model

To load 8-bit quantized model, simply pass quantized = True, default is False.

We can expect slightly accuracy drop from quantized model, and not necessary faster than normal 32-bit float model, totally depends on machine.

[6]:
quantized_model = malaya.pos.transformer(model = 'albert', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
INFO:tensorflow:loading sentence piece model
INFO:tensorflow:loading sentence piece model

Predict

def predict(self, string: str):
    """
    Tag a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: Tuple[str, str]
    """
[7]:
model.predict(string)
[7]:
[('KUALA', 'PROPN'),
 ('LUMPUR:', 'PROPN'),
 ('Sempena', 'ADP'),
 ('sambutan', 'NOUN'),
 ('Aidilfitri', 'NOUN'),
 ('minggu', 'NOUN'),
 ('depan,', 'ADJ'),
 ('Perdana', 'PROPN'),
 ('Menteri', 'PROPN'),
 ('Tun', 'PROPN'),
 ('Dr', 'PROPN'),
 ('Mahathir', 'PROPN'),
 ('Mohamad', 'PROPN'),
 ('dan', 'CCONJ'),
 ('Menteri', 'PROPN'),
 ('Pengangkutan', 'PROPN'),
 ('Anthony', 'PROPN'),
 ('Loke', 'PROPN'),
 ('Siew', 'PROPN'),
 ('Fook', 'PROPN'),
 ('menitipkan', 'VERB'),
 ('pesanan', 'NOUN'),
 ('khas', 'ADJ'),
 ('kepada', 'ADP'),
 ('orang', 'NOUN'),
 ('ramai', 'ADJ'),
 ('yang', 'PRON'),
 ('mahu', 'ADV'),
 ('pulang', 'VERB'),
 ('ke', 'ADP'),
 ('kampung', 'NOUN'),
 ('halaman', 'NOUN'),
 ('masing-masing.', 'DET'),
 ('Dalam', 'ADP'),
 ('video', 'NOUN'),
 ('pendek', 'ADJ'),
 ('terbitan', 'NOUN'),
 ('Jabatan', 'PROPN'),
 ('Keselamatan', 'PROPN'),
 ('Jalan', 'PROPN'),
 ('Raya', 'PROPN'),
 ('(JKJR)', 'PUNCT'),
 ('itu,', 'DET'),
 ('Dr', 'PROPN'),
 ('Mahathir', 'PROPN'),
 ('menasihati', 'VERB'),
 ('mereka', 'PRON'),
 ('supaya', 'SCONJ'),
 ('berhenti', 'VERB'),
 ('berehat', 'VERB'),
 ('dan', 'CCONJ'),
 ('tidur', 'VERB'),
 ('sebentar', 'NOUN'),
 ('sekiranya', 'SCONJ'),
 ('mengantuk', 'ADJ'),
 ('ketika', 'SCONJ'),
 ('memandu.', 'VERB')]
[8]:
quantized_model.predict(string)
[8]:
[('KUALA', 'PROPN'),
 ('LUMPUR:', 'PROPN'),
 ('Sempena', 'ADP'),
 ('sambutan', 'NOUN'),
 ('Aidilfitri', 'NOUN'),
 ('minggu', 'NOUN'),
 ('depan,', 'ADJ'),
 ('Perdana', 'PROPN'),
 ('Menteri', 'PROPN'),
 ('Tun', 'PROPN'),
 ('Dr', 'PROPN'),
 ('Mahathir', 'PROPN'),
 ('Mohamad', 'PROPN'),
 ('dan', 'CCONJ'),
 ('Menteri', 'PROPN'),
 ('Pengangkutan', 'PROPN'),
 ('Anthony', 'PROPN'),
 ('Loke', 'PROPN'),
 ('Siew', 'PROPN'),
 ('Fook', 'PROPN'),
 ('menitipkan', 'VERB'),
 ('pesanan', 'NOUN'),
 ('khas', 'ADJ'),
 ('kepada', 'ADP'),
 ('orang', 'NOUN'),
 ('ramai', 'ADJ'),
 ('yang', 'PRON'),
 ('mahu', 'ADV'),
 ('pulang', 'VERB'),
 ('ke', 'ADP'),
 ('kampung', 'NOUN'),
 ('halaman', 'NOUN'),
 ('masing-masing.', 'DET'),
 ('Dalam', 'ADP'),
 ('video', 'NOUN'),
 ('pendek', 'ADJ'),
 ('terbitan', 'NOUN'),
 ('Jabatan', 'PROPN'),
 ('Keselamatan', 'PROPN'),
 ('Jalan', 'PROPN'),
 ('Raya', 'PROPN'),
 ('(JKJR)', 'PUNCT'),
 ('itu,', 'DET'),
 ('Dr', 'PROPN'),
 ('Mahathir', 'PROPN'),
 ('menasihati', 'VERB'),
 ('mereka', 'PRON'),
 ('supaya', 'SCONJ'),
 ('berhenti', 'VERB'),
 ('berehat', 'VERB'),
 ('dan', 'CCONJ'),
 ('tidur', 'VERB'),
 ('sebentar', 'NOUN'),
 ('sekiranya', 'SCONJ'),
 ('mengantuk', 'ADJ'),
 ('ketika', 'SCONJ'),
 ('memandu.', 'VERB')]

Group similar tags

def analyze(self, string: str):
        """
        Analyze a string.

        Parameters
        ----------
        string : str

        Returns
        -------
        result: {'words': List[str], 'tags': [{'text': 'text', 'type': 'location', 'score': 1.0, 'beginOffset': 0, 'endOffset': 1}]}
        """
[9]:
model.analyze(string)
[9]:
{'words': ['KUALA',
  'LUMPUR:',
  'Sempena',
  'sambutan',
  'Aidilfitri',
  'minggu',
  'depan,',
  'Perdana',
  'Menteri',
  'Tun',
  'Dr',
  'Mahathir',
  'Mohamad',
  'dan',
  'Menteri',
  'Pengangkutan',
  'Anthony',
  'Loke',
  'Siew',
  'Fook',
  'menitipkan',
  'pesanan',
  'khas',
  'kepada',
  'orang',
  'ramai',
  'yang',
  'mahu',
  'pulang',
  'ke',
  'kampung',
  'halaman',
  'masing-masing.',
  'Dalam',
  'video',
  'pendek',
  'terbitan',
  'Jabatan',
  'Keselamatan',
  'Jalan',
  'Raya',
  '(JKJR)',
  'itu,',
  'Dr',
  'Mahathir',
  'menasihati',
  'mereka',
  'supaya',
  'berhenti',
  'berehat',
  'dan',
  'tidur',
  'sebentar',
  'sekiranya',
  'mengantuk',
  'ketika',
  'memandu.'],
 'tags': [{'text': 'KUALA LUMPUR:',
   'type': 'PROPN',
   'score': 1.0,
   'beginOffset': 0,
   'endOffset': 1},
  {'text': 'Sempena',
   'type': 'ADP',
   'score': 1.0,
   'beginOffset': 2,
   'endOffset': 2},
  {'text': 'sambutan Aidilfitri minggu',
   'type': 'NOUN',
   'score': 1.0,
   'beginOffset': 3,
   'endOffset': 5},
  {'text': 'depan,',
   'type': 'ADJ',
   'score': 1.0,
   'beginOffset': 6,
   'endOffset': 6},
  {'text': 'Perdana Menteri Tun Dr Mahathir Mohamad',
   'type': 'PROPN',
   'score': 1.0,
   'beginOffset': 7,
   'endOffset': 12},
  {'text': 'dan',
   'type': 'CCONJ',
   'score': 1.0,
   'beginOffset': 13,
   'endOffset': 13},
  {'text': 'Menteri Pengangkutan Anthony Loke Siew Fook',
   'type': 'PROPN',
   'score': 1.0,
   'beginOffset': 14,
   'endOffset': 19},
  {'text': 'menitipkan',
   'type': 'VERB',
   'score': 1.0,
   'beginOffset': 20,
   'endOffset': 20},
  {'text': 'pesanan',
   'type': 'NOUN',
   'score': 1.0,
   'beginOffset': 21,
   'endOffset': 21},
  {'text': 'khas',
   'type': 'ADJ',
   'score': 1.0,
   'beginOffset': 22,
   'endOffset': 22},
  {'text': 'kepada',
   'type': 'ADP',
   'score': 1.0,
   'beginOffset': 23,
   'endOffset': 23},
  {'text': 'orang',
   'type': 'NOUN',
   'score': 1.0,
   'beginOffset': 24,
   'endOffset': 24},
  {'text': 'ramai',
   'type': 'ADJ',
   'score': 1.0,
   'beginOffset': 25,
   'endOffset': 25},
  {'text': 'yang',
   'type': 'PRON',
   'score': 1.0,
   'beginOffset': 26,
   'endOffset': 26},
  {'text': 'mahu',
   'type': 'ADV',
   'score': 1.0,
   'beginOffset': 27,
   'endOffset': 27},
  {'text': 'pulang',
   'type': 'VERB',
   'score': 1.0,
   'beginOffset': 28,
   'endOffset': 28},
  {'text': 'ke',
   'type': 'ADP',
   'score': 1.0,
   'beginOffset': 29,
   'endOffset': 29},
  {'text': 'kampung halaman',
   'type': 'NOUN',
   'score': 1.0,
   'beginOffset': 30,
   'endOffset': 31},
  {'text': 'masing-masing.',
   'type': 'DET',
   'score': 1.0,
   'beginOffset': 32,
   'endOffset': 32},
  {'text': 'Dalam',
   'type': 'ADP',
   'score': 1.0,
   'beginOffset': 33,
   'endOffset': 33},
  {'text': 'video',
   'type': 'NOUN',
   'score': 1.0,
   'beginOffset': 34,
   'endOffset': 34},
  {'text': 'pendek',
   'type': 'ADJ',
   'score': 1.0,
   'beginOffset': 35,
   'endOffset': 35},
  {'text': 'terbitan',
   'type': 'NOUN',
   'score': 1.0,
   'beginOffset': 36,
   'endOffset': 36},
  {'text': 'Jabatan Keselamatan Jalan Raya',
   'type': 'PROPN',
   'score': 1.0,
   'beginOffset': 37,
   'endOffset': 40},
  {'text': '(JKJR)',
   'type': 'PUNCT',
   'score': 1.0,
   'beginOffset': 41,
   'endOffset': 41},
  {'text': 'itu,',
   'type': 'DET',
   'score': 1.0,
   'beginOffset': 42,
   'endOffset': 42},
  {'text': 'Dr Mahathir',
   'type': 'PROPN',
   'score': 1.0,
   'beginOffset': 43,
   'endOffset': 44},
  {'text': 'menasihati',
   'type': 'VERB',
   'score': 1.0,
   'beginOffset': 45,
   'endOffset': 45},
  {'text': 'mereka',
   'type': 'PRON',
   'score': 1.0,
   'beginOffset': 46,
   'endOffset': 46},
  {'text': 'supaya',
   'type': 'SCONJ',
   'score': 1.0,
   'beginOffset': 47,
   'endOffset': 47},
  {'text': 'berhenti berehat',
   'type': 'VERB',
   'score': 1.0,
   'beginOffset': 48,
   'endOffset': 49},
  {'text': 'dan',
   'type': 'CCONJ',
   'score': 1.0,
   'beginOffset': 50,
   'endOffset': 50},
  {'text': 'tidur',
   'type': 'VERB',
   'score': 1.0,
   'beginOffset': 51,
   'endOffset': 51},
  {'text': 'sebentar',
   'type': 'NOUN',
   'score': 1.0,
   'beginOffset': 52,
   'endOffset': 52},
  {'text': 'sekiranya',
   'type': 'SCONJ',
   'score': 1.0,
   'beginOffset': 53,
   'endOffset': 53},
  {'text': 'mengantuk',
   'type': 'ADJ',
   'score': 1.0,
   'beginOffset': 54,
   'endOffset': 54},
  {'text': 'ketika',
   'type': 'SCONJ',
   'score': 1.0,
   'beginOffset': 55,
   'endOffset': 55}]}

Vectorize

Let say you want to visualize word level in lower dimension, you can use model.vectorize,

def vectorize(self, string: str):
    """
    vectorize a string.

    Parameters
    ----------
    string: List[str]

    Returns
    -------
    result: np.array
    """
[10]:
strings = [string,
          'Husein baca buku Perlembagaan yang berharga 3k ringgit dekat kfc sungai petani minggu lepas, 2 ptg 2 oktober 2019 , suhu 32 celcius, sambil makan ayam goreng dan milo o ais',
          'contact Husein at husein.zol05@gmail.com',
          'tolong tempahkan meja makan makan nasi dagang dan jus apple, milo tarik esok dekat Restoran Sebulek']
[11]:
r = [quantized_model.vectorize(string) for string in strings]
[12]:
x, y = [], []
for row in r:
    x.extend([i[0] for i in row])
    y.extend([i[1] for i in row])
[13]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

tsne = TSNE().fit_transform(y)
tsne.shape
[13]:
(108, 2)
[14]:
plt.figure(figsize = (7, 7))
plt.scatter(tsne[:, 0], tsne[:, 1])
labels = x
for label, x, y in zip(
    labels, tsne[:, 0], tsne[:, 1]
):
    label = (
        '%s, %.3f' % (label[0], label[1])
        if isinstance(label, list)
        else label
    )
    plt.annotate(
        label,
        xy = (x, y),
        xytext = (0, 0),
        textcoords = 'offset points',
    )
_images/load-pos_24_0.png

Pretty good, the model able to know cluster similar part-of-speech.

Voting stack model

[16]:
alxlnet = malaya.pos.transformer(model = 'alxlnet')
malaya.stack.voting_stack([model, alxlnet, alxlnet], string)
[16]:
[('KUALA', 'PROPN'),
 ('LUMPUR:', 'PROPN'),
 ('Sempena', 'ADP'),
 ('sambutan', 'NOUN'),
 ('Aidilfitri', 'PROPN'),
 ('minggu', 'NOUN'),
 ('depan,', 'ADJ'),
 ('Perdana', 'PROPN'),
 ('Menteri', 'PROPN'),
 ('Tun', 'PROPN'),
 ('Dr', 'PROPN'),
 ('Mahathir', 'PROPN'),
 ('Mohamad', 'PROPN'),
 ('dan', 'CCONJ'),
 ('Menteri', 'PROPN'),
 ('Pengangkutan', 'PROPN'),
 ('Anthony', 'PROPN'),
 ('Loke', 'PROPN'),
 ('Siew', 'PROPN'),
 ('Fook', 'PROPN'),
 ('menitipkan', 'VERB'),
 ('pesanan', 'NOUN'),
 ('khas', 'ADJ'),
 ('kepada', 'ADP'),
 ('orang', 'NOUN'),
 ('ramai', 'ADJ'),
 ('yang', 'PRON'),
 ('mahu', 'ADV'),
 ('pulang', 'VERB'),
 ('ke', 'ADP'),
 ('kampung', 'NOUN'),
 ('halaman', 'NOUN'),
 ('masing-masing.', 'ADV'),
 ('Dalam', 'ADP'),
 ('video', 'NOUN'),
 ('pendek', 'ADJ'),
 ('terbitan', 'NOUN'),
 ('Jabatan', 'NOUN'),
 ('Keselamatan', 'PROPN'),
 ('Jalan', 'PROPN'),
 ('Raya', 'PROPN'),
 ('(JKJR)', 'PUNCT'),
 ('itu,', 'DET'),
 ('Dr', 'PROPN'),
 ('Mahathir', 'PROPN'),
 ('menasihati', 'VERB'),
 ('mereka', 'PRON'),
 ('supaya', 'SCONJ'),
 ('berhenti', 'VERB'),
 ('berehat', 'VERB'),
 ('dan', 'CCONJ'),
 ('tidur', 'VERB'),
 ('sebentar', 'ADV'),
 ('sekiranya', 'SCONJ'),
 ('mengantuk', 'ADJ'),
 ('ketika', 'SCONJ'),
 ('memandu.', 'VERB')]
[ ]: