Part-of-Speech Recognition

This tutorial is available as an IPython notebook at Malaya/example/part-of-speech.

This module was trained only on standard language structure, so it is not safe to use on local language structure.

[1]:

import logging

logging.basicConfig(level=logging.INFO)

[2]:

%%time
import malaya

INFO:numexpr.utils:NumExpr defaulting to 8 threads.

CPU times: user 5.38 s, sys: 899 ms, total: 6.28 s
Wall time: 6.42 s

Describe supported POS

[3]:

malaya.pos.describe()

[3]:

	Tag	Description
0	ADJ	Adjective, kata sifat
1	ADP	Adposition
2	ADV	Adverb, kata keterangan
3	ADX	Auxiliary verb, kata kerja tambahan
4	CCONJ	Coordinating conjuction, kata hubung
5	DET	Determiner, kata penentu
6	NOUN	Noun, kata nama
7	NUM	Number, nombor
8	PART	Particle
9	PRON	Pronoun, kata ganti
10	PROPN	Proper noun, kata ganti nama khas
11	SCONJ	Subordinating conjunction
12	SYM	Symbol
13	VERB	Verb, kata kerja
14	X	Other

List available Transformer POS models

[4]:

malaya.pos.available_transformer()

INFO:malaya.pos:trained on 80% dataset, tested on another 20% test set, dataset at https://github.com/huseinzol05/Malay-Dataset/tree/master/tagging/part-of-speech

[4]:

	Size (MB)	Quantized Size (MB)	macro precision	macro recall	macro f1-score
bert	426.4	111.00	0.93280	0.93129	0.93181
tiny-bert	57.7	15.40	0.92810	0.92649	0.92704
albert	48.7	12.80	0.93199	0.91948	0.92547
tiny-albert	22.4	5.98	0.90579	0.89501	0.90002
xlnet	446.6	118.00	0.93303	0.93222	0.93236
alxlnet	46.8	13.30	0.92732	0.93046	0.92819
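
The table above trades model size against macro F1. As a rough sketch of how one might pick a model from it, the snippet below copies the size and F1 columns into a plain dictionary and returns the smallest model that meets a minimum F1. The names `MODELS` and `smallest_model` are illustrative and not part of Malaya; the F1 figures are the reported ones for the unquantized models, so treat the quantized choice as approximate.

```python
# Figures copied from the table above: (size MB, quantized size MB, macro F1).
MODELS = {
    'bert': (426.4, 111.00, 0.93181),
    'tiny-bert': (57.7, 15.40, 0.92704),
    'albert': (48.7, 12.80, 0.92547),
    'tiny-albert': (22.4, 5.98, 0.90002),
    'xlnet': (446.6, 118.00, 0.93236),
    'alxlnet': (46.8, 13.30, 0.92819),
}


def smallest_model(min_f1, quantized=False):
    """Return the smallest model whose macro F1 is at least min_f1, or None."""
    size_index = 1 if quantized else 0
    candidates = [
        (stats[size_index], name)
        for name, stats in MODELS.items()
        if stats[2] >= min_f1
    ]
    return min(candidates)[1] if candidates else None


print(smallest_model(0.925))  # alxlnet
print(smallest_model(0.93))   # bert
```

The returned name can then be passed as the `model` argument when loading a tagger.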

[4]:

string = 'KUALA LUMPUR: Sempena sambutan Aidilfitri minggu depan, Perdana Menteri Tun Dr Mahathir Mohamad dan Menteri Pengangkutan Anthony Loke Siew Fook menitipkan pesanan khas kepada orang ramai yang mahu pulang ke kampung halaman masing-masing. Dalam video pendek terbitan Jabatan Keselamatan Jalan Raya (JKJR) itu, Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar  sekiranya mengantuk ketika memandu.'

Load Transformer model

def transformer(model: str = 'xlnet', quantized: bool = False, **kwargs):
    """
    Load Transformer POS Tagging model, transfer learning Transformer + CRF.

    Parameters
    ----------
    model : str, optional (default='xlnet')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.
        * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.

    quantized : bool, optional (default=False)
        If True, will load the 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result : malaya.supervised.tag.transformer function
    """

[5]:

model = malaya.pos.transformer(model = 'albert')

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:112: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:114: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/albert/tokenization.py:240: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

INFO:tensorflow:loading sentence piece model

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:107: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

Load Quantized model

To load the 8-bit quantized model, pass quantized = True. The default is False.

Expect a slight accuracy drop from the quantized model. It is also not necessarily faster than the normal 32-bit float model; that depends entirely on the machine.
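
Since the speed of a quantized model depends on the machine, it is worth measuring it directly. The helper below is a minimal sketch using only the standard library; `median_latency` is a hypothetical name, not a Malaya function, and it works with any callable.

```python
import time


def median_latency(fn, *args, repeats=5):
    """Call fn(*args) `repeats` times and return the middle wall time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    timings.sort()
    # With an odd number of repeats this is the median timing.
    return timings[len(timings) // 2]
```

Once both the normal and the quantized model are loaded, comparing `median_latency(model.predict, string)` with `median_latency(quantized_model.predict, string)` gives a rough idea of which one is faster on your hardware.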

[6]:

quantized_model = malaya.pos.transformer(model = 'albert', quantized = True)

WARNING:root:Load quantized model will cause accuracy drop.

INFO:tensorflow:loading sentence piece model

Predict

def predict(self, string: str):
    """
    Tag a string.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: List[Tuple[str, str]]
    """

[7]:

model.predict(string)

[7]:

[('KUALA', 'PROPN'),
 ('LUMPUR:', 'PROPN'),
 ('Sempena', 'ADP'),
 ('sambutan', 'NOUN'),
 ('Aidilfitri', 'NOUN'),
 ('minggu', 'NOUN'),
 ('depan,', 'ADJ'),
 ('Perdana', 'PROPN'),
 ('Menteri', 'PROPN'),
 ('Tun', 'PROPN'),
 ('Dr', 'PROPN'),
 ('Mahathir', 'PROPN'),
 ('Mohamad', 'PROPN'),
 ('dan', 'CCONJ'),
 ('Menteri', 'PROPN'),
 ('Pengangkutan', 'PROPN'),
 ('Anthony', 'PROPN'),
 ('Loke', 'PROPN'),
 ('Siew', 'PROPN'),
 ('Fook', 'PROPN'),
 ('menitipkan', 'VERB'),
 ('pesanan', 'NOUN'),
 ('khas', 'ADJ'),
 ('kepada', 'ADP'),
 ('orang', 'NOUN'),
 ('ramai', 'ADJ'),
 ('yang', 'PRON'),
 ('mahu', 'ADV'),
 ('pulang', 'VERB'),
 ('ke', 'ADP'),
 ('kampung', 'NOUN'),
 ('halaman', 'NOUN'),
 ('masing-masing.', 'DET'),
 ('Dalam', 'ADP'),
 ('video', 'NOUN'),
 ('pendek', 'ADJ'),
 ('terbitan', 'NOUN'),
 ('Jabatan', 'PROPN'),
 ('Keselamatan', 'PROPN'),
 ('Jalan', 'PROPN'),
 ('Raya', 'PROPN'),
 ('(JKJR)', 'PUNCT'),
 ('itu,', 'DET'),
 ('Dr', 'PROPN'),
 ('Mahathir', 'PROPN'),
 ('menasihati', 'VERB'),
 ('mereka', 'PRON'),
 ('supaya', 'SCONJ'),
 ('berhenti', 'VERB'),
 ('berehat', 'VERB'),
 ('dan', 'CCONJ'),
 ('tidur', 'VERB'),
 ('sebentar', 'NOUN'),
 ('sekiranya', 'SCONJ'),
 ('mengantuk', 'ADJ'),
 ('ketika', 'SCONJ'),
 ('memandu.', 'VERB')]
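
The result is a plain list of (token, tag) pairs, so it can be post-processed with ordinary Python. Note that tokens keep any attached punctuation, as in 'LUMPUR:' above. The sketch below counts tags and filters tokens by tag on a short excerpt of the output; `tag_counts` and `words_with_tag` are illustrative helper names, not part of Malaya.

```python
from collections import Counter

# First few (token, tag) pairs copied from the output above.
tagged = [
    ('KUALA', 'PROPN'),
    ('LUMPUR:', 'PROPN'),
    ('Sempena', 'ADP'),
    ('sambutan', 'NOUN'),
    ('Aidilfitri', 'NOUN'),
    ('minggu', 'NOUN'),
    ('depan,', 'ADJ'),
]


def tag_counts(pairs):
    """Count how often each tag occurs in a list of (token, tag) pairs."""
    return Counter(tag for _, tag in pairs)


def words_with_tag(pairs, tag):
    """Return the tokens that were assigned the given tag, in order."""
    return [word for word, word_tag in pairs if word_tag == tag]


print(tag_counts(tagged).most_common(1))  # [('NOUN', 3)]
print(words_with_tag(tagged, 'PROPN'))    # ['KUALA', 'LUMPUR:']
```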

[8]:

quantized_model.predict(string)

[8]:

[('KUALA', 'PROPN'),
 ('LUMPUR:', 'PROPN'),
 ('Sempena', 'ADP'),
 ('sambutan', 'NOUN'),
 ('Aidilfitri', 'NOUN'),
 ('minggu', 'NOUN'),
 ('depan,', 'ADJ'),
 ('Perdana', 'PROPN'),
 ('Menteri', 'PROPN'),
 ('Tun', 'PROPN'),
 ('Dr', 'PROPN'),
 ('Mahathir', 'PROPN'),
 ('Mohamad', 'PROPN'),
 ('dan', 'CCONJ'),
 ('Menteri', 'PROPN'),
 ('Pengangkutan', 'PROPN'),
 ('Anthony', 'PROPN'),
 ('Loke', 'PROPN'),
 ('Siew', 'PROPN'),
 ('Fook', 'PROPN'),
 ('menitipkan', 'VERB'),
 ('pesanan', 'NOUN'),
 ('khas', 'ADJ'),
 ('kepada', 'ADP'),
 ('orang', 'NOUN'),
 ('ramai', 'ADJ'),
 ('yang', 'PRON'),
 ('mahu', 'ADV'),
 ('pulang', 'VERB'),
 ('ke', 'ADP'),
 ('kampung', 'NOUN'),
 ('halaman', 'NOUN'),
 ('masing-masing.', 'DET'),
 ('Dalam', 'ADP'),
 ('video', 'NOUN'),
 ('pendek', 'ADJ'),
 ('terbitan', 'NOUN'),
 ('Jabatan', 'PROPN'),
 ('Keselamatan', 'PROPN'),
 ('Jalan', 'PROPN'),
 ('Raya', 'PROPN'),
 ('(JKJR)', 'PUNCT'),
 ('itu,', 'DET'),
 ('Dr', 'PROPN'),
 ('Mahathir', 'PROPN'),
 ('menasihati', 'VERB'),
 ('mereka', 'PRON'),
 ('supaya', 'SCONJ'),
 ('berhenti', 'VERB'),
 ('berehat', 'VERB'),
 ('dan', 'CCONJ'),
 ('tidur', 'VERB'),
 ('sebentar', 'NOUN'),
 ('sekiranya', 'SCONJ'),
 ('mengantuk', 'ADJ'),
 ('ketika', 'SCONJ'),
 ('memandu.', 'VERB')]
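
For this sentence the quantized output is identical to the normal output. To quantify the agreement on other inputs, one option is to compare the two taggings position by position. The function below is a minimal sketch under the assumption that both models return the same tokens in the same order; `tag_agreement` is a hypothetical helper, not a Malaya function.

```python
def tag_agreement(first, second):
    """Return the fraction of positions where two taggings assign the same tag."""
    if len(first) != len(second):
        raise ValueError('taggings must cover the same number of tokens')
    if not first:
        return 1.0
    same = sum(1 for (_, a), (_, b) in zip(first, second) if a == b)
    return same / len(first)


normal = [('tidur', 'VERB'), ('sebentar', 'NOUN')]
other = [('tidur', 'VERB'), ('sebentar', 'ADV')]
print(tag_agreement(normal, normal))  # 1.0
print(tag_agreement(normal, other))   # 0.5
```

With the models above, this could be called as `tag_agreement(model.predict(string), quantized_model.predict(string))`.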

Group similar tags

def analyze(self, string: str):
        """
        Analyze a string.

        Parameters
        ----------
        string : str

        Returns
        -------
        result: {'words': List[str], 'tags': [{'text': 'text', 'type': 'location', 'score': 1.0, 'beginOffset': 0, 'endOffset': 1}]}
        """

[9]:

model.analyze(string)

[9]:

{'words': ['KUALA',
  'LUMPUR:',
  'Sempena',
  'sambutan',
  'Aidilfitri',
  'minggu',
  'depan,',
  'Perdana',
  'Menteri',
  'Tun',
  'Dr',
  'Mahathir',
  'Mohamad',
  'dan',
  'Menteri',
  'Pengangkutan',
  'Anthony',
  'Loke',
  'Siew',
  'Fook',
  'menitipkan',
  'pesanan',
  'khas',
  'kepada',
  'orang',
  'ramai',
  'yang',
  'mahu',
  'pulang',
  'ke',
  'kampung',
  'halaman',
  'masing-masing.',
  'Dalam',
  'video',
  'pendek',
  'terbitan',
  'Jabatan',
  'Keselamatan',
  'Jalan',
  'Raya',
  '(JKJR)',
  'itu,',
  'Dr',
  'Mahathir',
  'menasihati',
  'mereka',
  'supaya',
  'berhenti',
  'berehat',
  'dan',
  'tidur',
  'sebentar',
  'sekiranya',
  'mengantuk',
  'ketika',
  'memandu.'],
 'tags': [{'text': 'KUALA LUMPUR:',
   'type': 'PROPN',
   'score': 1.0,
   'beginOffset': 0,
   'endOffset': 1},
  {'text': 'Sempena',
   'type': 'ADP',
   'score': 1.0,
   'beginOffset': 2,
   'endOffset': 2},
  {'text': 'sambutan Aidilfitri minggu',
   'type': 'NOUN',
   'score': 1.0,
   'beginOffset': 3,
   'endOffset': 5},
  {'text': 'depan,',
   'type': 'ADJ',
   'score': 1.0,
   'beginOffset': 6,
   'endOffset': 6},
  {'text': 'Perdana Menteri Tun Dr Mahathir Mohamad',
   'type': 'PROPN',
   'score': 1.0,
   'beginOffset': 7,
   'endOffset': 12},
  {'text': 'dan',
   'type': 'CCONJ',
   'score': 1.0,
   'beginOffset': 13,
   'endOffset': 13},
  {'text': 'Menteri Pengangkutan Anthony Loke Siew Fook',
   'type': 'PROPN',
   'score': 1.0,
   'beginOffset': 14,
   'endOffset': 19},
  {'text': 'menitipkan',
   'type': 'VERB',
   'score': 1.0,
   'beginOffset': 20,
   'endOffset': 20},
  {'text': 'pesanan',
   'type': 'NOUN',
   'score': 1.0,
   'beginOffset': 21,
   'endOffset': 21},
  {'text': 'khas',
   'type': 'ADJ',
   'score': 1.0,
   'beginOffset': 22,
   'endOffset': 22},
  {'text': 'kepada',
   'type': 'ADP',
   'score': 1.0,
   'beginOffset': 23,
   'endOffset': 23},
  {'text': 'orang',
   'type': 'NOUN',
   'score': 1.0,
   'beginOffset': 24,
   'endOffset': 24},
  {'text': 'ramai',
   'type': 'ADJ',
   'score': 1.0,
   'beginOffset': 25,
   'endOffset': 25},
  {'text': 'yang',
   'type': 'PRON',
   'score': 1.0,
   'beginOffset': 26,
   'endOffset': 26},
  {'text': 'mahu',
   'type': 'ADV',
   'score': 1.0,
   'beginOffset': 27,
   'endOffset': 27},
  {'text': 'pulang',
   'type': 'VERB',
   'score': 1.0,
   'beginOffset': 28,
   'endOffset': 28},
  {'text': 'ke',
   'type': 'ADP',
   'score': 1.0,
   'beginOffset': 29,
   'endOffset': 29},
  {'text': 'kampung halaman',
   'type': 'NOUN',
   'score': 1.0,
   'beginOffset': 30,
   'endOffset': 31},
  {'text': 'masing-masing.',
   'type': 'DET',
   'score': 1.0,
   'beginOffset': 32,
   'endOffset': 32},
  {'text': 'Dalam',
   'type': 'ADP',
   'score': 1.0,
   'beginOffset': 33,
   'endOffset': 33},
  {'text': 'video',
   'type': 'NOUN',
   'score': 1.0,
   'beginOffset': 34,
   'endOffset': 34},
  {'text': 'pendek',
   'type': 'ADJ',
   'score': 1.0,
   'beginOffset': 35,
   'endOffset': 35},
  {'text': 'terbitan',
   'type': 'NOUN',
   'score': 1.0,
   'beginOffset': 36,
   'endOffset': 36},
  {'text': 'Jabatan Keselamatan Jalan Raya',
   'type': 'PROPN',
   'score': 1.0,
   'beginOffset': 37,
   'endOffset': 40},
  {'text': '(JKJR)',
   'type': 'PUNCT',
   'score': 1.0,
   'beginOffset': 41,
   'endOffset': 41},
  {'text': 'itu,',
   'type': 'DET',
   'score': 1.0,
   'beginOffset': 42,
   'endOffset': 42},
  {'text': 'Dr Mahathir',
   'type': 'PROPN',
   'score': 1.0,
   'beginOffset': 43,
   'endOffset': 44},
  {'text': 'menasihati',
   'type': 'VERB',
   'score': 1.0,
   'beginOffset': 45,
   'endOffset': 45},
  {'text': 'mereka',
   'type': 'PRON',
   'score': 1.0,
   'beginOffset': 46,
   'endOffset': 46},
  {'text': 'supaya',
   'type': 'SCONJ',
   'score': 1.0,
   'beginOffset': 47,
   'endOffset': 47},
  {'text': 'berhenti berehat',
   'type': 'VERB',
   'score': 1.0,
   'beginOffset': 48,
   'endOffset': 49},
  {'text': 'dan',
   'type': 'CCONJ',
   'score': 1.0,
   'beginOffset': 50,
   'endOffset': 50},
  {'text': 'tidur',
   'type': 'VERB',
   'score': 1.0,
   'beginOffset': 51,
   'endOffset': 51},
  {'text': 'sebentar',
   'type': 'NOUN',
   'score': 1.0,
   'beginOffset': 52,
   'endOffset': 52},
  {'text': 'sekiranya',
   'type': 'SCONJ',
   'score': 1.0,
   'beginOffset': 53,
   'endOffset': 53},
  {'text': 'mengantuk',
   'type': 'ADJ',
   'score': 1.0,
   'beginOffset': 54,
   'endOffset': 54},
  {'text': 'ketika',
   'type': 'SCONJ',
   'score': 1.0,
   'beginOffset': 55,
   'endOffset': 55}]}
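
The grouping shown above merges runs of adjacent tokens that share a tag, with offsets counted in tokens. As an illustration of the idea only (this is not Malaya's implementation, and `group_tags` is a hypothetical name), similar groups can be derived from a list of (token, tag) pairs with itertools.groupby. The sketch omits the 'score' field and emits a group for every token, including the last one.

```python
from itertools import groupby


def group_tags(pairs):
    """Merge runs of adjacent tokens that share a tag into single entries."""
    groups = []
    position = 0  # index of the first token of the current run
    for tag, run in groupby(pairs, key=lambda pair: pair[1]):
        words = [word for word, _ in run]
        groups.append({
            'text': ' '.join(words),
            'type': tag,
            'beginOffset': position,
            'endOffset': position + len(words) - 1,
        })
        position += len(words)
    return groups


pairs = [
    ('KUALA', 'PROPN'),
    ('LUMPUR:', 'PROPN'),
    ('Sempena', 'ADP'),
    ('sambutan', 'NOUN'),
    ('Aidilfitri', 'NOUN'),
    ('minggu', 'NOUN'),
]
for group in group_tags(pairs):
    print(group)
```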

Vectorize

Say you want to visualize word-level vectors in a lower dimension. You can use model.vectorize:

def vectorize(self, string: str):
    """
    Vectorize a string.

    Parameters
    ----------
    string: str

    Returns
    -------
    result: np.array
    """

[10]:

strings = [string,
          'Husein baca buku Perlembagaan yang berharga 3k ringgit dekat kfc sungai petani minggu lepas, 2 ptg 2 oktober 2019 , suhu 32 celcius, sambil makan ayam goreng dan milo o ais',
          'contact Husein at husein.zol05@gmail.com',
          'tolong tempahkan meja makan makan nasi dagang dan jus apple, milo tarik esok dekat Restoran Sebulek']

[11]:

r = [quantized_model.vectorize(string) for string in strings]

[12]:

x, y = [], []
for row in r:
    x.extend([i[0] for i in row])
    y.extend([i[1] for i in row])

[13]:

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

tsne = TSNE().fit_transform(y)
tsne.shape

[13]:

(108, 2)

[14]:

plt.figure(figsize = (7, 7))
plt.scatter(tsne[:, 0], tsne[:, 1])
labels = x
# Use distinct loop names so the `x` and `y` lists built earlier are not overwritten.
for label, px, py in zip(
    labels, tsne[:, 0], tsne[:, 1]
):
    label = (
        '%s, %.3f' % (label[0], label[1])
        if isinstance(label, list)
        else label
    )
    plt.annotate(
        label,
        xy = (px, py),
        xytext = (0, 0),
        textcoords = 'offset points',
    )

Pretty good: the model is able to cluster similar parts of speech.
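
To check the clustering visually, each point can be coloured by its predicted tag. The helper below is a small sketch that turns a list of tag strings into integer colour indices; `tag_to_index` is a hypothetical name. It assumes the tags are collected in the same token order as the vectors, for example by calling predict on each string, which is an assumption about how the two outputs line up rather than something the library guarantees here.

```python
def tag_to_index(tags):
    """Map each tag to a small integer, numbered in order of first appearance."""
    index = {}
    # setdefault stores the next free integer the first time a tag is seen
    # and returns the stored integer on every later occurrence.
    return [index.setdefault(tag, len(index)) for tag in tags]


print(tag_to_index(['NOUN', 'VERB', 'NOUN', 'ADJ']))  # [0, 1, 0, 2]
```

The resulting list can be passed as the `c` argument of the scatter call, as in `plt.scatter(tsne[:, 0], tsne[:, 1], c = tag_to_index(tags))`, where `tags` is the hypothetical list of predicted tags.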

Voting stack model

[16]:

alxlnet = malaya.pos.transformer(model = 'alxlnet')
malaya.stack.voting_stack([model, alxlnet, alxlnet], string)

[16]:

[('KUALA', 'PROPN'),
 ('LUMPUR:', 'PROPN'),
 ('Sempena', 'ADP'),
 ('sambutan', 'NOUN'),
 ('Aidilfitri', 'PROPN'),
 ('minggu', 'NOUN'),
 ('depan,', 'ADJ'),
 ('Perdana', 'PROPN'),
 ('Menteri', 'PROPN'),
 ('Tun', 'PROPN'),
 ('Dr', 'PROPN'),
 ('Mahathir', 'PROPN'),
 ('Mohamad', 'PROPN'),
 ('dan', 'CCONJ'),
 ('Menteri', 'PROPN'),
 ('Pengangkutan', 'PROPN'),
 ('Anthony', 'PROPN'),
 ('Loke', 'PROPN'),
 ('Siew', 'PROPN'),
 ('Fook', 'PROPN'),
 ('menitipkan', 'VERB'),
 ('pesanan', 'NOUN'),
 ('khas', 'ADJ'),
 ('kepada', 'ADP'),
 ('orang', 'NOUN'),
 ('ramai', 'ADJ'),
 ('yang', 'PRON'),
 ('mahu', 'ADV'),
 ('pulang', 'VERB'),
 ('ke', 'ADP'),
 ('kampung', 'NOUN'),
 ('halaman', 'NOUN'),
 ('masing-masing.', 'ADV'),
 ('Dalam', 'ADP'),
 ('video', 'NOUN'),
 ('pendek', 'ADJ'),
 ('terbitan', 'NOUN'),
 ('Jabatan', 'NOUN'),
 ('Keselamatan', 'PROPN'),
 ('Jalan', 'PROPN'),
 ('Raya', 'PROPN'),
 ('(JKJR)', 'PUNCT'),
 ('itu,', 'DET'),
 ('Dr', 'PROPN'),
 ('Mahathir', 'PROPN'),
 ('menasihati', 'VERB'),
 ('mereka', 'PRON'),
 ('supaya', 'SCONJ'),
 ('berhenti', 'VERB'),
 ('berehat', 'VERB'),
 ('dan', 'CCONJ'),
 ('tidur', 'VERB'),
 ('sebentar', 'ADV'),
 ('sekiranya', 'SCONJ'),
 ('mengantuk', 'ADJ'),
 ('ketika', 'SCONJ'),
 ('memandu.', 'VERB')]
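
One way to picture a voting stack is as a per-token majority vote over the outputs of several taggers. The sketch below illustrates that idea on plain lists of (token, tag) pairs. It is not Malaya's implementation, `majority_vote` is a hypothetical name, and it assumes all taggers return the same tokens in the same order.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine several taggings of the same tokens by per-token majority vote."""
    combined = []
    for token_votes in zip(*predictions):
        word = token_votes[0][0]
        tags = [tag for _, tag in token_votes]
        # most_common keeps first-seen order among ties (Python 3.7+),
        # so a tie falls back to the earliest tagger in the list.
        combined.append((word, Counter(tags).most_common(1)[0][0]))
    return combined


a = [('sebentar', 'NOUN'), ('Jabatan', 'PROPN')]
b = [('sebentar', 'ADV'), ('Jabatan', 'NOUN')]
print(majority_vote([a, b, b]))  # [('sebentar', 'ADV'), ('Jabatan', 'NOUN')]
```

Under plain majority voting, a model that appears twice in a list of three wins wherever it disagrees with the third. That would be consistent with the output above, where tokens such as 'sebentar' and 'Jabatan' carry different tags than in the earlier albert output.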
