{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Part-of-Speech Recognition" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/part-of-speech](https://github.com/huseinzol05/Malaya/tree/master/example/part-of-speech).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module was trained only on standard language structure, so it is not safe to use for local language structure.\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 5.94 s, sys: 1.17 s, total: 7.11 s\n", "Wall time: 8.41 s\n" ] } ], "source": [ "%%time\n", "import malaya" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Models accuracy\n", "\n", "We use `sklearn.metrics.classification_report` for accuracy reporting; see https://malaya.readthedocs.io/en/latest/models-accuracy.html#pos-recognition for the results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Describe supported POS" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
    Tag    Description
0   ADJ    Adjective, kata sifat
1   ADP    Adposition
2   ADV    Adverb, kata keterangan
3   ADX    Auxiliary verb, kata kerja tambahan
4   CCONJ  Coordinating conjuction, kata hubung
5   DET    Determiner, kata penentu
6   NOUN   Noun, kata nama
7   NUM    Number, nombor
8   PART   Particle
9   PRON   Pronoun, kata ganti
10  PROPN  Proper noun, kata ganti nama khas
11  SCONJ  Subordinating conjunction
12  SYM    Symbol
13  VERB   Verb, kata kerja
14  X      Other
\n", "
" ], "text/plain": [ " Tag Description\n", "0 ADJ Adjective, kata sifat\n", "1 ADP Adposition\n", "2 ADV Adverb, kata keterangan\n", "3 ADX Auxiliary verb, kata kerja tambahan\n", "4 CCONJ Coordinating conjuction, kata hubung\n", "5 DET Determiner, kata penentu\n", "6 NOUN Noun, kata nama\n", "7 NUM Number, nombor\n", "8 PART Particle\n", "9 PRON Pronoun, kata ganti\n", "10 PROPN Proper noun, kata ganti nama khas\n", "11 SCONJ Subordinating conjunction\n", "12 SYM Symbol\n", "13 VERB Verb, kata kerja\n", "14 X Other" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.pos.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available Transformer POS models" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:root:tested on 20% test set.\n" ] }, { "data": { "text/html": [ "
\n", "
             Size (MB)  Quantized Size (MB)  macro precision  macro recall  macro f1-score
bert             426.4               111.00          0.93280       0.93129         0.93181
tiny-bert         57.7                15.40          0.92810       0.92649         0.92704
albert            48.7                12.80          0.93199       0.91948         0.92547
tiny-albert       22.4                 5.98          0.90579       0.89501         0.90002
xlnet            446.6               118.00          0.93303       0.93222         0.93236
alxlnet           46.8                13.30          0.92732       0.93046         0.92819
\n", "
" ], "text/plain": [ " Size (MB) Quantized Size (MB) macro precision macro recall \\\n", "bert 426.4 111.00 0.93280 0.93129 \n", "tiny-bert 57.7 15.40 0.92810 0.92649 \n", "albert 48.7 12.80 0.93199 0.91948 \n", "tiny-albert 22.4 5.98 0.90579 0.89501 \n", "xlnet 446.6 118.00 0.93303 0.93222 \n", "alxlnet 46.8 13.30 0.92732 0.93046 \n", "\n", " macro f1-score \n", "bert 0.93181 \n", "tiny-bert 0.92704 \n", "albert 0.92547 \n", "tiny-albert 0.90002 \n", "xlnet 0.93236 \n", "alxlnet 0.92819 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.pos.available_transformer()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "string = 'KUALA LUMPUR: Sempena sambutan Aidilfitri minggu depan, Perdana Menteri Tun Dr Mahathir Mohamad dan Menteri Pengangkutan Anthony Loke Siew Fook menitipkan pesanan khas kepada orang ramai yang mahu pulang ke kampung halaman masing-masing. Dalam video pendek terbitan Jabatan Keselamatan Jalan Raya (JKJR) itu, Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Transformer model\n", "\n", "```python\n", "def transformer(model: str = 'xlnet', quantized: bool = False, **kwargs):\n", " \"\"\"\n", " Load Transformer POS Tagging model, transfer learning Transformer + CRF.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='bert')\n", " Model architecture supported. Allowed values:\n", "\n", " * ``'bert'`` - Google BERT BASE parameters.\n", " * ``'tiny-bert'`` - Google BERT TINY parameters.\n", " * ``'albert'`` - Google ALBERT BASE parameters.\n", " * ``'tiny-albert'`` - Google ALBERT TINY parameters.\n", " * ``'xlnet'`` - Google XLNET BASE parameters.\n", " * ``'alxlnet'`` - Malaya ALXLNET BASE parameters.\n", "\n", " quantized : bool, optional (default=False)\n", " if True, will load 8-bit quantized model. 
\n", " A quantized model is not necessarily faster; it depends entirely on the machine.\n", "\n", " Returns\n", " -------\n", " result : malaya.supervised.tag.transformer function\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:112: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:112: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:114: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:114: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/albert/tokenization.py:240: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:From /Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/albert/tokenization.py:240: The name tf.logging.info is deprecated. 
Please use tf.compat.v1.logging.info instead.\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:loading sentence piece model\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:tensorflow:loading sentence piece model\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:107: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:107: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.\n", "\n" ] } ], "source": [ "model = malaya.pos.transformer(model = 'albert')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Quantized model\n", "\n", "To load the 8-bit quantized model, simply pass `quantized = True`; the default is `False`.\n", "\n", "Expect a slight accuracy drop from the quantized model. It is not necessarily faster than the normal 32-bit float model; speed depends entirely on the machine." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:root:Load quantized model will cause accuracy drop.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:loading sentence piece model\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:tensorflow:loading sentence piece model\n" ] } ], "source": [ "quantized_model = malaya.pos.transformer(model = 'albert', quantized = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Predict\n", "\n", "```python\n", "def predict(self, string: str):\n", " \"\"\"\n", " Tag a string.\n", "\n", " Parameters\n", " ----------\n", " string : str\n", "\n", " Returns\n", " -------\n", " result: Tuple[str, str]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('KUALA', 'PROPN'),\n", " ('LUMPUR:', 'PROPN'),\n", " ('Sempena', 'ADP'),\n", " ('sambutan', 'NOUN'),\n", " ('Aidilfitri', 'NOUN'),\n", " ('minggu', 'NOUN'),\n", " ('depan,', 'ADJ'),\n", " ('Perdana', 'PROPN'),\n", " ('Menteri', 'PROPN'),\n", " ('Tun', 'PROPN'),\n", " ('Dr', 'PROPN'),\n", " ('Mahathir', 'PROPN'),\n", " ('Mohamad', 'PROPN'),\n", " ('dan', 'CCONJ'),\n", " ('Menteri', 'PROPN'),\n", " ('Pengangkutan', 'PROPN'),\n", " ('Anthony', 'PROPN'),\n", " ('Loke', 'PROPN'),\n", " ('Siew', 'PROPN'),\n", " ('Fook', 'PROPN'),\n", " ('menitipkan', 'VERB'),\n", " ('pesanan', 'NOUN'),\n", " ('khas', 'ADJ'),\n", " ('kepada', 'ADP'),\n", " ('orang', 'NOUN'),\n", " ('ramai', 'ADJ'),\n", " ('yang', 'PRON'),\n", " ('mahu', 'ADV'),\n", " ('pulang', 'VERB'),\n", " ('ke', 'ADP'),\n", " ('kampung', 'NOUN'),\n", " ('halaman', 'NOUN'),\n", " ('masing-masing.', 'DET'),\n", " ('Dalam', 'ADP'),\n", " ('video', 'NOUN'),\n", " ('pendek', 'ADJ'),\n", " ('terbitan', 'NOUN'),\n", " ('Jabatan', 'PROPN'),\n", " ('Keselamatan', 'PROPN'),\n", " ('Jalan', 
'PROPN'),\n", " ('Raya', 'PROPN'),\n", " ('(JKJR)', 'PUNCT'),\n", " ('itu,', 'DET'),\n", " ('Dr', 'PROPN'),\n", " ('Mahathir', 'PROPN'),\n", " ('menasihati', 'VERB'),\n", " ('mereka', 'PRON'),\n", " ('supaya', 'SCONJ'),\n", " ('berhenti', 'VERB'),\n", " ('berehat', 'VERB'),\n", " ('dan', 'CCONJ'),\n", " ('tidur', 'VERB'),\n", " ('sebentar', 'NOUN'),\n", " ('sekiranya', 'SCONJ'),\n", " ('mengantuk', 'ADJ'),\n", " ('ketika', 'SCONJ'),\n", " ('memandu.', 'VERB')]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.predict(string)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[('KUALA', 'PROPN'),\n", " ('LUMPUR:', 'PROPN'),\n", " ('Sempena', 'ADP'),\n", " ('sambutan', 'NOUN'),\n", " ('Aidilfitri', 'NOUN'),\n", " ('minggu', 'NOUN'),\n", " ('depan,', 'ADJ'),\n", " ('Perdana', 'PROPN'),\n", " ('Menteri', 'PROPN'),\n", " ('Tun', 'PROPN'),\n", " ('Dr', 'PROPN'),\n", " ('Mahathir', 'PROPN'),\n", " ('Mohamad', 'PROPN'),\n", " ('dan', 'CCONJ'),\n", " ('Menteri', 'PROPN'),\n", " ('Pengangkutan', 'PROPN'),\n", " ('Anthony', 'PROPN'),\n", " ('Loke', 'PROPN'),\n", " ('Siew', 'PROPN'),\n", " ('Fook', 'PROPN'),\n", " ('menitipkan', 'VERB'),\n", " ('pesanan', 'NOUN'),\n", " ('khas', 'ADJ'),\n", " ('kepada', 'ADP'),\n", " ('orang', 'NOUN'),\n", " ('ramai', 'ADJ'),\n", " ('yang', 'PRON'),\n", " ('mahu', 'ADV'),\n", " ('pulang', 'VERB'),\n", " ('ke', 'ADP'),\n", " ('kampung', 'NOUN'),\n", " ('halaman', 'NOUN'),\n", " ('masing-masing.', 'DET'),\n", " ('Dalam', 'ADP'),\n", " ('video', 'NOUN'),\n", " ('pendek', 'ADJ'),\n", " ('terbitan', 'NOUN'),\n", " ('Jabatan', 'PROPN'),\n", " ('Keselamatan', 'PROPN'),\n", " ('Jalan', 'PROPN'),\n", " ('Raya', 'PROPN'),\n", " ('(JKJR)', 'PUNCT'),\n", " ('itu,', 'DET'),\n", " ('Dr', 'PROPN'),\n", " ('Mahathir', 'PROPN'),\n", " ('menasihati', 'VERB'),\n", " ('mereka', 'PRON'),\n", " ('supaya', 'SCONJ'),\n", " ('berhenti', 
'VERB'),\n", " ('berehat', 'VERB'),\n", " ('dan', 'CCONJ'),\n", " ('tidur', 'VERB'),\n", " ('sebentar', 'NOUN'),\n", " ('sekiranya', 'SCONJ'),\n", " ('mengantuk', 'ADJ'),\n", " ('ketika', 'SCONJ'),\n", " ('memandu.', 'VERB')]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "quantized_model.predict(string)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Group similar tags\n", "\n", "```python\n", "def analyze(self, string: str):\n", " \"\"\"\n", " Analyze a string.\n", "\n", " Parameters\n", " ----------\n", " string : str\n", "\n", " Returns\n", " -------\n", " result: {'words': List[str], 'tags': [{'text': 'text', 'type': 'location', 'score': 1.0, 'beginOffset': 0, 'endOffset': 1}]}\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'words': ['KUALA',\n", " 'LUMPUR:',\n", " 'Sempena',\n", " 'sambutan',\n", " 'Aidilfitri',\n", " 'minggu',\n", " 'depan,',\n", " 'Perdana',\n", " 'Menteri',\n", " 'Tun',\n", " 'Dr',\n", " 'Mahathir',\n", " 'Mohamad',\n", " 'dan',\n", " 'Menteri',\n", " 'Pengangkutan',\n", " 'Anthony',\n", " 'Loke',\n", " 'Siew',\n", " 'Fook',\n", " 'menitipkan',\n", " 'pesanan',\n", " 'khas',\n", " 'kepada',\n", " 'orang',\n", " 'ramai',\n", " 'yang',\n", " 'mahu',\n", " 'pulang',\n", " 'ke',\n", " 'kampung',\n", " 'halaman',\n", " 'masing-masing.',\n", " 'Dalam',\n", " 'video',\n", " 'pendek',\n", " 'terbitan',\n", " 'Jabatan',\n", " 'Keselamatan',\n", " 'Jalan',\n", " 'Raya',\n", " '(JKJR)',\n", " 'itu,',\n", " 'Dr',\n", " 'Mahathir',\n", " 'menasihati',\n", " 'mereka',\n", " 'supaya',\n", " 'berhenti',\n", " 'berehat',\n", " 'dan',\n", " 'tidur',\n", " 'sebentar',\n", " 'sekiranya',\n", " 'mengantuk',\n", " 'ketika',\n", " 'memandu.'],\n", " 'tags': [{'text': 'KUALA LUMPUR:',\n", " 'type': 'PROPN',\n", " 'score': 1.0,\n", " 'beginOffset': 0,\n", " 'endOffset': 1},\n", " {'text': 'Sempena',\n", " 'type': 
'ADP',\n", " 'score': 1.0,\n", " 'beginOffset': 2,\n", " 'endOffset': 2},\n", " {'text': 'sambutan Aidilfitri minggu',\n", " 'type': 'NOUN',\n", " 'score': 1.0,\n", " 'beginOffset': 3,\n", " 'endOffset': 5},\n", " {'text': 'depan,',\n", " 'type': 'ADJ',\n", " 'score': 1.0,\n", " 'beginOffset': 6,\n", " 'endOffset': 6},\n", " {'text': 'Perdana Menteri Tun Dr Mahathir Mohamad',\n", " 'type': 'PROPN',\n", " 'score': 1.0,\n", " 'beginOffset': 7,\n", " 'endOffset': 12},\n", " {'text': 'dan',\n", " 'type': 'CCONJ',\n", " 'score': 1.0,\n", " 'beginOffset': 13,\n", " 'endOffset': 13},\n", " {'text': 'Menteri Pengangkutan Anthony Loke Siew Fook',\n", " 'type': 'PROPN',\n", " 'score': 1.0,\n", " 'beginOffset': 14,\n", " 'endOffset': 19},\n", " {'text': 'menitipkan',\n", " 'type': 'VERB',\n", " 'score': 1.0,\n", " 'beginOffset': 20,\n", " 'endOffset': 20},\n", " {'text': 'pesanan',\n", " 'type': 'NOUN',\n", " 'score': 1.0,\n", " 'beginOffset': 21,\n", " 'endOffset': 21},\n", " {'text': 'khas',\n", " 'type': 'ADJ',\n", " 'score': 1.0,\n", " 'beginOffset': 22,\n", " 'endOffset': 22},\n", " {'text': 'kepada',\n", " 'type': 'ADP',\n", " 'score': 1.0,\n", " 'beginOffset': 23,\n", " 'endOffset': 23},\n", " {'text': 'orang',\n", " 'type': 'NOUN',\n", " 'score': 1.0,\n", " 'beginOffset': 24,\n", " 'endOffset': 24},\n", " {'text': 'ramai',\n", " 'type': 'ADJ',\n", " 'score': 1.0,\n", " 'beginOffset': 25,\n", " 'endOffset': 25},\n", " {'text': 'yang',\n", " 'type': 'PRON',\n", " 'score': 1.0,\n", " 'beginOffset': 26,\n", " 'endOffset': 26},\n", " {'text': 'mahu',\n", " 'type': 'ADV',\n", " 'score': 1.0,\n", " 'beginOffset': 27,\n", " 'endOffset': 27},\n", " {'text': 'pulang',\n", " 'type': 'VERB',\n", " 'score': 1.0,\n", " 'beginOffset': 28,\n", " 'endOffset': 28},\n", " {'text': 'ke',\n", " 'type': 'ADP',\n", " 'score': 1.0,\n", " 'beginOffset': 29,\n", " 'endOffset': 29},\n", " {'text': 'kampung halaman',\n", " 'type': 'NOUN',\n", " 'score': 1.0,\n", " 'beginOffset': 30,\n", " 
'endOffset': 31},\n", " {'text': 'masing-masing.',\n", " 'type': 'DET',\n", " 'score': 1.0,\n", " 'beginOffset': 32,\n", " 'endOffset': 32},\n", " {'text': 'Dalam',\n", " 'type': 'ADP',\n", " 'score': 1.0,\n", " 'beginOffset': 33,\n", " 'endOffset': 33},\n", " {'text': 'video',\n", " 'type': 'NOUN',\n", " 'score': 1.0,\n", " 'beginOffset': 34,\n", " 'endOffset': 34},\n", " {'text': 'pendek',\n", " 'type': 'ADJ',\n", " 'score': 1.0,\n", " 'beginOffset': 35,\n", " 'endOffset': 35},\n", " {'text': 'terbitan',\n", " 'type': 'NOUN',\n", " 'score': 1.0,\n", " 'beginOffset': 36,\n", " 'endOffset': 36},\n", " {'text': 'Jabatan Keselamatan Jalan Raya',\n", " 'type': 'PROPN',\n", " 'score': 1.0,\n", " 'beginOffset': 37,\n", " 'endOffset': 40},\n", " {'text': '(JKJR)',\n", " 'type': 'PUNCT',\n", " 'score': 1.0,\n", " 'beginOffset': 41,\n", " 'endOffset': 41},\n", " {'text': 'itu,',\n", " 'type': 'DET',\n", " 'score': 1.0,\n", " 'beginOffset': 42,\n", " 'endOffset': 42},\n", " {'text': 'Dr Mahathir',\n", " 'type': 'PROPN',\n", " 'score': 1.0,\n", " 'beginOffset': 43,\n", " 'endOffset': 44},\n", " {'text': 'menasihati',\n", " 'type': 'VERB',\n", " 'score': 1.0,\n", " 'beginOffset': 45,\n", " 'endOffset': 45},\n", " {'text': 'mereka',\n", " 'type': 'PRON',\n", " 'score': 1.0,\n", " 'beginOffset': 46,\n", " 'endOffset': 46},\n", " {'text': 'supaya',\n", " 'type': 'SCONJ',\n", " 'score': 1.0,\n", " 'beginOffset': 47,\n", " 'endOffset': 47},\n", " {'text': 'berhenti berehat',\n", " 'type': 'VERB',\n", " 'score': 1.0,\n", " 'beginOffset': 48,\n", " 'endOffset': 49},\n", " {'text': 'dan',\n", " 'type': 'CCONJ',\n", " 'score': 1.0,\n", " 'beginOffset': 50,\n", " 'endOffset': 50},\n", " {'text': 'tidur',\n", " 'type': 'VERB',\n", " 'score': 1.0,\n", " 'beginOffset': 51,\n", " 'endOffset': 51},\n", " {'text': 'sebentar',\n", " 'type': 'NOUN',\n", " 'score': 1.0,\n", " 'beginOffset': 52,\n", " 'endOffset': 52},\n", " {'text': 'sekiranya',\n", " 'type': 'SCONJ',\n", " 'score': 1.0,\n", " 
'beginOffset': 53,\n", " 'endOffset': 53},\n", " {'text': 'mengantuk',\n", " 'type': 'ADJ',\n", " 'score': 1.0,\n", " 'beginOffset': 54,\n", " 'endOffset': 54},\n", " {'text': 'ketika',\n", " 'type': 'SCONJ',\n", " 'score': 1.0,\n", " 'beginOffset': 55,\n", " 'endOffset': 55}]}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.analyze(string)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Vectorize\n", "\n", "If you want to visualize word-level vectors in a lower dimension, you can use `model.vectorize`:\n", "\n", "```python\n", "def vectorize(self, string: str):\n", " \"\"\"\n", " vectorize a string.\n", "\n", " Parameters\n", " ----------\n", " string: List[str]\n", "\n", " Returns\n", " -------\n", " result: np.array\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "strings = [string, \n", " 'Husein baca buku Perlembagaan yang berharga 3k ringgit dekat kfc sungai petani minggu lepas, 2 ptg 2 oktober 2019 , suhu 32 celcius, sambil makan ayam goreng dan milo o ais',\n", " 'contact Husein at husein.zol05@gmail.com',\n", " 'tolong tempahkan meja makan makan nasi dagang dan jus apple, milo tarik esok dekat Restoran Sebulek']" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "r = [quantized_model.vectorize(string) for string in strings]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "x, y = [], []\n", "for row in r:\n", " x.extend([i[0] for i in row])\n", " y.extend([i[1] for i in row])" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(108, 2)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.manifold import TSNE\n", "import matplotlib.pyplot as plt\n", "\n", "tsne = TSNE().fit_transform(y)\n", "tsne.shape" ] }, { "cell_type": 
"code", "execution_count": 14, "metadata": { "scrolled": true }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize = (7, 7))\n", "plt.scatter(tsne[:, 0], tsne[:, 1])\n", "# annotate each point with its label from x; px and py avoid overwriting x and y\n", "labels = x\n", "for label, px, py in zip(\n", "    labels, tsne[:, 0], tsne[:, 1]\n", "):\n", "    label = (\n", "        '%s, %.3f' % (label[0], label[1])\n", "        if isinstance(label, list)\n", "        else label\n", "    )\n", "    plt.annotate(\n", "        label,\n", "        xy = (px, py),\n", "        xytext = (0, 0),\n", "        textcoords = 'offset points',\n", "    )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pretty good: the model is able to cluster words with similar parts of speech." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Voting stack model" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('KUALA', 'PROPN'),\n", " ('LUMPUR:', 'PROPN'),\n", " ('Sempena', 'ADP'),\n", " ('sambutan', 'NOUN'),\n", " ('Aidilfitri', 'PROPN'),\n", " ('minggu', 'NOUN'),\n", " ('depan,', 'ADJ'),\n", " ('Perdana', 'PROPN'),\n", " ('Menteri', 'PROPN'),\n", " ('Tun', 'PROPN'),\n", " ('Dr', 'PROPN'),\n", " ('Mahathir', 'PROPN'),\n", " ('Mohamad', 'PROPN'),\n", " ('dan', 'CCONJ'),\n", " ('Menteri', 'PROPN'),\n", " ('Pengangkutan', 'PROPN'),\n", " ('Anthony', 'PROPN'),\n", " ('Loke', 'PROPN'),\n", " ('Siew', 'PROPN'),\n", " ('Fook', 'PROPN'),\n", " ('menitipkan', 'VERB'),\n", " ('pesanan', 'NOUN'),\n", " ('khas', 'ADJ'),\n", " ('kepada', 'ADP'),\n", " ('orang', 'NOUN'),\n", " ('ramai', 'ADJ'),\n", " ('yang', 'PRON'),\n", " ('mahu', 'ADV'),\n", " ('pulang', 'VERB'),\n", " ('ke', 'ADP'),\n", " ('kampung', 'NOUN'),\n", " ('halaman', 'NOUN'),\n", " ('masing-masing.', 'ADV'),\n", " ('Dalam', 'ADP'),\n", " ('video', 'NOUN'),\n", " ('pendek', 'ADJ'),\n", " ('terbitan', 'NOUN'),\n", " ('Jabatan', 'NOUN'),\n", " ('Keselamatan', 'PROPN'),\n", " ('Jalan', 'PROPN'),\n", " ('Raya', 'PROPN'),\n", " ('(JKJR)', 'PUNCT'),\n", " ('itu,', 'DET'),\n", " ('Dr', 'PROPN'),\n", " ('Mahathir', 
'PROPN'),\n", " ('menasihati', 'VERB'),\n", " ('mereka', 'PRON'),\n", " ('supaya', 'SCONJ'),\n", " ('berhenti', 'VERB'),\n", " ('berehat', 'VERB'),\n", " ('dan', 'CCONJ'),\n", " ('tidur', 'VERB'),\n", " ('sebentar', 'ADV'),\n", " ('sekiranya', 'SCONJ'),\n", " ('mengantuk', 'ADJ'),\n", " ('ketika', 'SCONJ'),\n", " ('memandu.', 'VERB')]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alxlnet = malaya.pos.transformer(model = 'alxlnet')\n", "malaya.stack.voting_stack([model, alxlnet, alxlnet], string)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 2 }