Constituency Parsing#

This tutorial is available as an IPython notebook at Malaya/example/constituency.

This module is only trained on standard language structure, so it is not safe to use it for local (non-standard) language structure.

[1]:
import logging

logging.basicConfig(level=logging.INFO)
[2]:
%%time

import malaya
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
CPU times: user 5.73 s, sys: 1.12 s, total: 6.85 s
Wall time: 7.73 s

What is constituency parsing#

Constituency parsing assigns a sentence its syntactic structure, defined by a chosen annotation standard. For example,

[2]:
from IPython.core.display import Image, display

display(Image('constituency.png', width=500))
_images/load-constituency_6_0.png

Read more in the Stanford notes, https://web.stanford.edu/~jurafsky/slp3/13.pdf

The context-free grammar depends entirely on the language, so for Bahasa we follow https://github.com/famrashel/idn-treebank
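
As a rough idea of what such a structure looks like in text form, here is a minimal hand-written sketch (not Malaya output) using the bracketed treebank notation that nltk can read directly,

from nltk import Tree

# hand-written bracketed parse of a short Malay noun phrase, for illustration only
toy = Tree.fromstring('(NP (NNP Dr) (NNP Mahathir))')
toy.pretty_print()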

List available transformer Constituency models#

[3]:
malaya.constituency.available_transformer()
INFO:malaya.constituency:tested on test set at https://github.com/huseinzol05/malaya/blob/master/session/constituency/download-data.ipynb
[3]:
Model        Size (MB)  Quantized Size (MB)  Recall  Precision  FScore  CompleteMatch  TaggingAccuracy
bert             470.0                118.0   78.96      81.78   80.35          10.37            91.59
tiny-bert        125.0                 31.8   74.89      78.79   76.79           9.01            91.17
albert           180.0                 45.7   77.57      80.50   79.01           5.77            90.30
tiny-albert       56.7                 14.5   67.21      74.89   70.84           2.11            87.75
xlnet            498.0                126.0   81.52      85.18   83.31          11.71            91.71
[3]:
string = 'Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'

Load xlnet constituency model#

def transformer(model: str = 'xlnet', quantized: bool = False, **kwargs):
    """
    Load Transformer Constituency Parsing model, transfer learning Transformer + self attentive parsing.

    Parameters
    ----------
    model : str, optional (default='xlnet')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.

    quantized : bool, optional (default=False)
        if True, will load an 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result : malaya.model.tf.Constituency class
    """
[5]:
model = malaya.constituency.transformer(model = 'xlnet')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:73: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:75: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:68: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

Load Quantized model#

To load an 8-bit quantized model, simply pass quantized = True; the default is False.

We can expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it depends entirely on the machine.

[4]:
quantized_model = malaya.constituency.transformer(model = 'xlnet', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:74: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:76: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:69: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

Parse into NLTK Tree#

Make sure you have nltk installed; if not, simply run,

pip install nltk

We prefer to parse into an NLTK Tree so we can play around with its children / subtrees; see the short traversal sketch after the cells below.

def parse_nltk_tree(self, string: str):

    """
    Parse a string into an NLTK Tree; to make it useful, make sure you already have tkinter installed.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: nltk.Tree object
    """
[10]:
tree = model.parse_nltk_tree(string)
[11]:
tree
[11]:
_images/load-constituency_17_0.png
[5]:
tree = quantized_model.parse_nltk_tree(string)
[6]:
tree
[6]:
_images/load-constituency_19_0.png
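
As a quick sketch of why an NLTK Tree is convenient, assuming tree is the nltk.Tree returned above, you can walk its subtrees and leaves,

# assumes `tree` is the nltk.Tree returned by parse_nltk_tree above
print(tree.leaves())  # surface tokens of the sentence
for subtree in tree.subtrees():
    # label() is the constituent tag, leaves() its covered tokens
    print(subtree.label(), ' '.join(subtree.leaves()))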

Parse into Tree#

This is a simple Tree object defined in malaya.text.trees.

def parse_tree(self, string):

    """
    Parse a string into treebank format.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: malaya.text.trees.InternalTreebankNode class
    """
[6]:
tree = model.parse_tree(string)
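
A minimal sketch of how you might walk the returned object, assuming malaya.text.trees mirrors the original self-attentive parser tree classes (internal nodes carrying label / children, leaves carrying tag / word, and a linearize() method returning the bracketed string); treat these attribute names as assumptions if your version differs,

# assumption: the node API mirrors the original self-attentive parser trees,
# i.e. internal nodes expose .label / .children and leaves expose .tag / .word
def walk(node, depth = 0):
    if hasattr(node, 'children'):
        print('  ' * depth + node.label)
        for child in node.children:
            walk(child, depth + 1)
    else:
        print('  ' * depth + node.tag + ' ' + node.word)

walk(tree)
print(tree.linearize())  # bracketed treebank string, if linearize() is available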

Vectorize#

Let's say you want to visualize word-level representations in a lower dimension; you can use model.vectorize,

def vectorize(self, string: str):
    """
    vectorize a string.

    Parameters
    ----------
    string: str

    Returns
    -------
    result: np.array
    """
[5]:
r = quantized_model.vectorize(string)
[7]:
x = [i[0] for i in r]  # tokens
y = [i[1] for i in r]  # corresponding word-level vectors
[9]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# project the word-level vectors down to 2-D with t-SNE
tsne = TSNE().fit_transform(y)
tsne.shape
[9]:
(14, 2)
[10]:
# scatter the 2-D t-SNE projection and annotate each point with its token
plt.figure(figsize = (7, 7))
plt.scatter(tsne[:, 0], tsne[:, 1])
labels = x
for label, x, y in zip(
    labels, tsne[:, 0], tsne[:, 1]
):
    label = (
        '%s, %.3f' % (label[0], label[1])
        if isinstance(label, list)
        else label
    )
    plt.annotate(
        label,
        xy = (x, y),
        xytext = (0, 0),
        textcoords = 'offset points',
    )
_images/load-constituency_26_0.png