Constituency Parsing

This tutorial is available as an IPython notebook at Malaya/example/constituency.

This module was trained only on standard language structure, so it is not safe to use on local language structure.

[1]:
%%time

import malaya
CPU times: user 5.92 s, sys: 1.6 s, total: 7.52 s
Wall time: 11.5 s

What is constituency parsing

Constituency parsing assigns a sentence its syntactic structure, a tree of nested phrases (constituents), following a particular annotation standard. For example,

[2]:
from IPython.display import Image, display

display(Image('constituency.png', width=500))
_images/load-constituency_5_0.png

Read more in the Stanford notes, https://web.stanford.edu/~jurafsky/slp3/13.pdf

The context-free grammar depends entirely on the language, so for Bahasa we follow https://github.com/famrashel/idn-treebank
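To make the idea of a constituency tree concrete, here is a minimal sketch in plain Python that reads a bracketed tree and lists its phrases. It does not use Malaya. The helper names (`parse_brackets`, `leaves`, `phrases`) and the tags in the toy string are assumptions made for illustration and are not taken from the treebank above.

```python
from typing import Iterator, List


def parse_brackets(text: str):
    """Parse a bracketed tree such as '(S (NP Dia) (VP tidur))'.

    Each node becomes a (label, children) tuple; a leaf is a plain word string.
    """
    tokens = text.replace('(', ' ( ').replace(')', ' ) ').split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == '(', 'expected an opening bracket'
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ')':
            if tokens[pos] == '(':
                children.append(read())  # nested constituent
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def leaves(node) -> List[str]:
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1] for word in leaves(child)]


def phrases(node, label: str) -> Iterator[str]:
    """Yield the text of every constituent carrying the given label."""
    if isinstance(node, str):
        return
    if node[0] == label:
        yield ' '.join(leaves(node))
    for child in node[1]:
        yield from phrases(child, label)


# Hypothetical tags, for illustration only.
toy = '(S (NP (NNP Dr) (NNP Mahathir)) (VP (VB menasihati) (NP (PRP mereka))))'
toy_tree = parse_brackets(toy)
print(leaves(toy_tree))
print(list(phrases(toy_tree, 'NP')))
```

Each bracket pair is one constituent, so asking for every `NP` returns the noun phrases at any depth of the tree.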

List available transformer Constituency models

[2]:
malaya.constituency.available_transformer()
INFO:root:tested on 20% test set.
[2]:
             Size (MB)  Quantized Size (MB)  Recall  Precision  FScore  CompleteMatch  TaggingAccuracy
bert             470.0                118.0   78.96      81.78   80.35          10.37            91.59
tiny-bert        125.0                 31.8   74.89      78.79   76.79           9.01            91.17
albert           180.0                 45.7   77.57      80.50   79.01           5.77            90.30
tiny-albert       56.7                 14.5   67.21      74.89   70.84           2.11            87.75
xlnet            498.0                126.0   81.52      85.18   83.31          11.71            91.71

Check the accuracy chart before selecting a model, https://malaya.readthedocs.io/en/latest/Accuracy.html#constituency-parsing

The best model in terms of accuracy is XLNET.
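When size matters as much as accuracy, the table can be turned into a small lookup. The sketch below copies the numbers from the table above and picks the highest FScore that fits a size budget. The helper names `f_score` and `best_under_budget` are hypothetical. It also assumes that the reported scores belong to the non-quantized models, so using them to rank quantized models is only an approximation.

```python
# (size MB, quantized size MB, recall, precision, f-score), copied from the table above
MODELS = {
    'bert': (470.0, 118.0, 78.96, 81.78, 80.35),
    'tiny-bert': (125.0, 31.8, 74.89, 78.79, 76.79),
    'albert': (180.0, 45.7, 77.57, 80.50, 79.01),
    'tiny-albert': (56.7, 14.5, 67.21, 74.89, 70.84),
    'xlnet': (498.0, 126.0, 81.52, 85.18, 83.31),
}


def f_score(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision."""
    return 2 * recall * precision / (recall + precision)


def best_under_budget(budget_mb: float, quantized: bool = False):
    """Return the model name with the highest FScore whose size fits the budget."""
    size_index = 1 if quantized else 0
    fitting = {
        name: row for name, row in MODELS.items() if row[size_index] <= budget_mb
    }
    if not fitting:
        return None  # nothing fits the budget
    return max(fitting, key=lambda name: fitting[name][4])


# The FScore column agrees with the harmonic mean of the Recall and Precision columns.
for name, (_, _, recall, precision, reported) in MODELS.items():
    print(name, round(f_score(recall, precision), 2), reported)

print(best_under_budget(200))
print(best_under_budget(50, quantized=True))
```

The returned name can then be passed as the `model` argument of `malaya.constituency.transformer`.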

[3]:
string = 'Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'

Load xlnet constituency model

[5]:
model = malaya.constituency.transformer(model = 'xlnet')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:73: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:75: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:68: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

Load Quantized model

To load the 8-bit quantized model, pass quantized = True; the default is False.

Expect a slight accuracy drop from the quantized model. It is not necessarily faster than the normal 32-bit float model; that depends entirely on the machine.
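To see where the accuracy drop comes from, the sketch below maps a few floats to 8-bit codes and back. It shows generic linear quantization in plain Python and is not Malaya's actual quantization procedure. The weights are made up, and the helper names `quantize` and `dequantize` are hypothetical.

```python
def quantize(values):
    """Map floats linearly onto the integer codes 0..255."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 when all values are equal
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate floats from the 8-bit codes."""
    return [code * scale + lo for code in codes]


# Made-up weights, for illustration only.
weights = [-0.82, -0.1, 0.0, 0.033, 0.47, 1.25]
codes, scale, lo = quantize(weights)
restored = dequantize(codes, scale, lo)

# The rounding error of each weight is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes)
print(max_error, scale / 2)
```

Storing one byte per weight instead of four is consistent with the quantized sizes in the model table being roughly a quarter of the full sizes. The small per-weight rounding error is what can cost some accuracy.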

[4]:
quantized_model = malaya.constituency.transformer(model = 'xlnet', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:74: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:76: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:69: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

Parse into NLTK Tree

Make sure nltk is installed. If it is not, run,

pip install nltk

We prefer to parse into an NLTK tree, so we can work with its children and subtrees.

[10]:
tree = model.parse_nltk_tree(string)
[11]:
tree
[11]:
_images/load-constituency_17_0.png
[5]:
tree = quantized_model.parse_nltk_tree(string)
[6]:
tree
[6]:
_images/load-constituency_19_0.png

Parse into Tree

This is a simple Tree object defined at malaya.text.trees.

[6]:
tree = model.parse_tree(string)

Vectorize

Say you want to visualize word-level vectors in a lower dimension; you can use model.vectorize,

def vectorize(self, string: str):
    """
    vectorize a string.

    Parameters
    ----------
    string: str

    Returns
    -------
    result: np.array
    """
[5]:
r = quantized_model.vectorize(string)
[7]:
x = [i[0] for i in r]
y = [i[1] for i in r]
[9]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

tsne = TSNE().fit_transform(y)
tsne.shape
[9]:
(14, 2)
[10]:
plt.figure(figsize = (7, 7))
plt.scatter(tsne[:, 0], tsne[:, 1])
labels = x
# px, py are the 2-D coordinates; separate names keep x and y from being overwritten
for label, px, py in zip(
    labels, tsne[:, 0], tsne[:, 1]
):
    label = (
        '%s, %.3f' % (label[0], label[1])
        if isinstance(label, list)
        else label
    )
    # write each word at the position of its point
    plt.annotate(
        label,
        xy = (px, py),
        xytext = (0, 0),
        textcoords = 'offset points',
    )
_images/load-constituency_26_0.png
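Besides plotting, the (word, vector) pairs can be compared directly. The sketch below assumes the output of vectorize is a list of (word, vector) pairs, as the indexing `i[0]` and `i[1]` above suggests. The three-dimensional vectors are made up, and `cosine` and `most_similar` are hypothetical helpers written in plain Python.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(pairs, index):
    """Return (word, similarity) for the token closest to pairs[index]."""
    _, query = pairs[index]
    scored = [
        (word, cosine(query, vector))
        for i, (word, vector) in enumerate(pairs)
        if i != index  # skip the query token itself
    ]
    return max(scored, key=lambda item: item[1])


# Made-up vectors standing in for the output of vectorize.
toy_pairs = [
    ('berehat', [0.9, 0.1, 0.0]),
    ('tidur', [0.8, 0.2, 0.1]),
    ('memandu', [0.0, 0.1, 0.9]),
]
closest_word, similarity = most_similar(toy_pairs, 0)
print(closest_word, round(similarity, 3))
```

With real output, `toy_pairs` would be replaced by `r`, with each vector converted to a list or used as a NumPy array.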