Constituency Parsing#
This tutorial is available as an IPython notebook at Malaya/example/constituency.
This module is only trained on standard language structure, so it is not safe to use it for local (non-standard) language structure.
[1]:
import logging
logging.basicConfig(level=logging.INFO)
[2]:
%%time
import malaya
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
CPU times: user 5.73 s, sys: 1.12 s, total: 6.85 s
Wall time: 7.73 s
What is constituency parsing#
Assign a sentence its own syntactic structure, as defined by a certain standard. For example,
[2]:
from IPython.core.display import Image, display
display(Image('constituency.png', width=500))

Read more at the Stanford notes, https://web.stanford.edu/~jurafsky/slp3/13.pdf
The context-free grammar depends entirely on the language, so for Bahasa we follow https://github.com/famrashel/idn-treebank
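Parses are conventionally written in bracketed treebank notation, where each constituent is a parenthesized (label child ...) group. A generic Penn-style illustration (not output from this model):
(S (NP (DT The) (NN cat)) (VP (VBZ sleeps)))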
List available transformer Constituency models#
[3]:
malaya.constituency.available_transformer()
INFO:malaya.constituency:tested on test set at https://github.com/huseinzol05/malaya/blob/master/session/constituency/download-data.ipynb
[3]:
| Model | Size (MB) | Quantized Size (MB) | Recall | Precision | FScore | CompleteMatch | TaggingAccuracy |
|---|---|---|---|---|---|---|---|
| bert | 470.0 | 118.0 | 78.96 | 81.78 | 80.35 | 10.37 | 91.59 |
| tiny-bert | 125.0 | 31.8 | 74.89 | 78.79 | 76.79 | 9.01 | 91.17 |
| albert | 180.0 | 45.7 | 77.57 | 80.50 | 79.01 | 5.77 | 90.30 |
| tiny-albert | 56.7 | 14.5 | 67.21 | 74.89 | 70.84 | 2.11 | 87.75 |
| xlnet | 498.0 | 126.0 | 81.52 | 85.18 | 83.31 | 11.71 | 91.71 |
[3]:
string = 'Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'
Load xlnet constituency model#
def transformer(model: str = 'xlnet', quantized: bool = False, **kwargs):
"""
Load Transformer Constituency Parsing model, transfer learning Transformer + self attentive parsing.
Parameters
----------
model : str, optional (default='xlnet')
Model architecture supported. Allowed values:
* ``'bert'`` - Google BERT BASE parameters.
* ``'tiny-bert'`` - Google BERT TINY parameters.
* ``'albert'`` - Google ALBERT BASE parameters.
* ``'tiny-albert'`` - Google ALBERT TINY parameters.
* ``'xlnet'`` - Google XLNET BASE parameters.
quantized : bool, optional (default=False)
if True, will load 8-bit quantized model.
Quantized model is not necessarily faster, it depends entirely on the machine.
Returns
-------
result : malaya.model.tf.Constituency class
"""
[5]:
model = malaya.constituency.transformer(model = 'xlnet')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:73: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:75: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:68: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.
Load Quantized model#
To load an 8-bit quantized model, simply pass quantized = True; the default is False.
We can expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it depends entirely on the machine.
[4]:
quantized_model = malaya.constituency.transformer(model = 'xlnet', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:74: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:76: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:69: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.
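Since quantized speed is machine-dependent, you can time both models yourself on the same input. A minimal sketch using IPython's %timeit magic (parse_tree is described below):
[ ]:
# timings will vary by machine; run a few repeats for a fair comparison
%timeit -n 1 -r 3 model.parse_tree(string)
%timeit -n 1 -r 3 quantized_model.parse_tree(string)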
Parse into NLTK Tree#
Make sure you have already installed nltk; if not, simply:
pip install nltk
We prefer to parse into an NLTK tree, so we can play around with children / subtrees; see the sketch after the parse below.
def parse_nltk_tree(self, string: str):
"""
Parse a string into an NLTK Tree; to render the tree, make sure you have already installed tkinter.
Parameters
----------
string : str
Returns
-------
result: nltk.Tree object
"""
[10]:
tree = model.parse_nltk_tree(string)
[11]:
tree
[11]:

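Since this is a standard NLTK Tree, you can traverse it with the usual NLTK API, for example pulling out every noun phrase. A minimal sketch, assuming the treebank labels noun phrases as 'NP':
[ ]:
# print the words under every NP constituent;
# tree.subtrees() accepts a filter over subtrees
for subtree in tree.subtrees(lambda t: t.label() == 'NP'):
    print(subtree.label(), ' '.join(subtree.leaves()))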
[5]:
tree = quantized_model.parse_nltk_tree(string)
[6]:
tree
[6]:

Parse into Tree#
This is a simple Tree object defined at malaya.text.trees.
def parse_tree(self, string):
"""
Parse a string into a treebank-format tree.
Parameters
----------
string : str
Returns
-------
result: malaya.text.trees.InternalTreebankNode class
"""
[6]:
tree = model.parse_tree(string)
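The returned object can be serialized back into a bracketed treebank string. A minimal sketch, assuming malaya.text.trees follows benepar-style tree classes and exposes a linearize() method:
[ ]:
# linearize() is assumed from the benepar-style tree API
print(tree.linearize())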
Vectorize#
Let's say you want to visualize the word-level representations in a lower dimension; you can use model.vectorize,
def vectorize(self, string: str):
"""
vectorize a string.
Parameters
----------
string : str
Returns
-------
result: np.array
"""
[5]:
r = quantized_model.vectorize(string)
[7]:
x = [i[0] for i in r]  # tokens
y = [i[1] for i in r]  # per-token embedding vectors
[9]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
tsne = TSNE().fit_transform(y)
tsne.shape
[9]:
(14, 2)
[10]:
plt.figure(figsize = (7, 7))
plt.scatter(tsne[:, 0], tsne[:, 1])
labels = x
for label, x, y in zip(
labels, tsne[:, 0], tsne[:, 1]
):
label = (
'%s, %.3f' % (label[0], label[1])
if isinstance(label, list)
else label
)
plt.annotate(
label,
xy = (x, y),
xytext = (0, 0),
textcoords = 'offset points',
)
