Constituency Parsing#

This module was trained only on standard language structure, so it is not safe to use on local (colloquial) language structure.

What is constituency parsing#

Constituency parsing assigns a sentence its syntactic structure: the sentence is broken into nested phrases (constituents), each labelled according to a particular annotation standard.
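
As a rough illustration of what such a structure looks like, the sketch below represents a constituency tree as nested tuples and prints it in bracketed treebank notation. The English sentence and the Penn Treebank-style labels (`S`, `NP`, `VP`, ...) are assumptions chosen for illustration only; they are not the label set used by this module, and the helper functions are hypothetical, not part of Malaya.

```python
# A constituency tree as nested tuples: (label, child, child, ...).
# Leaves are plain strings. Sentence and labels are illustrative assumptions.
tree = (
    "S",
    ("NP", ("DT", "The"), ("NN", "cat")),
    ("VP",
        ("VBD", "sat"),
        ("PP", ("IN", "on"), ("NP", ("DT", "the"), ("NN", "mat")))),
)


def to_brackets(node):
    """Render a node in bracketed treebank notation."""
    if isinstance(node, str):
        return node
    label, *children = node
    return "(" + label + " " + " ".join(to_brackets(c) for c in children) + ")"


def leaves(node):
    """Return the words covered by a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1:] for word in leaves(child)]


def phrases(node):
    """Collect (label, text) for every constituent spanning more than one word."""
    if isinstance(node, str):
        return []
    found = []
    words = leaves(node)
    if len(words) > 1:
        found.append((node[0], " ".join(words)))
    for child in node[1:]:
        found.extend(phrases(child))
    return found


bracketed = to_brackets(tree)
print(bracketed)
for label, text in phrases(tree):
    print(label, "->", text)
```

Each printed pair is one constituent: the whole sentence, then the phrases nested inside it.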

Read more in the Stanford notes.

The context-free grammar depends entirely on the language, so for Bahasa, we follow

List available transformer Constituency models#

INFO:malaya.constituency:tested on test set at

Model        Size (MB)  Quantized Size (MB)  Recall  Precision  FScore  CompleteMatch  TaggingAccuracy
bert             470.0                118.0   78.96      81.78   80.35          10.37            91.59
tiny-bert        125.0                 31.8   74.89      78.79   76.79           9.01            91.17
albert           180.0                 45.7   77.57      80.50   79.01           5.77            90.30
tiny-albert       56.7                 14.5   67.21      74.89   70.84           2.11            87.75
xlnet            498.0                126.0   81.52      85.18   83.31          11.71            91.71

string = 'Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'
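
To help read the table above: the FScore column appears to be the harmonic mean of the Recall and Precision columns. The source does not state this explicitly, so treat it as an assumption; the short check below recomputes it from the reported numbers and compares it with the reported value. The `f_score` helper is hypothetical and not part of Malaya.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1), both given in percent."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision, reported FScore), copied from the table above.
reported = {
    "bert": (78.96, 81.78, 80.35),
    "tiny-bert": (74.89, 78.79, 76.79),
    "albert": (77.57, 80.50, 79.01),
    "tiny-albert": (67.21, 74.89, 70.84),
    "xlnet": (81.52, 85.18, 83.31),
}

for name, (recall, precision, fscore) in reported.items():
    recomputed = f_score(precision, recall)
    print(f"{name:12s} reported={fscore:.2f} recomputed={recomputed:.2f}")
```

All five rows agree with the reported values to within rounding, which supports reading FScore as F1.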

Load xlnet constituency model#

def transformer(model: str = 'xlnet', quantized: bool = False, **kwargs):
    """
    Load a Transformer Constituency Parsing model: transfer learning from a Transformer plus self-attentive parsing.

    Parameters
    ----------
    model : str, optional (default='xlnet')
        Model architecture supported. Allowed values:

        * ``'bert'`` - Google BERT BASE parameters.
        * ``'tiny-bert'`` - Google BERT TINY parameters.
        * ``'albert'`` - Google ALBERT BASE parameters.
        * ``'tiny-albert'`` - Google ALBERT TINY parameters.
        * ``'xlnet'`` - Google XLNET BASE parameters.

    quantized : bool, optional (default=False)
        If True, will load the 8-bit quantized model.
        The quantized model is not necessarily faster; this depends entirely on the machine.

    Returns
    -------
    result : class
    """

import malaya

model = malaya.constituency.transformer(model = 'xlnet')

Load Quantized model#

To load the 8-bit quantized model, pass quantized = True; the default is False.

Expect a slight accuracy drop from the quantized model. It is also not necessarily faster than the normal 32-bit float model; this depends entirely on the machine.
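
To give an intuition for where the accuracy drop comes from, here is a minimal sketch of 8-bit quantization applied to a handful of float values. It uses a generic min/max affine scheme as an assumption; the source does not say which quantization scheme the Malaya models actually use, and the `quantize` / `dequantize` helpers and the example weights are hypothetical.

```python
def quantize(values, num_bits=8):
    """Map floats onto integer codes in [0, 2**num_bits - 1] (affine, min/max based)."""
    lo, hi = min(values), max(values)
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 if all values are equal
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate floats from the integer codes."""
    return [lo + c * scale for c in codes]


# Made-up weights, for illustration only.
weights = [-0.42, -0.1337, 0.0, 0.25, 0.9001]

codes, scale, lo = quantize(weights)
restored = dequantize(codes, scale, lo)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("codes:    ", codes)
print("restored: ", [round(v, 4) for v in restored])
print("max error:", max_error, "(bounded by scale / 2 =", scale / 2, ")")
```

Each value is stored in one byte instead of four, which matches the roughly 4x smaller sizes in the table (for example 470.0 MB versus 118.0 MB for bert), at the cost of a small rounding error per value.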

quantized_model = malaya.constituency.transformer(model = 'xlnet', quantized = True)

WARNING:root:Load quantized model will cause accuracy drop.

Parse into NLTK Tree#

Make sure nltk is installed; if not, run:

pip install nltk

We prefer to parse into an NLTK tree, so we can work with its children and subtrees.

def parse_nltk_tree(self, string: str):
    """
    Parse a string into an NLTK Tree. To make it useful, make sure tkinter is installed.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: nltk.Tree object
    """

tree = model.parse_nltk_tree(string)
tree = quantized_model.parse_nltk_tree(string)

Parse into Tree#

This is a simple Tree object defined at malaya.text.trees.

def parse_tree(self, string):
    """
    Parse a string into string treebank format.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: malaya.text.trees.InternalTreebankNode class
    """

tree = model.parse_tree(string)


Let's say you want to visualize word-level vectors in a lower dimension; you can use model.vectorize:

def vectorize(self, string: str):
    """
    Vectorize a string.

    Parameters
    ----------
    string: str

    Returns
    -------
    result: np.array
    """

r = quantized_model.vectorize(string)
# each item in r is a (word, vector) pair
x = [i[0] for i in r]
y = [i[1] for i in r]
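from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Only 14 tokens here; recent scikit-learn versions require perplexity < number of samples,
# so the default (30) is lowered. The value 5 is an arbitrary choice.
tsne = TSNE(perplexity = 5).fit_transform(y)
tsne.shape
(14, 2)
plt.figure(figsize = (7, 7))
plt.scatter(tsne[:, 0], tsne[:, 1])
labels = x
# annotate every point with its word
for label, px, py in zip(
    labels, tsne[:, 0], tsne[:, 1]
):
    label = (
        '%s, %.3f' % (label[0], label[1])
        if isinstance(label, list)
        else label
    )
    plt.annotate(
        label,
        xy = (px, py),
        xytext = (0, 0),
        textcoords = 'offset points',
    )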
