Constituency Parsing#

This tutorial is available as an IPython notebook at Malaya/example/constituency.

This module only trained on standard language structure, so it is not save to use it for local language structure.

[1]:
%%time

import malaya
CPU times: user 2.83 s, sys: 3.71 s, total: 6.54 s
Wall time: 2.13 s
/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))

what is constituency parsing#

Assign a sentence into its own syntactic structure, defined by certain standardization. For example,

[2]:
from IPython.core.display import Image, display

display(Image('constituency.png', width=500))
_images/load-constituency_5_0.png

Read more at Stanford notes, https://web.stanford.edu/~jurafsky/slp3/13.pdf

The context free grammar totally depends on language, so for Bahasa, we follow https://github.com/aisingapore/seacorenlp-data/tree/main/id/constituency

List available HuggingFace Constituency models#

[2]:
malaya.constituency.available_huggingface
[2]:
{'mesolitica/constituency-parsing-t5-small-standard-bahasa-cased': {'Size (MB)': 247,
  'Recall': 81.62,
  'Precision': 83.32,
  'FScore': 82.46,
  'CompleteMatch': 22.4,
  'TaggingAccuracy': 94.95},
 'mesolitica/constituency-parsing-t5-base-standard-bahasa-cased': {'Size (MB)': 545,
  'Recall': 82.23,
  'Precision': 82.12,
  'FScore': 82.18,
  'CompleteMatch': 23.5,
  'TaggingAccuracy': 94.69}}
[3]:
malaya.constituency.info
[3]:
'Tested on https://github.com/aisingapore/seacorenlp-data/tree/main/id/constituency test set.'
[4]:
string = 'Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'

Load HuggingFace model#

def huggingface(
    model: str = 'mesolitica/constituency-parsing-t5-small-standard-bahasa-cased',
    force_check: bool = True,
    **kwargs,
):
    """
    Load HuggingFace model to Constituency parsing.

    Parameters
    ----------
    model: str, optional (default='mesolitica/constituency-parsing-t5-small-standard-bahasa-cased')
        Check available models at `malaya.constituency.available_huggingface`.
    force_check: bool, optional (default=True)
        Force check model one of malaya model.
        Set to False if you have your own huggingface model.

    Returns
    -------
    result: malaya.torch_model.huggingface.Constituency
    """
[11]:
model = malaya.constituency.huggingface('mesolitica/constituency-parsing-t5-small-standard-bahasa-cased')

Parse#

def predict(self, string):
    """
    Parse a string into malaya.function.constituency.trees_newline.InternalParseNode.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: malaya.function.constituency.trees_newline.InternalParseNode object
    """
[6]:
r = model.predict(string)
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.

Parse into NLTK Tree#

Make sure you already installed nltk, if not, simply,

pip install nltk svgling
[7]:
from nltk.tree import Tree
import svgling
[8]:
tree = Tree.fromstring(r.convert().linearize())
[10]:
svgling.draw_tree(tree)
[10]:
_images/load-constituency_18_0.svg