Constituency Parsing#

This tutorial is available as an IPython notebook at Malaya/example/constituency.

This module is trained only on standard language structure, so it is not safe to use on local language structure.

[1]:

%%time

import malaya

CPU times: user 2.83 s, sys: 3.71 s, total: 6.54 s
Wall time: 2.13 s

/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))

What is constituency parsing#

Constituency parsing assigns a sentence its syntactic structure, as defined by a chosen grammar standard. For example,

[2]:

from IPython.display import Image, display

display(Image('constituency.png', width=500))

Read more in the Stanford notes: https://web.stanford.edu/~jurafsky/slp3/13.pdf

The context-free grammar depends entirely on the language, so for Bahasa we follow https://github.com/aisingapore/seacorenlp-data/tree/main/id/constituency
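To make the idea of a grammar-defined structure concrete, here is a minimal sketch in plain Python. The grammar, the labels (`S`, `NP`, `VP`, `NOUN`, `VERB`) and the sentence are toy assumptions made up for illustration; they are not the label set used by the seacorenlp data or by Malaya. The sketch renders a tree as a bracketed string and checks that every internal node is licensed by a rule in the grammar.

```python
# Hypothetical toy grammar: left-hand side -> set of allowed right-hand sides.
# These rules are illustrative only, not the grammar used by Malaya.
GRAMMAR = {
    "S": {("NP", "VP")},
    "NP": {("NOUN",)},
    "VP": {("VERB", "NP")},
}

# A tree is a tuple (label, child, ...); a pre-terminal is (tag, "word").
TREE = (
    "S",
    ("NP", ("NOUN", "Ali")),
    ("VP", ("VERB", "membaca"), ("NP", ("NOUN", "buku"))),
)


def is_preterminal(node):
    """A pre-terminal holds exactly one word, e.g. ("NOUN", "buku")."""
    return len(node) == 2 and isinstance(node[1], str)


def bracketed(node):
    """Render the tree as a bracketed string."""
    if is_preterminal(node):
        return f"({node[0]} {node[1]})"
    children = " ".join(bracketed(child) for child in node[1:])
    return f"({node[0]} {children})"


def licensed(node):
    """True if every internal node matches some rule in GRAMMAR."""
    if is_preterminal(node):
        return True
    rhs = tuple(child[0] for child in node[1:])
    if rhs not in GRAMMAR.get(node[0], set()):
        return False
    return all(licensed(child) for child in node[1:])


print(bracketed(TREE))
print(licensed(TREE))
```

A different language, or a different annotation standard for the same language, would change `GRAMMAR` and therefore which trees count as valid.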

List available HuggingFace Constituency models#

[2]:

malaya.constituency.available_huggingface

[2]:

{'mesolitica/constituency-parsing-t5-small-standard-bahasa-cased': {'Size (MB)': 247,
  'Recall': 81.62,
  'Precision': 83.32,
  'FScore': 82.46,
  'CompleteMatch': 22.4,
  'TaggingAccuracy': 94.95},
 'mesolitica/constituency-parsing-t5-base-standard-bahasa-cased': {'Size (MB)': 545,
  'Recall': 82.23,
  'Precision': 82.12,
  'FScore': 82.18,
  'CompleteMatch': 23.5,
  'TaggingAccuracy': 94.69}}
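The listing above can be used to pick a model programmatically. The sketch below copies the reported scores into a plain dictionary (so it runs without Malaya installed) and selects the entry with the highest `FScore`. It also recomputes the score under the assumption that `FScore` is the usual harmonic mean of precision and recall; the source does not state the formula, so treat that part as a sanity check rather than a definition.

```python
# Scores copied from the `available_huggingface` listing above.
MODELS = {
    "mesolitica/constituency-parsing-t5-small-standard-bahasa-cased": {
        "Size (MB)": 247, "Recall": 81.62, "Precision": 83.32, "FScore": 82.46,
    },
    "mesolitica/constituency-parsing-t5-base-standard-bahasa-cased": {
        "Size (MB)": 545, "Recall": 82.23, "Precision": 82.12, "FScore": 82.18,
    },
}


def f1(precision, recall):
    """Harmonic mean of precision and recall (assumed definition of FScore)."""
    return 2 * precision * recall / (precision + recall)


# Pick the model with the highest reported FScore.
best = max(MODELS, key=lambda name: MODELS[name]["FScore"])

for name, scores in MODELS.items():
    recomputed = f1(scores["Precision"], scores["Recall"])
    print(f"{name}: reported {scores['FScore']}, recomputed {recomputed:.2f}")

print("Highest FScore:", best)
```

On these numbers the smaller model has the higher reported `FScore`, while the base model has the higher `Recall` and `CompleteMatch`, so the right choice depends on which metric matters for the task.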

[3]:

malaya.constituency.info

[3]:

'Tested on https://github.com/aisingapore/seacorenlp-data/tree/main/id/constituency test set.'

[4]:

string = 'Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'

Load HuggingFace model#

def huggingface(
    model: str = 'mesolitica/constituency-parsing-t5-small-standard-bahasa-cased',
    force_check: bool = True,
    **kwargs,
):
    """
    Load a HuggingFace model for constituency parsing.

    Parameters
    ----------
    model: str, optional (default='mesolitica/constituency-parsing-t5-small-standard-bahasa-cased')
        Check available models at `malaya.constituency.available_huggingface`.
    force_check: bool, optional (default=True)
        Force a check that the model is one of the Malaya models.
        Set to False if you have your own HuggingFace model.

    Returns
    -------
    result: malaya.torch_model.huggingface.Constituency
    """

[11]:

model = malaya.constituency.huggingface('mesolitica/constituency-parsing-t5-small-standard-bahasa-cased')

Parse#

def predict(self, string):
    """
    Parse a string into malaya.function.constituency.trees_newline.InternalParseNode.

    Parameters
    ----------
    string : str

    Returns
    -------
    result: malaya.function.constituency.trees_newline.InternalParseNode object
    """

[6]:

r = model.predict(string)

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.

Parse into NLTK Tree#

Make sure nltk and svgling are installed; if not, run:

pip install nltk svgling

[7]:

from nltk.tree import Tree
import svgling

[8]:

tree = Tree.fromstring(r.convert().linearize())

[10]:

svgling.draw_tree(tree)
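Since `Tree.fromstring` accepts the output of `linearize()`, that output is presumably a Penn-Treebank-style bracketed string. As a rough sketch of what such a string encodes, the pure-Python reader below turns a bracketed string into nested lists and lists every constituent with the words it covers. The sample string and its labels are hypothetical, hand-written for illustration, and are not actual model output.

```python
def read_bracketed(text):
    """Parse a Penn-Treebank-style bracketed string into nested lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        pos += 1
        node = [tokens[pos]]  # constituent label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse())  # nested constituent
            else:
                node.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # skip the closing ")"
        return node

    return parse()


def leaves(node):
    """Return the words covered by a node, left to right."""
    words = []
    for child in node[1:]:
        words.extend(leaves(child) if isinstance(child, list) else [child])
    return words


def constituents(node):
    """Yield (label, phrase) for every bracketed node, top-down."""
    yield node[0], " ".join(leaves(node))
    for child in node[1:]:
        if isinstance(child, list):
            yield from constituents(child)


# Hypothetical sample, not produced by the model.
SAMPLE = "(S (NP (NOUN Ali)) (VP (VERB membaca) (NP (NOUN buku))))"
sample_tree = read_bracketed(SAMPLE)

for label, phrase in constituents(sample_tree):
    print(f"{label:5} {phrase}")
```

In practice the NLTK `Tree` object already offers this kind of traversal; the sketch only shows what the bracketed notation means.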

[10]:
