Constituency Parsing


import malaya

What is constituency parsing

Constituency parsing assigns a sentence its syntactic structure: a tree of nested constituents (phrases), labelled according to a particular annotation standard. For example,

from IPython.display import Image, display

display(Image('constituency.png', width=500))
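Constituency trees are commonly written in a bracketed form, where each pair of parentheses holds a label followed by its children. The sketch below is a minimal, standard-library-only illustration of that notation. The helper names `parse_bracketed` and `leaves` are hypothetical and not part of malaya, and the English sentence with Penn Treebank style tags is a hand-written example, not model output.

```python
def parse_bracketed(text):
    """Parse '(S (NP ...) (VP ...))' into nested (label, children) tuples."""
    # Pad the parentheses so a plain split() yields one token per symbol.
    tokens = text.replace('(', ' ( ').replace(')', ' ) ').split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == '(', 'a node must start with "("'
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ')':
            if tokens[pos] == '(':
                children.append(read_node())  # nested constituent
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume the closing ')'
        return (label, children)

    return read_node()


def leaves(node):
    """Collect the words of a parsed tree from left to right."""
    _, children = node
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


# Hand-written example: S -> NP VP, NP -> DT NN, VP -> VBD
example = '(S (NP (DT The) (NN cat)) (VP (VBD sat)))'
parsed = parse_bracketed(example)
print(parsed[0])       # label of the root constituent
print(leaves(parsed))  # the original words, in order
```

Each inner tuple is one constituent, so the nesting of the tuples mirrors the nesting of phrases in the sentence.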

Read more at Stanford notes,

The context-free grammar depends entirely on the language, so for Bahasa, we follow

List available transformer Constituency models

{'bert': ['470.0 MB',
  'Recall: 78.96',
  'Precision: 81.78',
  'FScore: 80.35',
  'CompleteMatch: 10.37',
  'TaggingAccuracy: 91.59'],
 'tiny-bert': ['125 MB',
  'Recall: 74.89',
  'Precision: 78.79',
  'FScore: 76.79',
  'CompleteMatch: 9.01',
  'TaggingAccuracy: 91.17'],
 'albert': ['180.0 MB',
  'Recall: 77.57',
  'Precision: 80.50',
  'FScore: 79.01',
  'CompleteMatch: 5.77',
  'TaggingAccuracy: 90.30'],
 'tiny-albert': ['56.7 MB',
  'Recall: 67.21',
  'Precision: 74.89',
  'FScore: 70.84',
  'CompleteMatch: 2.11',
  'TaggingAccuracy: 87.75'],
 'xlnet': ['498.0 MB',
  'Recall: 80.65',
  'Precision: 82.22',
  'FScore: 81.43',
  'CompleteMatch: 11.08',
  'TaggingAccuracy: 92.12']}

Check the accuracy chart from here before selecting a model,

The best model in terms of accuracy is XLNet, with the highest F-score (81.43) and tagging accuracy (92.12) in the list above.
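To compare the models programmatically, the listing above can be parsed and ranked. This is a minimal sketch that copies the dictionary as shown and assumes every entry keeps the same shape (a size string followed by 'Name: value' strings). The helper `parse_entry` is hypothetical, not a malaya function.

```python
# Copied from the listing above.
models = {
    'bert': ['470.0 MB', 'Recall: 78.96', 'Precision: 81.78',
             'FScore: 80.35', 'CompleteMatch: 10.37', 'TaggingAccuracy: 91.59'],
    'tiny-bert': ['125 MB', 'Recall: 74.89', 'Precision: 78.79',
                  'FScore: 76.79', 'CompleteMatch: 9.01', 'TaggingAccuracy: 91.17'],
    'albert': ['180.0 MB', 'Recall: 77.57', 'Precision: 80.50',
               'FScore: 79.01', 'CompleteMatch: 5.77', 'TaggingAccuracy: 90.30'],
    'tiny-albert': ['56.7 MB', 'Recall: 67.21', 'Precision: 74.89',
                    'FScore: 70.84', 'CompleteMatch: 2.11', 'TaggingAccuracy: 87.75'],
    'xlnet': ['498.0 MB', 'Recall: 80.65', 'Precision: 82.22',
              'FScore: 81.43', 'CompleteMatch: 11.08', 'TaggingAccuracy: 92.12'],
}


def parse_entry(entry):
    """Split ['470.0 MB', 'Recall: 78.96', ...] into (size_mb, {metric: value})."""
    size, *scores = entry
    size_mb = float(size.split()[0])
    metrics = {}
    for item in scores:
        name, value = item.split(': ')
        metrics[name] = float(value)
    return size_mb, metrics


# Rank model names by F-score, best first.
ranked = sorted(models, key=lambda name: parse_entry(models[name])[1]['FScore'],
                reverse=True)

for name in ranked:
    size_mb, metrics = parse_entry(models[name])
    print(f"{name:12s} {size_mb:6.1f} MB  FScore {metrics['FScore']:.2f}")
```

Ranking by F-score alone ignores model size; tiny-albert is the smallest model in the list but also has the lowest F-score, so the right choice depends on the memory and accuracy budget.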

string = 'Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'

Load xlnet constituency model

model = malaya.constituency.transformer(model = 'xlnet')

Parse into NLTK Tree

Make sure nltk is installed; if not, run,

pip install nltk

We prefer parsing into an NLTK tree, because it makes it easy to work with children and subtrees.

tree = model.parse_nltk_tree(string)
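As a rough illustration of what working with children and subtrees looks like, the sketch below builds a small `nltk.Tree` by hand and walks it. The bracketed string and its tags are hypothetical, written only for this example; they are not the output of the model for the sentence above. Only standard `nltk.Tree` methods are used.

```python
from nltk import Tree

# Hand-written toy tree (NOT actual model output), for illustration only.
toy = Tree.fromstring(
    '(S (NP (NNP Dr) (NNP Mahathir)) (VP (VB menasihati) (NP (PRP mereka))))'
)

# Root label and the words in sentence order.
print(toy.label())
print(toy.leaves())

# Direct children of the root are themselves Tree objects.
for child in toy:
    print(child.label(), child.leaves())

# Collect every noun phrase anywhere in the tree.
noun_phrases = [
    ' '.join(subtree.leaves())
    for subtree in toy.subtrees(filter=lambda t: t.label() == 'NP')
]
print(noun_phrases)

# (word, tag) pairs taken from the pre-terminal nodes.
print(toy.pos())
```

If `parse_nltk_tree` returns an `nltk.Tree`, as its name suggests, the same calls should apply to the `tree` object above.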

Parse into Tree

This is a simple Tree object defined in malaya.text.trees.

tree = model.parse_tree(string)