Dependency Parsing¶
This tutorial is available as an IPython notebook at Malaya/example/dependency.
This module is only trained on standard language structure, so it is not safe to use it for local (non-standard) language structure.
[1]:
%%time
import malaya
CPU times: user 4.67 s, sys: 658 ms, total: 5.33 s
Wall time: 4.62 s
Describe supported dependencies¶
[2]:
malaya.dependency.describe()
INFO:root:you can read more from https://universaldependencies.org/treebanks/id_pud/index.html
[2]:
 | Tag | Description
---|---|---
0 | acl | clausal modifier of noun |
1 | advcl | adverbial clause modifier |
2 | advmod | adverbial modifier |
3 | amod | adjectival modifier |
4 | appos | appositional modifier |
5 | aux | auxiliary |
6 | case | case marking |
7 | ccomp | clausal complement |
8 | cc | coordinating conjunction
9 | compound | compound |
10 | compound:plur | plural compound |
11 | conj | conjunct |
12 | cop | copula
13 | csubj | clausal subject |
14 | dep | dependent |
15 | det | determiner |
16 | fixed | multi-word expression |
17 | flat | name |
18 | iobj | indirect object |
19 | mark | marker |
20 | nmod | nominal modifier |
21 | nsubj | nominal subject |
22 | obj | direct object |
23 | parataxis | parataxis |
24 | root | root |
25 | xcomp | open clausal complement |
[3]:
string = 'Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'
List available transformer Dependency models¶
[4]:
malaya.dependency.available_transformer()
INFO:root:tested on 20% test set.
[4]:
 | Size (MB) | Quantized Size (MB) | Arc Accuracy | Types Accuracy | Root Accuracy
---|---|---|---|---|---
bert | 426.0 | 112.0 | 0.855 | 0.848 | 0.920 |
tiny-bert | 59.5 | 15.7 | 0.718 | 0.694 | 0.886 |
albert | 50.0 | 13.2 | 0.811 | 0.793 | 0.879 |
tiny-albert | 24.8 | 6.6 | 0.708 | 0.673 | 0.817 |
xlnet | 450.2 | 119.0 | 0.931 | 0.925 | 0.947 |
alxlnet | 50.0 | 14.3 | 0.894 | 0.886 | 0.942 |
Make sure you check the accuracy chart here before selecting a model, https://malaya.readthedocs.io/en/latest/Accuracy.html#dependency-parsing
The best model in terms of accuracy is XLNET.
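A minimal sketch (assuming available_transformer() returns the table above as a pandas DataFrame indexed by model name) to pick the strongest model programmatically:

df = malaya.dependency.available_transformer()
# Pick the model with the highest arc accuracy; expected to be 'xlnet'.
best = df['Arc Accuracy'].astype(float).idxmax()
print(best)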
Load xlnet dependency model¶
[5]:
model = malaya.dependency.transformer(model = 'xlnet')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:54: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:55: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:49: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.
Load Quantized model¶
To load an 8-bit quantized model, simply pass quantized = True; the default is False.
Expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; the speedup depends entirely on the machine.
[10]:
quantized_model = malaya.dependency.transformer(model = 'xlnet', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
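Since any speedup from quantization is machine dependent, a rough timing comparison like the sketch below (using the sentence defined earlier) is the simplest way to check on your own hardware:

import time

# Compare wall-clock time of a single predict() call for both models;
# numbers will vary across machines and runs.
for name, m in [('float32', model), ('quantized', quantized_model)]:
    start = time.time()
    m.predict(string)
    print(name, round(time.time() - start, 3), 'seconds')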
[6]:
d_object, tagging, indexing = model.predict(string)
d_object.to_graphvis()
[6]:
[11]:
d_object, tagging, indexing = quantized_model.predict(string)
d_object.to_graphvis()
[11]:
Voting stack model¶
[8]:
alxlnet = malaya.dependency.transformer(model = 'alxlnet')
tagging, indexing = malaya.stack.voting_stack([model, alxlnet, model], string)
malaya.dependency.dependency_graph(tagging, indexing).to_graphvis()
downloading frozen /Users/huseinzolkepli/Malaya/dependency/alxlnet/base model
51.0MB [00:50, 1.01MB/s]
[8]:
Dependency graph object¶
To initiate a dependency graph from dependency models, you need to call malaya.dependency.dependency_graph.
[9]:
graph = malaya.dependency.dependency_graph(tagging, indexing)
graph
[9]:
<malaya.function.parse_dependency.DependencyGraph at 0x164e67e90>
Generate graphvis¶
[10]:
graph.to_graphvis()
[10]:
Get nodes¶
[11]:
graph.nodes
[11]:
defaultdict(<function malaya.function.parse_dependency.DependencyGraph.__init__.<locals>.<lambda>()>,
{0: {'address': 0,
'word': None,
'lemma': None,
'ctag': 'TOP',
'tag': 'TOP',
'feats': None,
'head': None,
'deps': defaultdict(list, {'root': [3]}),
'rel': None},
1: {'address': 1,
'word': 'Dr',
'lemma': '_',
'ctag': '_',
'tag': '_',
'feats': '_',
'head': 3,
'deps': defaultdict(list, {'flat': [2]}),
'rel': 'nsubj'},
3: {'address': 3,
'word': 'menasihati',
'lemma': '_',
'ctag': '_',
'tag': '_',
'feats': '_',
'head': 0,
'deps': defaultdict(list,
{'nsubj': [1], 'obj': [4], 'ccomp': [6]}),
'rel': 'root'},
2: {'address': 2,
'word': 'Mahathir',
'lemma': '_',
'ctag': '_',
'tag': '_',
'feats': '_',
'head': 1,
'deps': defaultdict(list, {}),
'rel': 'flat'},
4: {'address': 4,
'word': 'mereka',
'lemma': '_',
'ctag': '_',
'tag': '_',
'feats': '_',
'head': 3,
'deps': defaultdict(list, {}),
'rel': 'obj'},
5: {'address': 5,
'word': 'supaya',
'lemma': '_',
'ctag': '_',
'tag': '_',
'feats': '_',
'head': 6,
'deps': defaultdict(list, {}),
'rel': 'case'},
6: {'address': 6,
'word': 'berhenti',
'lemma': '_',
'ctag': '_',
'tag': '_',
'feats': '_',
'head': 3,
'deps': defaultdict(list,
{'case': [5], 'ccomp': [7], 'conj': [9]}),
'rel': 'ccomp'},
7: {'address': 7,
'word': 'berehat',
'lemma': '_',
'ctag': '_',
'tag': '_',
'feats': '_',
'head': 6,
'deps': defaultdict(list, {}),
'rel': 'ccomp'},
8: {'address': 8,
'word': 'dan',
'lemma': '_',
'ctag': '_',
'tag': '_',
'feats': '_',
'head': 9,
'deps': defaultdict(list, {}),
'rel': 'cc'},
9: {'address': 9,
'word': 'tidur',
'lemma': '_',
'ctag': '_',
'tag': '_',
'feats': '_',
'head': 6,
'deps': defaultdict(list,
{'cc': [8],
'advmod': [10],
'amod': [12],
'advcl': [14]}),
'rel': 'conj'},
10: {'address': 10,
'word': 'sebentar',
'lemma': '_',
'ctag': '_',
'tag': '_',
'feats': '_',
'head': 9,
'deps': defaultdict(list, {}),
'rel': 'advmod'},
11: {'address': 11,
'word': 'sekiranya',
'lemma': '_',
'ctag': '_',
'tag': '_',
'feats': '_',
'head': 12,
'deps': defaultdict(list, {}),
'rel': 'advmod'},
12: {'address': 12,
'word': 'mengantuk',
'lemma': '_',
'ctag': '_',
'tag': '_',
'feats': '_',
'head': 9,
'deps': defaultdict(list, {'advmod': [11]}),
'rel': 'amod'},
13: {'address': 13,
'word': 'ketika',
'lemma': '_',
'ctag': '_',
'tag': '_',
'feats': '_',
'head': 14,
'deps': defaultdict(list, {}),
'rel': 'case'},
14: {'address': 14,
'word': 'memandu.',
'lemma': '_',
'ctag': '_',
'tag': '_',
'feats': '_',
'head': 9,
'deps': defaultdict(list, {'case': [13]}),
'rel': 'advcl'}})
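Individual nodes can also be fetched by their address (the word index); for example, the root verb above sits at address 3:

# get_by_address returns the node dict for a given word index.
graph.get_by_address(3)['word']  # 'menasihati'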
Flatten the graph¶
[12]:
list(graph.triples())
[12]:
[(('menasihati', '_'), 'nsubj', ('Dr', '_')),
(('Dr', '_'), 'flat', ('Mahathir', '_')),
(('menasihati', '_'), 'obj', ('mereka', '_')),
(('menasihati', '_'), 'ccomp', ('berhenti', '_')),
(('berhenti', '_'), 'case', ('supaya', '_')),
(('berhenti', '_'), 'ccomp', ('berehat', '_')),
(('berhenti', '_'), 'conj', ('tidur', '_')),
(('tidur', '_'), 'cc', ('dan', '_')),
(('tidur', '_'), 'advmod', ('sebentar', '_')),
(('tidur', '_'), 'amod', ('mengantuk', '_')),
(('mengantuk', '_'), 'advmod', ('sekiranya', '_')),
(('tidur', '_'), 'advcl', ('memandu.', '_')),
(('memandu.', '_'), 'case', ('ketika', '_'))]
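To get a quick human-readable view, the triples can be printed as head -relation-> dependent lines (a small helper sketch, not part of Malaya):

# Each triple is ((head word, tag), relation, (dependent word, tag)).
for (head, _), rel, (dep, _) in graph.triples():
    print(f'{head} -{rel}-> {dep}')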
Check whether the graph contains cycles¶
[13]:
graph.contains_cycle()
[13]:
False
Generate networkx¶
Make sure you have networkx installed,
pip install networkx
[14]:
digraph = graph.to_networkx()
digraph
[14]:
<networkx.classes.multidigraph.MultiDiGraph at 0x1a875a110>
[15]:
import networkx as nx
import matplotlib.pyplot as plt
nx.draw_networkx(digraph)
plt.show()
<Figure size 640x480 with 1 Axes>
[16]:
digraph.edges()
[16]:
OutMultiEdgeDataView([(1, 3), (2, 1), (4, 3), (5, 6), (6, 3), (7, 6), (8, 9), (9, 6), (10, 9), (11, 12), (12, 9), (13, 14), (14, 9)])
[17]:
digraph.nodes()
[17]:
NodeView((1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
[18]:
labels = {i:graph.get_by_address(i)['word'] for i in digraph.nodes()}
labels
[18]:
{1: 'Dr',
2: 'Mahathir',
3: 'menasihati',
4: 'mereka',
5: 'supaya',
6: 'berhenti',
7: 'berehat',
8: 'dan',
9: 'tidur',
10: 'sebentar',
11: 'sekiranya',
12: 'mengantuk',
13: 'ketika',
14: 'memandu.'}
[19]:
plt.figure(figsize=(15,5))
nx.draw_networkx(digraph,labels=labels)
plt.show()

Vectorize¶
Let's say you want to visualize the words in a lower dimension; you can use model.vectorize,
def vectorize(self, string: str):
"""
vectorize a string.
Parameters
----------
string: str
Returns
-------
result: np.array
"""
[6]:
r = quantized_model.vectorize(string)
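Judging from the cells below, the result appears to be a list of (token, vector) pairs, one per word; a quick sanity check under that assumption:

# Inspect the first pair: the token and the length of its vector.
token, vector = r[0]
print(token, len(vector))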
[7]:
x = [i[0] for i in r]  # tokens
y = [i[1] for i in r]  # per-token vectors
[8]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
tsne = TSNE().fit_transform(y)
tsne.shape
[8]:
(14, 2)
[9]:
plt.figure(figsize = (7, 7))
plt.scatter(tsne[:, 0], tsne[:, 1])
labels = x
for label, x, y in zip(
labels, tsne[:, 0], tsne[:, 1]
):
label = (
'%s, %.3f' % (label[0], label[1])
if isinstance(label, list)
else label
)
plt.annotate(
label,
xy = (x, y),
xytext = (0, 0),
textcoords = 'offset points',
)

[ ]: