Syllable tokenizer#
This tutorial is available as an IPython notebook at Malaya/example/tokenizer-syllable.
This module is only suitable for standard language structure, so it is not safe to use it on local (colloquial) language structure.
[1]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'
[2]:
%%time
import malaya
/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
CPU times: user 3.2 s, sys: 2.88 s, total: 6.08 s
Wall time: 2.56 s
/home/husein/ssd3/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397
self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/husein/ssd3/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927
self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
Load rules-based syllable tokenizer#
def rules(**kwargs):
"""
Load rules based syllable tokenizer.
originally from https://github.com/fahadh4ilyas/syllable_splitter/blob/master/SyllableSplitter.py
- improved `cuaca` double vocal `ua` based on https://en.wikipedia.org/wiki/Comparison_of_Indonesian_and_Standard_Malay#Syllabification
- improved `rans` double consonant `ns` based on https://www.semanticscholar.org/paper/Syllabification-algorithm-based-on-syllable-rules-Musa-Kadir/a819f255f066ae0fd7a30b3534de41da37d04ea1
- improved `au` and `ai` double vocal.
Returns
-------
result: malaya.syllable.Tokenizer class
"""
[3]:
tokenizer = malaya.syllable.rules()
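The docstring above mentions improved handling of the `au` and `ai` double vocals. A quick sanity check is sketched below; the word choices are our own and the outputs are intentionally not asserted here.

# hypothetical sanity check for the `au` / `ai` double-vocal rules;
# word choices are ours, outputs are not asserted.
for word in ['pulau', 'pantai', 'kaunter']:
    print(word, '->', tokenizer.tokenize(word))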
Tokenize#
def tokenize(self, string):
"""
Tokenize string into multiple strings using syllable patterns.
Example from https://www.semanticscholar.org/paper/Syllabification-algorithm-based-on-syllable-rules-Musa-Kadir/a819f255f066ae0fd7a30b3534de41da37d04ea1/figure/0,
'cuaca' -> ['cua', 'ca']
'insurans' -> ['in', 'su', 'rans']
'praktikal' -> ['prak', 'ti', 'kal']
'strategi' -> ['stra', 'te', 'gi']
'ayam' -> ['a', 'yam']
'anda' -> ['an', 'da']
'hantu' -> ['han', 'tu']
Parameters
----------
string : str
Returns
-------
result: List[str]
"""
[4]:
tokenizer.tokenize('angan-angan')
/home/husein/ssd3/malaya/malaya/model/syllable.py:46: FutureWarning: Possible nested set at position 3
or re.findall(_expressions['ic'], word.lower())
[4]:
['a', 'ngan', '-', 'a', 'ngan']
[5]:
tokenizer.tokenize('cuaca')
[5]:
['cua', 'ca']
[6]:
tokenizer.tokenize('hidup')
[6]:
['hi', 'dup']
[7]:
tokenizer.tokenize('insuran')
[7]:
['in', 'su', 'ran']
[8]:
tokenizer.tokenize('insurans')
[8]:
['in', 'su', 'rans']
[9]:
tokenizer.tokenize('ayam')
[9]:
['a', 'yam']
[10]:
tokenizer.tokenize('strategi')
[10]:
['stra', 'te', 'gi']
[11]:
tokenizer.tokenize('hantu')
[11]:
['han', 'tu']
[12]:
tokenizer.tokenize('hello')
[12]:
['hel', 'lo']
Better performance#
Split the string into words first, then tokenize each word.
[13]:
string = 'sememang-memangnya kau sakai siot'
[14]:
results = []
for w in string.split():
results.extend(tokenizer.tokenize(w))
results
[14]:
['se', 'me', 'mang', '-', 'me', 'mang', 'nya', 'kau', 'sa', 'kai', 'siot']
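A minimal helper that wraps this split-then-tokenize pattern is sketched below; `tokenize_words` is our own name for illustration, not part of the Malaya API.

def tokenize_words(string, tokenizer):
    # hypothetical helper: split on whitespace first, then syllabify each word
    results = []
    for w in string.split():
        results.extend(tokenizer.tokenize(w))
    return results

tokenize_words('sememang-memangnya kau sakai siot', tokenizer)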
List available HuggingFace models#
We also provide a syllable tokenizer based on deep learning, trained on the DBP dataset.
[15]:
malaya.syllable.available_huggingface
[15]:
{'mesolitica/syllable-lstm': {'Size (MB)': 35.2,
'hidden size': 512,
'CER': 0.011996584781229728,
'WER': 0.06915983606557377}}
Load deep learning model#
[16]:
model = malaya.syllable.huggingface()
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Tokenize#
def tokenize(self, string, beam_search: bool = False):
"""
Tokenize string into multiple strings using deep learning.
Parameters
----------
string : str
beam_search : bool, (optional=False)
If True, use beam search decoder, else use greedy decoder.
Returns
-------
result: List[str]
"""
[17]:
model.tokenize('angan-angan')
[17]:
['a', 'ngan', 'a', 'ngan']
[18]:
model.tokenize('insuran')
[18]:
['in', 'su', 'ran']
[19]:
model.tokenize('insurans')
[19]:
['in', 'sur', 'ans']
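The deep model's `tokenize` method also accepts a `beam_search` flag (see the docstring above). A quick comparison on the same word is sketched below; the outputs are not asserted here.

# compare greedy vs beam search decoding; outputs are not asserted.
print('greedy:', model.tokenize('insurans'))
print('beam  :', model.tokenize('insurans', beam_search=True))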
Harder example#
Test set from DBP at https://huggingface.co/datasets/mesolitica/syllable/raw/main/test-syllable.json
[20]:
import requests
import json
r = requests.get('https://huggingface.co/datasets/mesolitica/syllable/raw/main/test-syllable.json')
test_set = r.json()
[21]:
len(test_set)
[21]:
1952
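Based on how the test set is used below, each entry appears to be a pair of the original word and its dot-joined syllabification. A quick peek (output not shown here) confirms the structure.

# each entry looks like [word, 'syl.la.ble'] based on the usage below
test_set[:3]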
[22]:
def calculate_wer(actual, hyp):
"""
Calculate WER using `python-Levenshtein`.
"""
import Levenshtein as Lev
b = set(actual.split() + hyp.split())
word2char = dict(zip(b, range(len(b))))
w1 = [chr(word2char[w]) for w in actual.split()]
w2 = [chr(word2char[w]) for w in hyp.split()]
return Lev.distance(''.join(w1), ''.join(w2)) / len(actual.split())
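`calculate_wer` maps every distinct whitespace-separated token to a unique character, then computes a character-level Levenshtein distance normalised by the number of reference tokens. A tiny self-contained check:

# 'sa tu' vs 'sa dua': one of two reference tokens differs, so WER = 0.5
assert calculate_wer('sa tu', 'sa dua') == 0.5
assert calculate_wer('sa tu', 'sa tu') == 0.0

Note that the dot-joined syllable strings used below contain no whitespace, so each comparison effectively reduces to an exact match on the whole syllabification.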
[23]:
wers = []
for test in test_set:
t = tokenizer.tokenize(test[0])
t = [t_ for t_ in t if t_ not in ['-']]
wer = calculate_wer(test[1], '.'.join(t))
wers.append(wer)
sum(wers) / len(wers)
[23]:
0.09016393442622951
[24]:
for test in test_set[:50]:
print('original:', test[0])
print('actual:', test[1].split('.'))
t = tokenizer.tokenize(test[0])
print('predicted:', t)
print()
original: mengilukan
actual: ['me', 'ngi', 'lu', 'kan']
predicted: ['me', 'ngi', 'lu', 'kan']
original: menjongkok
actual: ['men', 'jong', 'kok']
predicted: ['men', 'jong', 'kok']
original: tergabas
actual: ['ter', 'ga', 'bas']
predicted: ['ter', 'ga', 'bas']
original: perunding
actual: ['pe', 'run', 'ding']
predicted: ['pe', 'run', 'ding']
original: kemahalan
actual: ['ke', 'ma', 'ha', 'lan']
predicted: ['ke', 'ma', 'ha', 'lan']
original: renggang
actual: ['reng', 'gang']
predicted: ['reng', 'gang']
original: bersuci
actual: ['ber', 'su', 'ci']
predicted: ['ber', 'su', 'ci']
original: jelebat
actual: ['je', 'le', 'bat']
predicted: ['je', 'le', 'bat']
original: rekod
actual: ['re', 'kod']
predicted: ['re', 'kod']
original: amang
actual: ['a', 'mang']
predicted: ['a', 'mang']
original: aromaterapi
actual: ['a', 'ro', 'ma', 'te', 'ra', 'pi']
predicted: ['a', 'ro', 'ma', 'te', 'ra', 'pi']
original: pengkompaunan
actual: ['peng', 'kom', 'pau', 'nan']
predicted: ['peng', 'kom', 'pau', 'nan']
original: payah
actual: ['pa', 'yah']
predicted: ['pa', 'yah']
original: menghargai
actual: ['meng', 'har', 'ga', 'i']
predicted: ['meng', 'har', 'gai']
original: keterpaksaan
actual: ['ke', 'ter', 'pak', 'sa', 'an']
predicted: ['ke', 'ter', 'pak', 'sa', 'an']
original: kerempagi
actual: ['ke', 'rem', 'pa', 'gi']
predicted: ['ke', 'rem', 'pa', 'gi']
original: pengancaman
actual: ['pe', 'ngan', 'ca', 'man']
predicted: ['pe', 'ngan', 'ca', 'man']
original: kedwilogaman
actual: ['ke', 'dwi', 'lo', 'ga', 'man']
predicted: ['ked', 'wi', 'lo', 'ga', 'man']
original: copeng
actual: ['co', 'peng']
predicted: ['co', 'peng']
original: antienzim
actual: ['an', 'ti', 'en', 'zim']
predicted: ['an', 'ti', 'en', 'zim']
original: angkar
actual: ['ang', 'kar']
predicted: ['ang', 'kar']
original: menjembak
actual: ['men', 'jem', 'bak']
predicted: ['men', 'jem', 'bak']
original: tanggah
actual: ['tang', 'gah']
predicted: ['tang', 'gah']
original: berjujuk
actual: ['ber', 'ju', 'juk']
predicted: ['ber', 'ju', 'juk']
original: nestapa
actual: ['nes', 'ta', 'pa']
predicted: ['nes', 'ta', 'pa']
original: engku
actual: ['eng', 'ku']
predicted: ['eng', 'ku']
original: undang-undang
actual: ['un', 'dang', 'un', 'dang']
predicted: ['un', 'dang', '-', 'un', 'dang']
original: tiket
actual: ['ti', 'ket']
predicted: ['ti', 'ket']
original: janin
actual: ['ja', 'nin']
predicted: ['ja', 'nin']
original: pakuk
actual: ['pa', 'kuk']
predicted: ['pa', 'kuk']
original: betika
actual: ['be', 'ti', 'ka']
predicted: ['be', 'ti', 'ka']
original: nangoi
actual: ['na', 'ngoi']
predicted: ['na', 'ngo', 'i']
original: mulato
actual: ['mu', 'la', 'to']
predicted: ['mu', 'la', 'to']
original: peruasan
actual: ['pe', 'rua', 'san']
predicted: ['pe', 'rua', 'san']
original: terkajang
actual: ['ter', 'ka', 'jang']
predicted: ['ter', 'ka', 'jang']
original: menjanda
actual: ['men', 'jan', 'da']
predicted: ['men', 'jan', 'da']
original: menautkan
actual: ['me', 'naut', 'kan']
predicted: ['me', 'naut', 'kan']
original: khalayak
actual: ['kha', 'la', 'yak']
predicted: ['kha', 'la', 'yak']
original: lena
actual: ['le', 'na']
predicted: ['le', 'na']
original: kesurupan
actual: ['ke', 'su', 'ru', 'pan']
predicted: ['ke', 'su', 'ru', 'pan']
original: meneriak-neriakkan
actual: ['me', 'ne', 'riak', 'ne', 'riak', 'kan']
predicted: ['me', 'ne', 'ri', 'ak', '-', 'ne', 'ri', 'ak', 'kan']
original: bergumpal
actual: ['ber', 'gum', 'pal']
predicted: ['ber', 'gum', 'pal']
original: rodat
actual: ['ro', 'dat']
predicted: ['ro', 'dat']
original: sepukal
actual: ['se', 'pu', 'kal']
predicted: ['se', 'pu', 'kal']
original: kerani
actual: ['ke', 'ra', 'ni']
predicted: ['ke', 'ra', 'ni']
original: mewahyukan
actual: ['me', 'wah', 'yu', 'kan']
predicted: ['me', 'wah', 'yu', 'kan']
original: berprestij
actual: ['ber', 'pres', 'tij']
predicted: ['ber', 'pres', 'tij']
original: dingin
actual: ['di', 'ngin']
predicted: ['di', 'ngin']
original: lipas
actual: ['li', 'pas']
predicted: ['li', 'pas']
original: berdingkit-dingkit
actual: ['ber', 'ding', 'kit', 'ding', 'kit']
predicted: ['ber', 'ding', 'kit', '-', 'ding', 'kit']
[26]:
wers = []
for test in test_set:
t = model.tokenize(test[0])
t = [t_ for t_ in t if t_ not in ['-']]
wer = calculate_wer(test[1], '.'.join(t))
wers.append(wer)
sum(wers) / len(wers)
[26]:
0.0630122950819672
[25]:
for test in test_set[:50]:
print('original:', test[0])
print('actual:', test[1].split('.'))
t = model.tokenize(test[0])
print('predicted:', t)
print()
original: mengilukan
actual: ['me', 'ngi', 'lu', 'kan']
predicted: ['me', 'ngi', 'lu', 'kan']
original: menjongkok
actual: ['men', 'jong', 'kok']
predicted: ['men', 'jong', 'kok']
original: tergabas
actual: ['ter', 'ga', 'bas']
predicted: ['ter', 'ga', 'bas']
original: perunding
actual: ['pe', 'run', 'ding']
predicted: ['pe', 'run', 'ding']
original: kemahalan
actual: ['ke', 'ma', 'ha', 'lan']
predicted: ['ke', 'ma', 'ha', 'lan']
original: renggang
actual: ['reng', 'gang']
predicted: ['reng', 'gang']
original: bersuci
actual: ['ber', 'su', 'ci']
predicted: ['ber', 'su', 'ci']
original: jelebat
actual: ['je', 'le', 'bat']
predicted: ['je', 'le', 'bat']
original: rekod
actual: ['re', 'kod']
predicted: ['re', 'kod']
original: amang
actual: ['a', 'mang']
predicted: ['a', 'mang']
original: aromaterapi
actual: ['a', 'ro', 'ma', 'te', 'ra', 'pi']
predicted: ['a', 'ro', 'ma', 'te', 'ra', 'pi']
original: pengkompaunan
actual: ['peng', 'kom', 'pau', 'nan']
predicted: ['peng', 'kom', 'pau', 'nan']
original: payah
actual: ['pa', 'yah']
predicted: ['pa', 'yah']
original: menghargai
actual: ['meng', 'har', 'ga', 'i']
predicted: ['meng', 'har', 'ga', 'i']
original: keterpaksaan
actual: ['ke', 'ter', 'pak', 'sa', 'an']
predicted: ['ke', 'ter', 'pak', 'sa', 'an']
original: kerempagi
actual: ['ke', 'rem', 'pa', 'gi']
predicted: ['ke', 'rem', 'pa', 'gi']
original: pengancaman
actual: ['pe', 'ngan', 'ca', 'man']
predicted: ['pe', 'ngan', 'ca', 'man']
original: kedwilogaman
actual: ['ke', 'dwi', 'lo', 'ga', 'man']
predicted: ['ke', 'dwi', 'lo', 'ga', 'man']
original: copeng
actual: ['co', 'peng']
predicted: ['co', 'peng']
original: antienzim
actual: ['an', 'ti', 'en', 'zim']
predicted: ['an', 'tien', 'zim']
original: angkar
actual: ['ang', 'kar']
predicted: ['ang', 'kar']
original: menjembak
actual: ['men', 'jem', 'bak']
predicted: ['men', 'jem', 'bak']
original: tanggah
actual: ['tang', 'gah']
predicted: ['tang', 'gah']
original: berjujuk
actual: ['ber', 'ju', 'juk']
predicted: ['ber', 'ju', 'juk']
original: nestapa
actual: ['nes', 'ta', 'pa']
predicted: ['nes', 'ta', 'pa']
original: engku
actual: ['eng', 'ku']
predicted: ['eng', 'ku']
original: undang-undang
actual: ['un', 'dang', 'un', 'dang']
predicted: ['un', 'dang', 'un', 'dang']
original: tiket
actual: ['ti', 'ket']
predicted: ['ti', 'ket']
original: janin
actual: ['ja', 'nin']
predicted: ['ja', 'nin']
original: pakuk
actual: ['pa', 'kuk']
predicted: ['pa', 'kuk']
original: betika
actual: ['be', 'ti', 'ka']
predicted: ['be', 'ti', 'ka']
original: nangoi
actual: ['na', 'ngoi']
predicted: ['na', 'ngo', 'i']
original: mulato
actual: ['mu', 'la', 'to']
predicted: ['mu', 'la', 'to']
original: peruasan
actual: ['pe', 'rua', 'san']
predicted: ['pe', 'rua', 'san']
original: terkajang
actual: ['ter', 'ka', 'jang']
predicted: ['ter', 'ka', 'jang']
original: menjanda
actual: ['men', 'jan', 'da']
predicted: ['men', 'jan', 'da']
original: menautkan
actual: ['me', 'naut', 'kan']
predicted: ['me', 'naut', 'kan']
original: khalayak
actual: ['kha', 'la', 'yak']
predicted: ['kha', 'la', 'yak']
original: lena
actual: ['le', 'na']
predicted: ['le', 'na']
original: kesurupan
actual: ['ke', 'su', 'ru', 'pan']
predicted: ['ke', 'su', 'ru', 'pan']
original: meneriak-neriakkan
actual: ['me', 'ne', 'riak', 'ne', 'riak', 'kan']
predicted: ['me', 'ne', 'riak', 'ne', 'riak', 'kan']
original: bergumpal
actual: ['ber', 'gum', 'pal']
predicted: ['ber', 'gum', 'pal']
original: rodat
actual: ['ro', 'dat']
predicted: ['ro', 'dat']
original: sepukal
actual: ['se', 'pu', 'kal']
predicted: ['se', 'pu', 'kal']
original: kerani
actual: ['ke', 'ra', 'ni']
predicted: ['ke', 'ra', 'ni']
original: mewahyukan
actual: ['me', 'wah', 'yu', 'kan']
predicted: ['me', 'wah', 'yu', 'kan']
original: berprestij
actual: ['ber', 'pres', 'tij']
predicted: ['ber', 'pres', 'tij']
original: dingin
actual: ['di', 'ngin']
predicted: ['di', 'ngin']
original: lipas
actual: ['li', 'pas']
predicted: ['li', 'pas']
original: berdingkit-dingkit
actual: ['ber', 'ding', 'kit', 'ding', 'kit']
predicted: ['ber', 'ding', 'kit', 'ding', 'kit']