Language Detection#
This tutorial is available as an IPython notebook at Malaya/example/language-detection.
This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.
[1]:
%%time
import malaya
import fasttext
CPU times: user 3.13 s, sys: 2.83 s, total: 5.96 s
Wall time: 2.71 s
/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397
self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927
self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
labels supported#
Default labels and evaluation metrics for each language detection model.
[2]:
for k, v in malaya.language_detection.metrics.items():
    print(k, v)
deep-model
             precision    recall  f1-score   support
         eng   0.96760   0.97401   0.97080    553739
         ind   0.97635   0.96131   0.96877    576059
       malay   0.96985   0.98498   0.97736   1800649
    manglish   0.98036   0.96569   0.97297    181442
       other   0.99641   0.99627   0.99634   1428083
       rojak   0.94221   0.84302   0.88986    189678
    accuracy                       0.97779   4729650
   macro avg   0.97213   0.95421   0.96268   4729650
weighted avg   0.97769   0.97779   0.97760   4729650
mesolitica/fasttext-language-detection-v1
             precision    recall  f1-score   support
         eng   0.94014   0.96750   0.95362    553739
         ind   0.97290   0.97316   0.97303    576059
       malay   0.98674   0.95262   0.96938   1800649
    manglish   0.96595   0.98417   0.97498    181442
       other   0.98454   0.99698   0.99072   1428083
       rojak   0.81149   0.91650   0.86080    189678
    accuracy                       0.97002   4729650
   macro avg   0.94363   0.96515   0.95375   4729650
weighted avg   0.97111   0.97002   0.97028   4729650
mesolitica/fasttext-language-detection-v2
                       precision    recall  f1-score   support
         local-english   0.88328   0.87926   0.88127     50429
           local-malay   0.93159   0.92648   0.92903     59877
        local-mandarin   0.62000   0.95044   0.75045     49820
              manglish   0.98494   0.98157   0.98325     49648
                 other   0.99168   0.92850   0.95905     64350
socialmedia-indonesian   0.97626   0.95390   0.96495     75140
      standard-english   0.86918   0.88018   0.87465     49776
   standard-indonesian   0.99695   0.99713   0.99704     50148
        standard-malay   0.92292   0.94851   0.93554     50049
     standard-mandarin   0.90855   0.53587   0.67413     53709
              accuracy                       0.89953    552946
             macro avg   0.90853   0.89818   0.89494    552946
          weighted avg   0.91425   0.89953   0.89893    552946
mesolitica/fasttext-language-detection-ms-id
                       precision    recall  f1-score   support
           local-malay   0.95063   0.93858   0.94457    199961
                 other   0.97145   0.98889   0.98009    125920
socialmedia-indonesian   0.97923   0.96303   0.97106    213486
   standard-indonesian   0.99119   0.99610   0.99364    149055
        standard-malay   0.93743   0.95669   0.94696    149336
              accuracy                       0.96584    837758
             macro avg   0.96599   0.96866   0.96727    837758
          weighted avg   0.96591   0.96584   0.96582    837758
mesolitica/fasttext-language-detection-en
                 precision    recall  f1-score   support
   local-english   0.88991   0.89457   0.89223    149823
        manglish   0.98619   0.98479   0.98549    149535
           other   0.99439   0.99268   0.99354    140651
standard-english   0.89162   0.88967   0.89064    150703
        accuracy                       0.93952    590712
       macro avg   0.94053   0.94043   0.94047    590712
    weighted avg   0.93960   0.93952   0.93955    590712
Different models support different languages.
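The `macro avg` and `weighted avg` rows in the reports above can be recomputed from the per-label rows. As a minimal sketch, the snippet below does this for the F1 column of `mesolitica/fasttext-language-detection-en`, using the figures printed above. The helper names `macro_average` and `weighted_average` are illustrative and are not part of Malaya.

```python
# Per-label (f1-score, support) pairs copied from the
# mesolitica/fasttext-language-detection-en report.
report = {
    'local-english': (0.89223, 149823),
    'manglish': (0.98549, 149535),
    'other': (0.99354, 140651),
    'standard-english': (0.89064, 150703),
}


def macro_average(rows):
    """Unweighted mean of the per-label scores (hypothetical helper)."""
    scores = [score for score, _ in rows.values()]
    return sum(scores) / len(scores)


def weighted_average(rows):
    """Mean of the per-label scores, weighted by support (hypothetical helper)."""
    total = sum(support for _, support in rows.values())
    return sum(score * support for score, support in rows.values()) / total


macro_f1 = macro_average(report)
weighted_f1 = weighted_average(report)
print(f'macro avg f1:    {macro_f1:.4f}')
print(f'weighted avg f1: {weighted_f1:.4f}')
```

The macro average treats every label equally, so a weak label with little support lowers it more than it lowers the weighted average.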
List available language detection models#
[3]:
malaya.language_detection.available_fasttext
[3]:
{'mesolitica/fasttext-language-detection-v1': {'Size (MB)': 353,
'Quantized Size (MB)': 31.1,
'dim': 16,
'Label': {0: 'eng',
1: 'ind',
2: 'malay',
3: 'manglish',
4: 'other',
5: 'rojak'}},
'mesolitica/fasttext-language-detection-v2': {'Size (MB)': 1840,
'Quantized Size (MB)': 227,
'dim': 16,
'Label': {0: 'standard-english',
1: 'local-english',
2: 'manglish',
3: 'standard-indonesian',
4: 'socialmedia-indonesian',
5: 'standard-malay',
6: 'local-malay',
7: 'standard-mandarin',
8: 'local-mandarin',
9: 'other'}},
'mesolitica/fasttext-language-detection-ms-id': {'Size (MB)': 537,
'Quantized Size (MB)': 62.5,
'dim': 16,
'Label': {0: 'standard-indonesian',
1: 'socialmedia-indonesian',
2: 'standard-malay',
3: 'local-malay',
4: 'other'}},
'mesolitica/fasttext-language-detection-bahasa-en': {'Size (MB)': 537,
'Quantized Size (MB)': 62.5,
'dim': 16,
'Label': {0: 'bahasa', 1: 'english', 2: 'other'}},
'mesolitica/fasttext-language-detection-en': {'Size (MB)': 383,
'Quantized Size (MB)': 42.3,
'dim': 16,
'Label': {0: 'standard-english',
1: 'local-english',
2: 'manglish',
3: 'other'}}}
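Because each model exposes a different label set, it can help to filter the catalogue by the labels a task needs. The following is a rough sketch: the label lists are copied by hand from the output above so that it runs without Malaya installed, and `models_supporting` is a hypothetical helper, not a Malaya function.

```python
# Label lists copied from `malaya.language_detection.available_fasttext`.
catalog = {
    'mesolitica/fasttext-language-detection-v1': [
        'eng', 'ind', 'malay', 'manglish', 'other', 'rojak',
    ],
    'mesolitica/fasttext-language-detection-v2': [
        'standard-english', 'local-english', 'manglish',
        'standard-indonesian', 'socialmedia-indonesian',
        'standard-malay', 'local-malay',
        'standard-mandarin', 'local-mandarin', 'other',
    ],
    'mesolitica/fasttext-language-detection-ms-id': [
        'standard-indonesian', 'socialmedia-indonesian',
        'standard-malay', 'local-malay', 'other',
    ],
    'mesolitica/fasttext-language-detection-bahasa-en': [
        'bahasa', 'english', 'other',
    ],
    'mesolitica/fasttext-language-detection-en': [
        'standard-english', 'local-english', 'manglish', 'other',
    ],
}


def models_supporting(required, models):
    """Return model names whose labels include every required label."""
    required = set(required)
    return [name for name, labels in models.items() if required <= set(labels)]


malay_vs_indonesian = models_supporting(
    {'standard-malay', 'socialmedia-indonesian'}, catalog
)
manglish_and_local_malay = models_supporting({'manglish', 'local-malay'}, catalog)
print(malay_vs_indonesian)
print(manglish_and_local_malay)
```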
[4]:
chinese_text = '今天是6月18号,也是Muiriel的生日!'
english_text = 'i totally love it man'
indon_text = 'menjabat saleh perombakan menjabat periode komisi energi fraksi partai pengurus partai periode periode partai terpilih periode menjabat komisi perdagangan investasi persatuan periode'
malay_text = 'beliau berkata program Inisitif Peduli Rakyat (IPR) yang diperkenalkan oleh kerajaan negeri Selangor lebih besar sumbangannya'
socialmedia_malay_text = 'nti aku tengok dulu tiket dari kl pukul berapa ada nahh'
socialmedia_indon_text = 'saking kangen papanya pas vc anakku nangis'
rojak_text = 'jadi aku tadi bikin ini gengs dan dijual haha salad only k dan haha drinks only k'
manglish_text = 'power lah even shopback come to edmw riao'
Load Fast-text model#
Make sure fastText is already installed. If not, install it with:
pip install fasttext
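To check for the package from Python before loading a model, one option is the standard library's `importlib.util.find_spec`. This is only a sketch, and `is_installed` is an illustrative helper name rather than anything Malaya provides.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed('fasttext'):
    print('fasttext is not installed, run: pip install fasttext')
```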
def fasttext(quantized: bool = True, **kwargs):
    """
    Load Fasttext language detection model.
    Original size is 353MB, Quantized size 31.1MB.

    Parameters
    ----------
    quantized: bool, optional (default=True)
        if True, load quantized fasttext model. Else, load original fasttext model.

    Returns
    -------
    result : malaya.model.ml.LanguageDetection class
    """
In this example, I compare the Malaya model with the pretrained fastText language identification model from Facebook: https://fasttext.cc/docs/en/language-identification.html
Download the pretrained model:
wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz
[5]:
# !wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz
[22]:
model = fasttext.load_model('lid.176.ftz')
fast_text = malaya.language_detection.fasttext()
Language detection in Malaya does not try to cover every language in the world; it focuses on hyperlocal languages.
[7]:
model.predict(['suka makan ayam dan daging'])
[7]:
([['__label__id']], [array([0.6334154], dtype=float32)])
[8]:
fast_text.predict_proba(['suka makan ayam dan daging'])
[8]:
[{'standard-english': 0.0,
'local-english': 0.0,
'manglish': 0.0,
'standard-indonesian': 0.0,
'socialmedia-indonesian': 0.0,
'standard-malay': 0.50445783,
'local-malay': 0.0,
'standard-mandarin': 0.0,
'local-mandarin': 0.0,
'other': 0.0}]
[9]:
model.predict(malay_text)
[9]:
(('__label__ms',), array([0.57101035]))
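The pretrained Facebook model returns each label with a `__label__` prefix followed by a language code, as in the output above. A small sketch for turning that into `(code, probability)` pairs follows. `parse_prediction` is a hypothetical helper, and a plain tuple and list stand in for the values returned by `model.predict` so the snippet runs without the model file.

```python
PREFIX = '__label__'


def parse_prediction(labels, probabilities):
    """Pair each label (prefix removed) with its probability as a float."""
    parsed = []
    for label, probability in zip(labels, probabilities):
        code = label[len(PREFIX):] if label.startswith(PREFIX) else label
        parsed.append((code, float(probability)))
    return parsed


# Values copied from the `model.predict(malay_text)` output.
labels, probabilities = ('__label__ms',), [0.57101035]
parsed = parse_prediction(labels, probabilities)
print(parsed)
```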
[10]:
fast_text.predict_proba([malay_text])
[10]:
[{'standard-english': 0.0,
'local-english': 0.0,
'manglish': 0.0,
'standard-indonesian': 0.0,
'socialmedia-indonesian': 0.0,
'standard-malay': 0.9099521,
'local-malay': 0.0,
'standard-mandarin': 0.0,
'local-mandarin': 0.0,
'other': 0.0}]
[11]:
model.predict(socialmedia_malay_text)
[11]:
(('__label__id',), array([0.7870034]))
[12]:
fast_text.predict_proba([socialmedia_malay_text])
[12]:
[{'standard-english': 0.0,
'local-english': 0.0,
'manglish': 0.0,
'standard-indonesian': 0.0,
'socialmedia-indonesian': 0.0,
'standard-malay': 0.0,
'local-malay': 0.9976433,
'standard-mandarin': 0.0,
'local-mandarin': 0.0,
'other': 0.0}]
[13]:
model.predict(socialmedia_indon_text)
[13]:
(('__label__fr',), array([0.2912012]))
[14]:
fast_text.predict_proba([socialmedia_indon_text])
[14]:
[{'standard-english': 0.0,
'local-english': 0.0,
'manglish': 0.0,
'standard-indonesian': 0.0,
'socialmedia-indonesian': 1.00003,
'standard-malay': 0.0,
'local-malay': 0.0,
'standard-mandarin': 0.0,
'local-mandarin': 0.0,
'other': 0.0}]
[15]:
model.predict(rojak_text)
[15]:
(('__label__id',), array([0.87948251]))
[16]:
fast_text.predict_proba([rojak_text])
[16]:
[{'standard-english': 0.0,
'local-english': 0.0,
'manglish': 0.0,
'standard-indonesian': 0.0,
'socialmedia-indonesian': 0.0,
'standard-malay': 0.0,
'local-malay': 0.9569701,
'standard-mandarin': 0.0,
'local-mandarin': 0.0,
'other': 0.0}]
[17]:
model.predict(manglish_text)
[17]:
(('__label__en',), array([0.89707506]))
[18]:
fast_text.predict_proba([manglish_text])
[18]:
[{'standard-english': 0.0,
'local-english': 0.0,
'manglish': 0.99997073,
'standard-indonesian': 0.0,
'socialmedia-indonesian': 0.0,
'standard-malay': 0.0,
'local-malay': 0.0,
'standard-mandarin': 0.0,
'local-mandarin': 0.0,
'other': 0.0}]
[19]:
model.predict(chinese_text)
[19]:
(('__label__zh',), array([0.97311586]))
[20]:
fast_text.predict_proba([chinese_text])
[20]:
[{'standard-english': 0.0,
'local-english': 0.0,
'manglish': 0.0,
'standard-indonesian': 0.0,
'socialmedia-indonesian': 0.0,
'standard-malay': 0.0,
'local-malay': 0.0,
'standard-mandarin': 0.0,
'local-mandarin': 0.5823944,
'other': 0.0}]
[21]:
fast_text.predict_proba([indon_text,malay_text])
[21]:
[{'standard-english': 0.0,
'local-english': 0.0,
'manglish': 0.0,
'standard-indonesian': 0.9755073,
'socialmedia-indonesian': 0.0,
'standard-malay': 0.0,
'local-malay': 0.0,
'standard-mandarin': 0.0,
'local-mandarin': 0.0,
'other': 0.0},
{'standard-english': 0.0,
'local-english': 0.0,
'manglish': 0.0,
'standard-indonesian': 0.0,
'socialmedia-indonesian': 0.0,
'standard-malay': 0.9099521,
'local-malay': 0.0,
'standard-mandarin': 0.0,
'local-mandarin': 0.0,
'other': 0.0}]
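`predict_proba` returns one dictionary of label scores per input string, so picking a single label per text is a matter of taking the highest-scoring key. The sketch below works on scores copied from the batch output above, with most zero-valued labels left out for brevity. The `top_label` helper, its `threshold` parameter, and the `None` fallback are assumptions made for illustration, not Malaya behaviour.

```python
# Scores copied from `fast_text.predict_proba([indon_text, malay_text])`;
# most zero-valued labels are omitted here.
batch_scores = [
    {'standard-indonesian': 0.9755073, 'standard-malay': 0.0, 'other': 0.0},
    {'standard-indonesian': 0.0, 'standard-malay': 0.9099521, 'other': 0.0},
]


def top_label(scores, threshold=0.5):
    """Return the highest-scoring label, or None if it is below threshold."""
    label = max(scores, key=scores.get)
    return label if scores[label] >= threshold else None


results = [top_label(scores) for scores in batch_scores]
print(results)
```

Raising `threshold` makes the helper return `None` for low-confidence predictions instead of a weak guess.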