Language Detection

This tutorial is available as an IPython notebook at Malaya/example/language-detection.

This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.

[1]:
%%time
import malaya
import fasttext
CPU times: user 3.13 s, sys: 2.83 s, total: 5.96 s
Wall time: 2.71 s
/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))

labels supported#

Default labels for the language detection module, shown together with the evaluation metrics of each model.

[2]:
for k, v in malaya.language_detection.metrics.items():
    print(k, v)
deep-model
              precision    recall  f1-score   support

         eng    0.96760   0.97401   0.97080    553739
         ind    0.97635   0.96131   0.96877    576059
       malay    0.96985   0.98498   0.97736   1800649
    manglish    0.98036   0.96569   0.97297    181442
       other    0.99641   0.99627   0.99634   1428083
       rojak    0.94221   0.84302   0.88986    189678

    accuracy                        0.97779   4729650
   macro avg    0.97213   0.95421   0.96268   4729650
weighted avg    0.97769   0.97779   0.97760   4729650

mesolitica/fasttext-language-detection-v1
              precision    recall  f1-score   support

         eng    0.94014   0.96750   0.95362    553739
         ind    0.97290   0.97316   0.97303    576059
       malay    0.98674   0.95262   0.96938   1800649
    manglish    0.96595   0.98417   0.97498    181442
       other    0.98454   0.99698   0.99072   1428083
       rojak    0.81149   0.91650   0.86080    189678

    accuracy                        0.97002   4729650
   macro avg    0.94363   0.96515   0.95375   4729650
weighted avg    0.97111   0.97002   0.97028   4729650

mesolitica/fasttext-language-detection-v2
                        precision    recall  f1-score   support

         local-english    0.88328   0.87926   0.88127     50429
           local-malay    0.93159   0.92648   0.92903     59877
        local-mandarin    0.62000   0.95044   0.75045     49820
              manglish    0.98494   0.98157   0.98325     49648
                 other    0.99168   0.92850   0.95905     64350
socialmedia-indonesian    0.97626   0.95390   0.96495     75140
      standard-english    0.86918   0.88018   0.87465     49776
   standard-indonesian    0.99695   0.99713   0.99704     50148
        standard-malay    0.92292   0.94851   0.93554     50049
     standard-mandarin    0.90855   0.53587   0.67413     53709

              accuracy                        0.89953    552946
             macro avg    0.90853   0.89818   0.89494    552946
          weighted avg    0.91425   0.89953   0.89893    552946

mesolitica/fasttext-language-detection-ms-id
                        precision    recall  f1-score   support

           local-malay    0.95063   0.93858   0.94457    199961
                 other    0.97145   0.98889   0.98009    125920
socialmedia-indonesian    0.97923   0.96303   0.97106    213486
   standard-indonesian    0.99119   0.99610   0.99364    149055
        standard-malay    0.93743   0.95669   0.94696    149336

              accuracy                        0.96584    837758
             macro avg    0.96599   0.96866   0.96727    837758
          weighted avg    0.96591   0.96584   0.96582    837758

mesolitica/fasttext-language-detection-en
                  precision    recall  f1-score   support

   local-english    0.88991   0.89457   0.89223    149823
        manglish    0.98619   0.98479   0.98549    149535
           other    0.99439   0.99268   0.99354    140651
standard-english    0.89162   0.88967   0.89064    150703

        accuracy                        0.93952    590712
       macro avg    0.94053   0.94043   0.94047    590712
    weighted avg    0.93960   0.93952   0.93955    590712

Different models support different languages.
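The f1-score and macro avg columns in these reports follow from the precision and recall columns. As a minimal sketch, the snippet below recomputes them for the deep-model report in plain Python. The `f1` helper and the `deep_model` dictionary are illustrative names, not part of Malaya, and the values are copied from the report above, so the results agree only up to rounding.

```python
# Per-class (precision, recall) copied from the "deep-model" report above.
deep_model = {
    'eng': (0.96760, 0.97401),
    'ind': (0.97635, 0.96131),
    'malay': (0.96985, 0.98498),
    'manglish': (0.98036, 0.96569),
    'other': (0.99641, 0.99627),
    'rojak': (0.94221, 0.84302),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall (hypothetical helper)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Recompute the per-class f1-score column.
f1_scores = {label: f1(p, r) for label, (p, r) in deep_model.items()}

# Macro average is the unweighted mean over classes.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

for label, score in f1_scores.items():
    print(f'{label:>10} {score:.5f}')
print(f'{"macro avg":>10} {macro_f1:.5f}')
```

The low recall for `rojak` pulls its f1-score well below its precision, which is why the macro average sits lower than the weighted average in that report.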

List available language detection models

[3]:
malaya.language_detection.available_fasttext
[3]:
{'mesolitica/fasttext-language-detection-v1': {'Size (MB)': 353,
  'Quantized Size (MB)': 31.1,
  'dim': 16,
  'Label': {0: 'eng',
   1: 'ind',
   2: 'malay',
   3: 'manglish',
   4: 'other',
   5: 'rojak'}},
 'mesolitica/fasttext-language-detection-v2': {'Size (MB)': 1840,
  'Quantized Size (MB)': 227,
  'dim': 16,
  'Label': {0: 'standard-english',
   1: 'local-english',
   2: 'manglish',
   3: 'standard-indonesian',
   4: 'socialmedia-indonesian',
   5: 'standard-malay',
   6: 'local-malay',
   7: 'standard-mandarin',
   8: 'local-mandarin',
   9: 'other'}},
 'mesolitica/fasttext-language-detection-ms-id': {'Size (MB)': 537,
  'Quantized Size (MB)': 62.5,
  'dim': 16,
  'Label': {0: 'standard-indonesian',
   1: 'socialmedia-indonesian',
   2: 'standard-malay',
   3: 'local-malay',
   4: 'other'}},
 'mesolitica/fasttext-language-detection-bahasa-en': {'Size (MB)': 537,
  'Quantized Size (MB)': 62.5,
  'dim': 16,
  'Label': {0: 'bahasa', 1: 'english', 2: 'other'}},
 'mesolitica/fasttext-language-detection-en': {'Size (MB)': 383,
  'Quantized Size (MB)': 42.3,
  'dim': 16,
  'Label': {0: 'standard-english',
   1: 'local-english',
   2: 'manglish',
   3: 'other'}}}
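Because each model ships with its own label set, it can help to look up which models cover a given label before loading one. The sketch below does this over a trimmed copy of the listing above (size fields omitted); `registry` and `models_supporting` are hypothetical names for illustration. With Malaya installed, passing `malaya.language_detection.available_fasttext` instead should work, assuming it keeps the shape shown above.

```python
# Trimmed copy of the listing above: only the 'Label' entries are kept.
registry = {
    'mesolitica/fasttext-language-detection-v1': {
        'Label': {0: 'eng', 1: 'ind', 2: 'malay', 3: 'manglish',
                  4: 'other', 5: 'rojak'}},
    'mesolitica/fasttext-language-detection-v2': {
        'Label': {0: 'standard-english', 1: 'local-english', 2: 'manglish',
                  3: 'standard-indonesian', 4: 'socialmedia-indonesian',
                  5: 'standard-malay', 6: 'local-malay',
                  7: 'standard-mandarin', 8: 'local-mandarin', 9: 'other'}},
    'mesolitica/fasttext-language-detection-ms-id': {
        'Label': {0: 'standard-indonesian', 1: 'socialmedia-indonesian',
                  2: 'standard-malay', 3: 'local-malay', 4: 'other'}},
    'mesolitica/fasttext-language-detection-bahasa-en': {
        'Label': {0: 'bahasa', 1: 'english', 2: 'other'}},
    'mesolitica/fasttext-language-detection-en': {
        'Label': {0: 'standard-english', 1: 'local-english',
                  2: 'manglish', 3: 'other'}},
}


def models_supporting(label, models):
    """Return the names of models whose label set contains `label`."""
    return [name for name, info in models.items()
            if label in info['Label'].values()]


print(models_supporting('manglish', registry))
print(models_supporting('local-mandarin', registry))
```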
[4]:
chinese_text = '今天是6月18号,也是Muiriel的生日!'
english_text = 'i totally love it man'
indon_text = 'menjabat saleh perombakan menjabat periode komisi energi fraksi partai pengurus partai periode periode partai terpilih periode menjabat komisi perdagangan investasi persatuan periode'
malay_text = 'beliau berkata program Inisitif Peduli Rakyat (IPR) yang diperkenalkan oleh kerajaan negeri Selangor lebih besar sumbangannya'
socialmedia_malay_text = 'nti aku tengok dulu tiket dari kl pukul berapa ada nahh'
socialmedia_indon_text = 'saking kangen papanya pas vc anakku nangis'
rojak_text = 'jadi aku tadi bikin ini gengs dan dijual haha salad only k dan haha drinks only k'
manglish_text = 'power lah even shopback come to edmw riao'

Load fastText model

Make sure fastText is already installed. If not, simply run,

pip install fasttext
def fasttext(quantized: bool = True, **kwargs):

    """
    Load Fasttext language detection model.
    Original size is 353MB, Quantized size 31.1MB.

    Parameters
    ----------
    quantized: bool, optional (default=True)
        if True, load quantized fasttext model. Else, load original fasttext model.

    Returns
    -------
    result : malaya.model.ml.LanguageDetection class
    """

In this example, I compare the Malaya model against the pretrained fastText language identification model from Facebook, https://fasttext.cc/docs/en/language-identification.html

Simply download the pretrained model,

wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz
[5]:
# !wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz
[22]:
model = fasttext.load_model('lid.176.ftz')
fast_text = malaya.language_detection.fasttext()

Language detection in Malaya does not try to cover every language in the world; it focuses on hyperlocal languages only.

[7]:
model.predict(['suka makan ayam dan daging'])
[7]:
([['__label__id']], [array([0.6334154], dtype=float32)])
[8]:
fast_text.predict_proba(['suka makan ayam dan daging'])
[8]:
[{'standard-english': 0.0,
  'local-english': 0.0,
  'manglish': 0.0,
  'standard-indonesian': 0.0,
  'socialmedia-indonesian': 0.0,
  'standard-malay': 0.50445783,
  'local-malay': 0.0,
  'standard-mandarin': 0.0,
  'local-mandarin': 0.0,
  'other': 0.0}]
[9]:
model.predict(malay_text)
[9]:
(('__label__ms',), array([0.57101035]))
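The pretrained Facebook model returns labels with a `__label__` prefix alongside an array of probabilities, as in the output above. A small sketch for turning that pair into plain `(code, probability)` tuples follows; `parse_prediction` is a hypothetical helper, and a plain list stands in for the NumPy array so the snippet runs without fastText installed.

```python
PREFIX = '__label__'


def parse_prediction(prediction):
    """Convert a (labels, probabilities) pair into [(code, probability), ...].

    Assumes the single-string output shape shown above:
    (('__label__ms',), array([0.57101035])).
    """
    labels, probabilities = prediction
    parsed = []
    for label, probability in zip(labels, probabilities):
        # Strip the fastText label prefix if present, keep the language code.
        code = label[len(PREFIX):] if label.startswith(PREFIX) else label
        parsed.append((code, float(probability)))
    return parsed


# Value copied from the output above; a list replaces the NumPy array.
raw = (('__label__ms',), [0.57101035])
print(parse_prediction(raw))
```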
[10]:
fast_text.predict_proba([malay_text])
[10]:
[{'standard-english': 0.0,
  'local-english': 0.0,
  'manglish': 0.0,
  'standard-indonesian': 0.0,
  'socialmedia-indonesian': 0.0,
  'standard-malay': 0.9099521,
  'local-malay': 0.0,
  'standard-mandarin': 0.0,
  'local-mandarin': 0.0,
  'other': 0.0}]
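`predict_proba` returns one dictionary per input string, so a common follow-up step is to pick the highest-scoring label and reject low-confidence results. The sketch below is one way to do that; `top_label`, the `0.6` threshold, and the `'uncertain'` fallback are illustrative assumptions, not Malaya defaults. The two dictionaries are copied from the outputs above.

```python
def top_label(probabilities, threshold=0.6, fallback='uncertain'):
    """Return (label, score) for the best label, or (fallback, score)
    when the best score is below `threshold` (a hypothetical cut-off)."""
    label, score = max(probabilities.items(), key=lambda item: item[1])
    if score < threshold:
        return fallback, score
    return label, score


# Output for 'suka makan ayam dan daging', copied from above.
short_text = {
    'standard-english': 0.0, 'local-english': 0.0, 'manglish': 0.0,
    'standard-indonesian': 0.0, 'socialmedia-indonesian': 0.0,
    'standard-malay': 0.50445783, 'local-malay': 0.0,
    'standard-mandarin': 0.0, 'local-mandarin': 0.0, 'other': 0.0,
}

# Output for malay_text, copied from above.
long_text = dict(short_text, **{'standard-malay': 0.9099521})

print(top_label(short_text))
print(top_label(long_text))
```

A threshold like this trades coverage for precision: short inputs such as the five-word sentence tend to score lower, so they are more likely to be routed to the fallback.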
[11]:
model.predict(socialmedia_malay_text)
[11]:
(('__label__id',), array([0.7870034]))
[12]:
fast_text.predict_proba([socialmedia_malay_text])
[12]:
[{'standard-english': 0.0,
  'local-english': 0.0,
  'manglish': 0.0,
  'standard-indonesian': 0.0,
  'socialmedia-indonesian': 0.0,
  'standard-malay': 0.0,
  'local-malay': 0.9976433,
  'standard-mandarin': 0.0,
  'local-mandarin': 0.0,
  'other': 0.0}]
[13]:
model.predict(socialmedia_indon_text)
[13]:
(('__label__fr',), array([0.2912012]))
[14]:
fast_text.predict_proba([socialmedia_indon_text])
[14]:
[{'standard-english': 0.0,
  'local-english': 0.0,
  'manglish': 0.0,
  'standard-indonesian': 0.0,
  'socialmedia-indonesian': 1.00003,
  'standard-malay': 0.0,
  'local-malay': 0.0,
  'standard-mandarin': 0.0,
  'local-mandarin': 0.0,
  'other': 0.0}]
[15]:
model.predict(rojak_text)
[15]:
(('__label__id',), array([0.87948251]))
[16]:
fast_text.predict_proba([rojak_text])
[16]:
[{'standard-english': 0.0,
  'local-english': 0.0,
  'manglish': 0.0,
  'standard-indonesian': 0.0,
  'socialmedia-indonesian': 0.0,
  'standard-malay': 0.0,
  'local-malay': 0.9569701,
  'standard-mandarin': 0.0,
  'local-mandarin': 0.0,
  'other': 0.0}]
[17]:
model.predict(manglish_text)
[17]:
(('__label__en',), array([0.89707506]))
[18]:
fast_text.predict_proba([manglish_text])
[18]:
[{'standard-english': 0.0,
  'local-english': 0.0,
  'manglish': 0.99997073,
  'standard-indonesian': 0.0,
  'socialmedia-indonesian': 0.0,
  'standard-malay': 0.0,
  'local-malay': 0.0,
  'standard-mandarin': 0.0,
  'local-mandarin': 0.0,
  'other': 0.0}]
[19]:
model.predict(chinese_text)
[19]:
(('__label__zh',), array([0.97311586]))
[20]:
fast_text.predict_proba([chinese_text])
[20]:
[{'standard-english': 0.0,
  'local-english': 0.0,
  'manglish': 0.0,
  'standard-indonesian': 0.0,
  'socialmedia-indonesian': 0.0,
  'standard-malay': 0.0,
  'local-malay': 0.0,
  'standard-mandarin': 0.0,
  'local-mandarin': 0.5823944,
  'other': 0.0}]
[21]:
fast_text.predict_proba([indon_text,malay_text])
[21]:
[{'standard-english': 0.0,
  'local-english': 0.0,
  'manglish': 0.0,
  'standard-indonesian': 0.9755073,
  'socialmedia-indonesian': 0.0,
  'standard-malay': 0.0,
  'local-malay': 0.0,
  'standard-mandarin': 0.0,
  'local-mandarin': 0.0,
  'other': 0.0},
 {'standard-english': 0.0,
  'local-english': 0.0,
  'manglish': 0.0,
  'standard-indonesian': 0.0,
  'socialmedia-indonesian': 0.0,
  'standard-malay': 0.9099521,
  'local-malay': 0.0,
  'standard-mandarin': 0.0,
  'local-mandarin': 0.0,
  'other': 0.0}]
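To summarise the comparison above, the fine-grained Malaya labels can be collapsed to the ISO 639-1 codes that the Facebook model emits and the two sets of predictions checked for agreement. This is a rough sketch: the `TO_ISO` mapping (including treating `manglish` as `en`) is an assumption made for illustration, and the predictions are the top labels copied from the outputs above. Agreement only shows where the two models coincide, not which one is correct.

```python
# Assumed mapping from Malaya's v2 labels to ISO 639-1 codes.
TO_ISO = {
    'standard-english': 'en', 'local-english': 'en', 'manglish': 'en',
    'standard-indonesian': 'id', 'socialmedia-indonesian': 'id',
    'standard-malay': 'ms', 'local-malay': 'ms',
    'standard-mandarin': 'zh', 'local-mandarin': 'zh',
}

# Top label from lid.176.ftz for each example, copied from the outputs above.
lid176 = {
    'malay_text': 'ms',
    'socialmedia_malay_text': 'id',
    'socialmedia_indon_text': 'fr',
    'rojak_text': 'id',
    'manglish_text': 'en',
    'chinese_text': 'zh',
}

# Highest-probability label from Malaya for the same examples.
malaya_top = {
    'malay_text': 'standard-malay',
    'socialmedia_malay_text': 'local-malay',
    'socialmedia_indon_text': 'socialmedia-indonesian',
    'rojak_text': 'local-malay',
    'manglish_text': 'manglish',
    'chinese_text': 'local-mandarin',
}

# True where both models point at the same ISO code.
agreement = {
    name: TO_ISO.get(malaya_top[name]) == code
    for name, code in lid176.items()
}

for name, same in agreement.items():
    print(f'{name:<24} lid.176={lid176[name]:<3} '
          f'malaya={malaya_top[name]:<24} agree={same}')
print('agreements:', sum(agreement.values()), 'of', len(agreement))
```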