Preprocessing

This tutorial is available as an IPython notebook at Malaya/example/preprocessing.

[1]:
%%time
import malaya
CPU times: user 4.73 s, sys: 664 ms, total: 5.39 s
Wall time: 4.38 s

Available rules

We know that social media texts from Twitter, Facebook and Instagram are very noisy, and we want to clean them as much as possible so that our machines can understand sentence structure much better. In Malaya, we standardize our text preprocessing as follows,

  1. Malaya can replace special words with tokens to reduce the curse of dimensionality; rm10k becomes <money>.

  2. Malaya can put tags around special words; #drmahathir becomes <hashtag> drmahathir </hashtag>.

  3. Malaya can expand English contractions.

  4. Malaya can translate English words into Bahasa Malaysia words. Again, this translation uses a dictionary, so it does not understand semantics; its only purpose is to standardize the text into Bahasa Malaysia.

  5. Stemming and lemmatization, which requires a stemmer object.

  6. Normalizing elongated words, which requires a speller object.

  7. Expanding hashtags, #drmahathir becomes dr mahathir, which requires a segmentation object.

These are the default settings for preprocessing(),

def preprocessing(
    normalize: List[str] = [
        'url',
        'email',
        'percent',
        'money',
        'phone',
        'user',
        'time',
        'date',
        'number',
    ],
    annotate: List[str] = [
        'allcaps',
        'elongated',
        'repeated',
        'emphasis',
        'censored',
        'hashtag',
    ],
    lowercase: bool = True,
    fix_unidecode: bool = True,
    expand_english_contractions: bool = True,
    translate_english_to_bm: bool = True,
    speller = None,
    segmenter = None,
    stemmer = None,
    **kwargs,
):

normalize

Supported normalize,

  1. hashtag

  2. cashtag

  3. tag

  4. user

  5. emphasis

  6. censored

  7. acronym

  8. eastern_emoticons

  9. rest_emoticons

  10. emoji

  11. quotes

  12. percent

  13. repeat_puncts

  14. money

  15. email

  16. phone

  17. number

  18. allcaps

  19. url

  20. date

  21. time

You can check the full supported list with malaya.preprocessing.get_normalize().

For example, if you set money and number and the input string is RM10k, the output is <money>.
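
As a minimal sketch (the two-item subset below is purely illustrative), you can inspect the supported list and then pass only the rules you want,

supported_normalize = malaya.preprocessing.get_normalize()
print(supported_normalize)

# keep every other parameter at its default, only normalize money and number
p = malaya.preprocessing.preprocessing(normalize = ['money', 'number'])
# following the example above, 'RM10k' should be replaced with '<money>'
print(' '.join(p.process('harga rumah itu RM10k')))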

annotate

Supported annotate,

  1. hashtag

  2. allcaps

  3. elongated

  4. repeated

  5. emphasis

  6. censored

For example, if you set hashtag and the input string is #drmahathir, the output is <hashtag> drmahathir </hashtag>.
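
A similar minimal sketch for annotate (the one-item list below is purely illustrative),

# only annotate hashtags, keep the default normalize rules
p = malaya.preprocessing.preprocessing(annotate = ['hashtag'])
# following the example above, '#drmahathir' should come out as
# '<hashtag> drmahathir </hashtag>'
print(' '.join(p.process('#drmahathir')))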

[2]:
string_1 = 'CANT WAIT for the new season of #mahathirmohamad \(^o^)/!!! #davidlynch #tvseries :))), TAAAK SAAABAAR!!!'
string_2 = 'kecewanya #johndoe movie and it suuuuucks!!! WASTED RM10... rm10 #badmovies :/'
string_3 = "@husein:  can't wait for the Nov 9 #Sentiment talks!  YAAAAAAY !!! :-D http://sentimentsymposium.com/."
string_4 = 'aahhh, malasnye nak pegi keje harini #mondayblues'
string_5 = '#drmahathir #najibrazak #1malaysia #mahathirnajib'

Preprocessing Interface

def preprocessing(
    normalize: List[str] = [
        'url',
        'email',
        'percent',
        'money',
        'phone',
        'user',
        'time',
        'date',
        'number',
    ],
    annotate: List[str] = [
        'allcaps',
        'elongated',
        'repeated',
        'emphasis',
        'censored',
        'hashtag',
    ],
    lowercase: bool = True,
    fix_unidecode: bool = True,
    expand_english_contractions: bool = True,
    translate_english_to_bm: bool = True,
    speller = None,
    segmenter = None,
    stemmer = None,
    **kwargs,
):
    """
    Load Preprocessing class.

    Parameters
    ----------
    normalize: list
        normalizing tokens, can check all supported normalizing at `malaya.preprocessing.get_normalize()`.
    annotate: list
        annotate tokens <open></open>,
        only accept ['hashtag', 'allcaps', 'elongated', 'repeated', 'emphasis', 'censored'].
    lowercase: bool
    fix_unidecode: bool
    expand_english_contractions: bool
        expand english contractions
    translate_english_to_bm: bool
        translate english words to bahasa malaysia words
    speller: object
        spelling correction object, need to have a method `correct`
    segmenter: object
        segmentation object, need to have a method `segment`.
        If provided, it will expand hashtags, #mondayblues == monday blues
    stemmer: object
        stemmer object, need to have a method `stem`.
        If provided, it will stem or lemmatize the string.

    Returns
    -------
    result : malaya.preprocessing.PREPROCESSING class
    """

Load default parameters

The default parameters are able to translate most English words into Bahasa Malaysia.

[3]:
%%time
preprocessing = malaya.preprocessing.preprocessing()
CPU times: user 115 ms, sys: 18.4 ms, total: 134 ms
Wall time: 134 ms
[4]:
%%time
' '.join(preprocessing.process(string_1))
CPU times: user 2.1 ms, sys: 19 µs, total: 2.12 ms
Wall time: 2.12 ms
[4]:
'<allcaps> tak boleh tunggu </allcaps> untuk yang baru musim daripada <hashtag> mahathirmohamad </hashtag> \\(^o^)/ ! <repeated> <hashtag> davidlynch </hashtag> <hashtag> tvseries </hashtag> <happy> , <allcaps> taak <elongated> saabaar <elongated> </allcaps> ! <repeated>'
[5]:
%%time
' '.join(preprocessing.process(string_2))
CPU times: user 426 µs, sys: 3 µs, total: 429 µs
Wall time: 432 µs
[5]:
'kecewanya <hashtag> johndoe </hashtag> filem dan ia suucks <elongated> ! <repeated> <allcaps> dibazirkan </allcaps> <money> . <repeated> <money> <hashtag> badmovies </hashtag> <annoyed>'
[6]:
%%time
' '.join(preprocessing.process(string_3))
CPU times: user 413 µs, sys: 0 ns, total: 413 µs
Wall time: 416 µs
[6]:
'<user> : boleh tidak tunggu untuk yang <date> <hashtag> sentimen </hashtag> talks ! <allcaps> yaay <elongated> </allcaps> ! <repeated> :-d <url>'
[7]:
%%time
' '.join(preprocessing.process(string_4))
CPU times: user 391 µs, sys: 0 ns, total: 391 µs
Wall time: 398 µs
[7]:
'aahh <elongated> , malasnye nak pergi kerja hari ini <hashtag> mondayblues </hashtag>'
[8]:
%%time
' '.join(preprocessing.process(string_5))
CPU times: user 459 µs, sys: 12 µs, total: 471 µs
Wall time: 474 µs
[8]:
'<hashtag> drmahathir </hashtag> <hashtag> najibrazak </hashtag> <hashtag> 1 malaysia </hashtag> <hashtag> mahathirnajib </hashtag>'

Load default parameters with spelling correction to normalize elongated words

We saw that taak, saabaar and other elongated words are not the original words, so we can use spelling correction to normalize them.

[9]:
corrector = malaya.spell.probability()
[10]:
%%time
preprocessing = malaya.preprocessing.preprocessing(speller = corrector)
CPU times: user 85.5 ms, sys: 16.3 ms, total: 102 ms
Wall time: 101 ms
[11]:
%%time
' '.join(preprocessing.process(string_1))
CPU times: user 630 µs, sys: 7 µs, total: 637 µs
Wall time: 640 µs
[11]:
'<allcaps> tak boleh tunggu </allcaps> untuk yang baru musim daripada <hashtag> mahathirmohamad </hashtag> \\(^o^)/ ! <repeated> <hashtag> davidlynch </hashtag> <hashtag> tvseries </hashtag> <happy> , <allcaps> tidak <elongated> sabar <elongated> </allcaps> ! <repeated>'
[12]:
%%time
' '.join(preprocessing.process(string_2))
CPU times: user 445 µs, sys: 3 µs, total: 448 µs
Wall time: 451 µs
[12]:
'kecewanya <hashtag> johndoe </hashtag> filem dan ia sucks <elongated> ! <repeated> <allcaps> dibazirkan </allcaps> <money> . <repeated> <money> <hashtag> badmovies </hashtag> <annoyed>'
[13]:
%%time
' '.join(preprocessing.process(string_3))
CPU times: user 640 µs, sys: 12 µs, total: 652 µs
Wall time: 665 µs
[13]:
'<user> : boleh tidak tunggu untuk yang <date> <hashtag> sentimen </hashtag> talks ! <allcaps> yay <elongated> </allcaps> ! <repeated> :-d <url>'
[14]:
%%time
' '.join(preprocessing.process(string_4))
CPU times: user 495 µs, sys: 12 µs, total: 507 µs
Wall time: 530 µs
[14]:
'ah <elongated> , malasnye nak pergi kerja hari ini <hashtag> mondayblues </hashtag>'
[15]:
%%time
' '.join(preprocessing.process(string_5))
CPU times: user 327 µs, sys: 6 µs, total: 333 µs
Wall time: 346 µs
[15]:
'<hashtag> drmahathir </hashtag> <hashtag> najibrazak </hashtag> <hashtag> 1 malaysia </hashtag> <hashtag> mahathirnajib </hashtag>'

Load default parameters with a segmenter to expand hashtags

We saw <hashtag> drmahathir </hashtag> <hashtag> najibrazak </hashtag>; we want to expand these to become dr mahathir and najib razak.

[16]:
segmenter = malaya.segmentation.transformer(model = 'small', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:74: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:76: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/function/__init__.py:69: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

[17]:
segmenter = malaya.segmentation.transformer(model = 'small', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
[18]:
%%time
preprocessing = malaya.preprocessing.preprocessing(segmenter = segmenter)
CPU times: user 88.3 ms, sys: 18.9 ms, total: 107 ms
Wall time: 107 ms
[19]:
%%time
' '.join(preprocessing.process(string_1))
CPU times: user 1.61 s, sys: 1.83 s, total: 3.43 s
Wall time: 1.06 s
[19]:
'<allcaps> tak boleh tunggu </allcaps> untuk yang baru musim daripada <hashtag> mahathir mohamad </hashtag> \\(^o^)/ ! <repeated> <hashtag> davidlynch </hashtag> <hashtag> tv series </hashtag> <happy> , <allcaps> taak <elongated> saabaar <elongated> </allcaps> ! <repeated>'
[20]:
%%time
' '.join(preprocessing.process(string_2))
CPU times: user 726 ms, sys: 375 ms, total: 1.1 s
Wall time: 293 ms
[20]:
'kecewanya <hashtag> johndoe </hashtag> filem dan ia suucks <elongated> ! <repeated> <allcaps> dibazirkan </allcaps> <money> . <repeated> <money> <hashtag> bad movies </hashtag> <annoyed>'
[21]:
%%time
' '.join(preprocessing.process(string_3))
CPU times: user 332 ms, sys: 108 ms, total: 440 ms
Wall time: 112 ms
[21]:
'<user> : boleh tidak tunggu untuk yang <date> <hashtag> sentimen </hashtag> talks ! <allcaps> yaay <elongated> </allcaps> ! <repeated> :-d <url>'
[22]:
%%time
' '.join(preprocessing.process(string_4))
CPU times: user 525 ms, sys: 592 ms, total: 1.12 s
Wall time: 237 ms
[22]:
'aahh <elongated> , malasnye nak pergi kerja hari ini <hashtag> mondayblues </hashtag>'
[23]:
%%time
' '.join(preprocessing.process(string_5))
CPU times: user 1.5 s, sys: 575 ms, total: 2.07 s
Wall time: 516 ms
[23]:
'<hashtag> dr mahathir </hashtag> <hashtag> najib razak </hashtag> <hashtag> 1 malaysia </hashtag> <hashtag> mahathir najib </hashtag>'

Load default parameters with stemming and lemmatization

[44]:
sastrawi = malaya.stem.sastrawi()
[45]:
%%time
preprocessing = malaya.preprocessing.preprocessing(stemmer = sastrawi)
CPU times: user 112 ms, sys: 18.4 ms, total: 130 ms
Wall time: 129 ms
[26]:
%%time
' '.join(preprocessing.process(string_1))
CPU times: user 11.6 ms, sys: 846 µs, total: 12.5 ms
Wall time: 12.2 ms
[26]:
'<allcaps> tak boleh tunggu </allcaps> untuk yang baru musim daripada <hashtag> mahathirmohamad </hashtag> o  <repeated> <hashtag> davidlynch </hashtag> <hashtag> tvseries </hashtag> <happy>  <allcaps> taak <elongated> saabaar <elongated> </allcaps>  <repeated>'
[47]:
%%time
' '.join(preprocessing.process(string_2))
CPU times: user 5.61 ms, sys: 503 µs, total: 6.11 ms
Wall time: 5.71 ms
[47]:
'kecewa <hashtag> johndoe </hashtag> filem dan ia suucks <elongated>  <repeated> <allcaps> dibazirkan </allcaps> <money>  <repeated> <money> <hashtag> badmovies </hashtag> <annoyed>'
[28]:
%%time
' '.join(preprocessing.process(string_3))
CPU times: user 2.13 ms, sys: 57 µs, total: 2.19 ms
Wall time: 2.25 ms
[28]:
'<user>  boleh tidak tunggu untuk yang <date> <hashtag> sentimen </hashtag> talks  <allcaps> yaay <elongated> </allcaps>  <repeated> -d <url>'
[29]:
%%time
' '.join(preprocessing.process(string_4))
CPU times: user 1.81 ms, sys: 20 µs, total: 1.83 ms
Wall time: 1.91 ms
[29]:
'aahh <elongated>  malasnye nak pergi kerja hari ini <hashtag> mondayblues </hashtag>'
[30]:
%%time
' '.join(preprocessing.process(string_5))
CPU times: user 1.91 ms, sys: 13 µs, total: 1.92 ms
Wall time: 1.95 ms
[30]:
'<hashtag> drmahathir </hashtag> <hashtag> najibrazak </hashtag> <hashtag> 1 malaysia </hashtag> <hashtag> mahathirnajib </hashtag>'
[46]:
%%time
' '.join(preprocessing.process('saya disini berjalan pergi ke putrajaya, #masjidbesi'))
CPU times: user 3.45 ms, sys: 30 µs, total: 3.48 ms
Wall time: 3.49 ms
[46]:
'saya sini jalan pergi ke putrajaya  <hashtag> masjidbesi </hashtag>'
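
The optional objects do not have to be used one at a time; a minimal sketch that combines the speller, segmenter and stemmer loaded earlier in this tutorial,

# corrector, segmenter and sastrawi are the objects created in the cells above
preprocessing = malaya.preprocessing.preprocessing(
    speller = corrector,
    segmenter = segmenter,
    stemmer = sastrawi,
)
' '.join(preprocessing.process(string_1))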

Disable English translation

But there are basic normalizations that cannot be overridden; for example, for automatically becomes untuk. You can check the entire set of default normalizations with from malaya.texts._tatabahasa import rules_normalizer.
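
A minimal sketch to inspect that mapping, assuming rules_normalizer is a dict-like mapping from the informal form to the standard form,

from malaya.texts._tatabahasa import rules_normalizer

# 'for' is mapped to 'untuk', as mentioned above
print(rules_normalizer.get('for'))

# peek at a few entries
print(list(rules_normalizer.items())[:10])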

[31]:
%%time
preprocessing = malaya.preprocessing.preprocessing(translate_english_to_bm = False)
CPU times: user 96 µs, sys: 1 µs, total: 97 µs
Wall time: 101 µs
[32]:
%%time
' '.join(preprocessing.process(string_1))
CPU times: user 867 µs, sys: 7 µs, total: 874 µs
Wall time: 891 µs
[32]:
'<allcaps> tak boleh wait </allcaps> untuk the new season of <hashtag> mahathirmohamad </hashtag> \\(^o^)/ ! <repeated> <hashtag> davidlynch </hashtag> <hashtag> tvseries </hashtag> <happy> , <allcaps> taak <elongated> saabaar <elongated> </allcaps> ! <repeated>'
[33]:
%%time
' '.join(preprocessing.process(string_2))
CPU times: user 509 µs, sys: 9 µs, total: 518 µs
Wall time: 538 µs
[33]:
'kecewanya <hashtag> johndoe </hashtag> movie and it suucks <elongated> ! <repeated> <allcaps> wasted </allcaps> <money> . <repeated> <money> <hashtag> badmovies </hashtag> <annoyed>'
[34]:
%%time
' '.join(preprocessing.process(string_3))
CPU times: user 477 µs, sys: 6 µs, total: 483 µs
Wall time: 519 µs
[34]:
'<user> : can not wait untuk the <date> <hashtag> sentiment </hashtag> talks ! <allcaps> yaay <elongated> </allcaps> ! <repeated> :-d <url>'

Tokenizer

It is able to tokenize using multiple regex pipelines; you can check the list from malaya.preprocessing.get_normalize().

[35]:
tokenizer = malaya.preprocessing.TOKENIZER().tokenize
[36]:
tokenizer(string_1)
[36]:
['CANT',
 'WAIT',
 'for',
 'the',
 'new',
 'season',
 'of',
 '#mahathirmohamad',
 '\(^o^)/',
 '!',
 '!',
 '!',
 '#davidlynch',
 '#tvseries',
 ':)))',
 ',',
 'TAAAK',
 'SAAABAAR',
 '!',
 '!',
 '!']
[37]:
tokenizer(string_2)
[37]:
['kecewanya',
 '#johndoe',
 'movie',
 'and',
 'it',
 'suuuuucks',
 '!',
 '!',
 '!',
 'WASTED',
 'RM10',
 '.',
 '.',
 '.',
 'rm10',
 '#badmovies',
 ':/']
[38]:
tokenizer(string_3)
[38]:
['@husein',
 ':',
 'can',
 "'",
 't',
 'wait',
 'for',
 'the',
 'Nov 9',
 '#Sentiment',
 'talks',
 '!',
 'YAAAAAAY',
 '!',
 '!',
 '!',
 ':-D',
 'http://sentimentsymposium.com/.']
[39]:
tokenizer('saya nak makan ayam harga rm10k')
[39]:
['saya', 'nak', 'makan', 'ayam', 'harga', 'rm10k']
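
Since the tokenizer uses the same regex pipelines, entities such as URLs, emails and dates are kept as single tokens, as seen with 'Nov 9' and the URL above. A minimal sketch (the string below is purely illustrative),

tokenizer('emel saya di husein@example.com, website https://malaya.readthedocs.io, jumpa Nov 9')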