{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Rules-based Normalizer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/normalizer](https://github.com/huseinzol05/Malaya/tree/master/example/normalizer).\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "os.environ['CUDA_VISIBLE_DEVICES'] = ''" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import logging\n", "\n", "logging.basicConfig(level=logging.INFO)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.\n", " warn(\"The installed version of bitsandbytes was compiled without GPU support. \"\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmppnxxs_oa\n", "INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmppnxxs_oa/_remote_module_non_scriptable.py\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.81 s, sys: 3.92 s, total: 6.73 s\n", "Wall time: 2.01 s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n", "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n" ] } ], "source": [ "%%time\n", "import malaya" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "string1 = 'xjdi ke, y u xsuke makan HUSEIN kt situ tmpt, i hate it. 
pelikle, pada'\n", "string2 = 'i mmg2 xske mknn HUSEIN kampng tmpat, i love them. pelikle saye'\n", "string3 = 'perdana menteri ke11 sgt suka makn ayam, harganya cuma rm15.50'\n", "string4 = 'pada 10/4, kementerian mengumumkan, 1/100'\n", "string5 = 'Husein Zolkepli dapat tempat ke-12 lumba lari hari ni'\n", "string6 = 'Husein Zolkepli (2011 - 2019) adalah ketua kampng di kedah sekolah King Edward ke-IV'\n", "string7 = '2jam 30 minit aku tunggu kau, 60.1 kg kau ni, suhu harini 31.2c, aku dahaga minum 600ml'\n", "string8 = 'awak sangat hot ye 🔥🔥. 🔥🙂'\n", "string9 = 'hanyalah rm2 ribu'\n", "string10 = 'mulakn slh org boleh ,bila geng tuh kena slhkn jgk xboleh trima .. pelik , dia slhkn org bole hri2 crta sakau then bila kna bls balik xdpt jwb ,kata mcm biasa slh (parti sampah) 🤣🤣🤣 jgn mulakn dlu slhkn org kalau xboleh trima bila kna bls balik 🤣🤣🤣'\n", "string11 = 'Pemimpin yg hebat, panahan2 fitnah tu akan dituju kepadanya.. harap DS terus bersabar. Jasa baik DS menjadi asbab di sana kelak mahupun rakyat yg terhutang budi juga..'\n", "string12 = 'berehatlh najib.. sudah2 lh tu.. jgn buat rakyat hilang kepercyaan tu pda system kehakiman negara.. klu btl x slh kenapa x dibuktikan semasa sblm rayuan.. sudah lah tu kami dh letih dengan drama korang. ok'\n", "string13 = 'DSNR satu satunya legasi kpd negara penyambung perjuangan bangsa melayu..jatuhnya beliau dek kerana fitnah dan dengkinya manusia..semoga Allah lindungi Najib Bin Razak dunia dan akhirat..Aamiin'\n", "string14 = 'Muhammad Najib sbb malaysiakini dah daftar.... Klu dia fitnah...tertuduh boleh saman.... Klu berita2 yg x daftar...tu yg susah nak saman...sbb x tahu owner'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load normalizer\n", "\n", "1. normalizer can load any spelling correction model, eg, `malaya.spelling_correction.probability.load`, or `malaya.spelling_correction.transformer.load`.\n", "2. 
normalizer can load any stemmer model, eg, `malaya.stem.deep_model`.\n", "\n", "```python\n", "def load(\n", " speller: Callable = None,\n", " stemmer: Callable = None,\n", " **kwargs,\n", "):\n", " \"\"\"\n", " Load a Normalizer using any spelling correction model.\n", "\n", " Parameters\n", " ----------\n", " speller: Callable, optional (default=None)\n", " function to correct spelling, must have `correct` or `normalize_elongated` method.\n", " stemmer: Callable, optional (default=None)\n", " function to stem, must have `stem_word` method.\n", " If provide stemmer, will accurately to stem kata imbuhan akhir.\n", "\n", " Returns\n", " -------\n", " result: malaya.normalizer.rules.Normalizer class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true }, "outputs": [], "source": [ "lm = malaya.language_model.kenlm(model = 'bahasa-wiki-news')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya_boilerplate.huggingface:downloading frozen huseinzol05/v27-preprocessing/bm_1grams.json\n" ] } ], "source": [ "corrector = malaya.spelling_correction.probability.load(language_model = lm)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya_boilerplate.huggingface:downloading frozen mesolitica/stem-lstm-512/model.pt\n", "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n" ] } ], "source": [ "stemmer = malaya.stem.huggingface()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/dev/malaya/malaya/normalizer/rules.py:204: FutureWarning: Possible nested set at position 42\n", " k.lower(): re.compile(_expressions[k]) for k, v in _expressions.items()\n", 
"/home/husein/dev/malaya/malaya/normalizer/rules.py:204: FutureWarning: Possible nested set at position 3\n", " k.lower(): re.compile(_expressions[k]) for k, v in _expressions.items()\n" ] } ], "source": [ "normalizer = malaya.normalizer.rules.load(corrector, stemmer)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### normalize\n", "\n", "```python\n", "def normalize(\n", " self,\n", " string: str,\n", " normalize_text: bool = True,\n", " normalize_url: bool = False,\n", " normalize_email: bool = False,\n", " normalize_year: bool = True,\n", " normalize_telephone: bool = True,\n", " normalize_date: bool = True,\n", " normalize_time: bool = True,\n", " normalize_emoji: bool = True,\n", " normalize_elongated: bool = True,\n", " normalize_hingga: bool = True,\n", " normalize_pada_hari_bulan: bool = True,\n", " normalize_fraction: bool = True,\n", " normalize_money: bool = True,\n", " normalize_units: bool = True,\n", " normalize_percent: bool = True,\n", " normalize_ic: bool = True,\n", " normalize_number: bool = True,\n", " normalize_x_kali: bool = True,\n", " normalize_cardinal: bool = True,\n", " normalize_ordinal: bool = True,\n", " normalize_entity: bool = True,\n", " expand_contractions: bool = True,\n", " check_english_func=is_english,\n", " check_malay_func=is_malay,\n", " translator: Callable = None,\n", " language_detection_word: Callable = None,\n", " acceptable_language_detection: List[str] = ['EN', 'CAPITAL', 'NOT_LANG'],\n", " segmenter: Callable = None,\n", " text_scorer: Callable = None,\n", " text_scorer_window: int = 2,\n", " not_a_word_threshold: float = 1e-4,\n", " dateparser_settings={'TIMEZONE': 'GMT+8'},\n", " **kwargs,\n", "):\n", " \"\"\"\n", " Normalize a string.\n", "\n", " Parameters\n", " ----------\n", " string : str\n", " normalize_text: bool, optional (default=True)\n", " if True, will try to replace shortforms with internal corpus.\n", " normalize_url: bool, optional (default=False)\n", " if True, replace `://` with 
empty and `.` with `dot`.\n", " `https://huseinhouse.com` -> `https huseinhouse dot com`.\n", " normalize_email: bool, optional (default=False)\n", " if True, replace `@` with `di`, `.` with `dot`.\n", " `husein.zol05@gmail.com` -> `husein dot zol kosong lima di gmail dot com`.\n", " normalize_year: bool, optional (default=True)\n", " if True, `tahun 1987` -> `tahun sembilan belas lapan puluh tujuh`.\n", " if True, `1970-an` -> `sembilan belas tujuh puluh an`.\n", " if False, `tahun 1987` -> `tahun seribu sembilan ratus lapan puluh tujuh`.\n", " normalize_telephone: bool, optional (default=True)\n", " if True, `no 012-1234567` -> `no kosong satu dua, satu dua tiga empat lima enam tujuh`\n", " normalize_date: bool, optional (default=True)\n", " if True, `01/12/2001` -> `satu disember dua ribu satu`.\n", " if True, `Jun 2017` -> `satu Jun dua ribu tujuh belas`.\n", " if True, `2017 Jun` -> `satu Jun dua ribu tujuh belas`.\n", " if False, `2017 Jun` -> `01/06/2017`.\n", " if False, `Jun 2017` -> `01/06/2017`.\n", " normalize_time: bool, optional (default=True)\n", " if True, `pukul 2.30` -> `pukul dua tiga puluh minit`.\n", " if False, `pukul 2.30` -> `'02:00:00'`\n", " normalize_emoji: bool, (default=True)\n", " if True, `🔥` -> `emoji api`\n", " Load from `malaya.preprocessing.demoji`.\n", " normalize_elongated: bool, optional (default=True)\n", " if True, `betuii` -> `betui`.\n", " normalize_hingga: bool, optional (default=True)\n", " if True, `2011 - 2019` -> `dua ribu sebelas hingga dua ribu sembilan belas`\n", " normalize_pada_hari_bulan: bool, optional (default=True)\n", " if True, `pada 10/4` -> `pada sepuluh hari bulan empat`\n", " normalize_fraction: bool, optional (default=True)\n", " if True, `10 /4` -> `sepuluh per empat`\n", " normalize_money: bool, optional (default=True)\n", " if True, `rm10.4m` -> `sepuluh juta empat ratus ribu ringgit`\n", " normalize_units: bool, optional (default=True)\n", " if True, `61.2 kg` -> `enam puluh satu perpuluhan dua 
kilogram`\n", " normalize_percent: bool, optional (default=True)\n", " if True, `0.8%` -> `kosong perpuluhan lapan peratus`\n", " normalize_ic: bool, optional (default=True)\n", " if True, `911111-01-1111` -> `sembilan satu satu satu satu satu sempang kosong satu sempang satu satu satu satu`\n", " normalize_number: bool, optional (default=True)\n", " if True `0123` -> `kosong satu dua tiga`\n", " normalize_x_kali: bool, optional (default=True)\n", " if True `10x` -> 'sepuluh kali'\n", " normalize_cardinal: bool, optional (default=True)\n", " if True, `123` -> `seratus dua puluh tiga`\n", " normalize_ordinal: bool, optional (default=True)\n", " if True, `ke-123` -> `keseratus dua puluh tiga`\n", " normalize_entity: bool, optional (default=True)\n", " normalize entities, only effect `date`, `datetime`, `time` and `money` patterns string only.\n", " expand_contractions: bool, optional (default=True)\n", " expand english contractions.\n", " check_english_func: Callable, optional (default=malaya.text.function.is_english)\n", " function to check a word in english dictionary, default is malaya.text.function.is_english.\n", " this parameter also will be use for malay text normalization.\n", " check_malay_func: Callable, optional (default=malaya.text.function.is_malay)\n", " function to check a word in malay dictionary, default is malaya.text.function.is_malay.\n", " translator: Callable, optional (default=None)\n", " function to translate EN word to MS word.\n", " language_detection_word: Callable, optional (default=None)\n", " function to detect language for each words to get better translation results.\n", " acceptable_language_detection: List[str], optional (default=['EN', 'CAPITAL', 'NOT_LANG'])\n", " only translate substrings if the results from `language_detection_word` is in `acceptable_language_detection`.\n", " segmenter: Callable, optional (default=None)\n", " function to segmentize word.\n", " If provide, it will expand a word, apaitu -> apa itu\n", " 
text_scorer: Callable, optional (default=None)\n", " function to validate upper word. \n", " If lower case score is higher or equal than upper case score, will choose lower case.\n", " text_scorer_window: int, optional (default=2)\n", " size of lookback and lookforward to validate upper word.\n", " not_a_word_threshold: float, optional (default=1e-4)\n", " assume a word is not a human word if score lower than `not_a_word_threshold`.\n", " only usable if passed `text_scorer` parameter.\n", " dateparser_settings: Dict, optional (default={'TIMEZONE': 'GMT+8'})\n", " default dateparser setting, check support settings at https://dateparser.readthedocs.io/en/latest/\n", "\n", " Returns\n", " -------\n", " result: {'normalize', 'date', 'money'}\n", " \"\"\"\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a more reliable English checker, we prefer pyenchant (https://pyenchant.github.io/pyenchant/). The resulting function can be passed to `normalize` as `check_english_func`." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import enchant\n", "d = enchant.Dict('en_US')\n", "\n", "is_english = lambda x: d.check(x)\n", "is_english('lifestyle')" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "string = 'boleh dtg 8pagi esok tak atau minggu depan? 2 oktober 2019 2pm, tlong bayar rm 3.2k sekali tau'" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.normalizer.rules:caching malaya.preprocessing.demoji inside normalizer\n", "spaces_between_special_tokens is deprecated and will be removed in transformers v5. It was adding spaces between `added_tokens`, not special tokens, and does not exist in our fast implementation. 
Future tokenizers will handle the decoding process on a per-model rule.\n" ] }, { "data": { "text/plain": [ "{'normalize': 'boleh dtg pukul lapan esok tak atau minggu depan ? dua Oktober dua ribu sembilan belas pukul empat belas , tolong bayar tiga ribu dua ratus ringgit sekali tau',\n", " 'date': {'minggu depan': datetime.datetime(2023, 10, 20, 14, 3, 50, 902256),\n", " '8AM esok': datetime.datetime(2023, 10, 14, 8, 0),\n", " '2 oktober 2019 2pm': datetime.datetime(2019, 10, 2, 14, 0)},\n", " 'money': {'rm 3.2k': 'RM3200.0'}}" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize(string)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'boleh dtg pukul lapan esok tak atau minggu depan ? dua Oktober dua ribu sembilan belas pukul empat belas , tolong bayar tiga ribu dua ratus ringgit sekali tau',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize(string, normalize_entity = False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As the outputs above show, with `normalize_entity = True` (the default) the Malaya normalizer also parses `minggu depan` into a `datetime` object and `rm 3.2k` into `RM3200.0`. With `normalize_entity = False`, the `date` and `money` fields stay empty." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'normalize': 'tak jadi ke , kenapa awak tak suka makan HUSEIN kt situ tmpt , saya hate itu . peliklah , pada', 'date': {}, 'money': {}}\n", "{'normalize': 'saya memang-memang tak suka makan HUSEIN kampung tempat , saya love them . 
peliklah saya', 'date': {}, 'money': {}}\n", "{'normalize': 'perdana menteri kesebelas sgt suka makan ayam , harganya cuma lima belas ringgit lima puluh sen', 'date': {}, 'money': {'rm15.50': 'RM15.50'}}\n", "{'normalize': 'pada sepuluh hari bulan empat , kementerian mengumumkan , satu per seratus', 'date': {}, 'money': {}}\n", "{'normalize': 'Husein Zolkepli dapat tempat kedua belas lumba lari hari ni', 'date': {}, 'money': {}}\n", "{'normalize': 'Husein Zolkepli ( dua ribu sebelas hingga dua ribu sembilan belas ) adalah ketua kampung di kedah sekolah King Edward keempat', 'date': {}, 'money': {}}\n", "{'normalize': 'dua jam tiga puluh minit aku tunggu kau , enam puluh perpuluhan satu kilogram kau ni , suhu harini tiga puluh satu perpuluhan dua celsius , aku dahaga minum enam ratus milliliter', 'date': {'2jam': datetime.datetime(2023, 10, 13, 12, 3, 51, 358111)}, 'money': {}}\n", "{'normalize': 'awak sangat hot ye , emoji api , emoji api . Emoji api , emoji muka tersenyum sedikit', 'date': {}, 'money': {}}\n", "{'normalize': 'hanyalah dua ribu ringgit', 'date': {}, 'money': {'rm2 ribu': 'RM2000.0'}}\n", "{'normalize': 'mulakan slh org boleh , bila geng tuh kena salahkan jgk tak boleh trima . . pelik , dia salahkan org bole hari-hari cerita sakau then bila kena bilas balik tak dapat jwb , kata mcm biasa slh ( parti sampah ) , emoji berguling di lantai ketawa , emoji berguling di lantai ketawa , emoji berguling di lantai ketawa , jgn mulakan dlu salahkan org kalau tak boleh trima bila kena bilas balik , emoji berguling di lantai ketawa , emoji berguling di lantai ketawa , emoji berguling di lantai ketawa', 'date': {}, 'money': {}}\n", "{'normalize': 'Pemimpin yg hebat , panah-panahan fitnah tu akan dituju kepadanya . . harap DS terus bersabar . Jasa baik DS menjadi asbab di sana kelak mahupun rakyat yg terhutang budi juga . .', 'date': {}, 'money': {}}\n", "{'normalize': 'berehatlah najib . . sudah-sudah lh tu . . 
jgn buat rakyat hilang kepercayaan tu pda system kehakiman negara . . klu betul tak slh kenapa tak dibuktikan semasa sblm rayuan . . sudah lah tu kami dh letih dengan drama korang . ok', 'date': {}, 'money': {}}\n", "{'normalize': 'DATUK SERI NAJIB RAZAK satu satunya legasi kpd negara penyambung perjuangan bangsa melayu . . jatuhnya beliau dek kerana fitnah dan dengkinya manusia . semoga Allah lindungi Najib Bin Razak dunia dan akhirat . . Aamiin', 'date': {}, 'money': {}}\n", "{'normalize': 'Muhammad Najib sbb malaysiakini dah daftar . . . . Kalau dia fitnah . . . tertuduh boleh saman . . . . Kalau berita-berita yg tak daftar . . tu yg susah nak saman . . sbb tak tahu owner', 'date': {}, 'money': {}}\n" ] } ], "source": [ "print(normalizer.normalize(string1))\n", "print(normalizer.normalize(string2))\n", "print(normalizer.normalize(string3))\n", "print(normalizer.normalize(string4))\n", "print(normalizer.normalize(string5))\n", "print(normalizer.normalize(string6))\n", "print(normalizer.normalize(string7))\n", "print(normalizer.normalize(string8))\n", "print(normalizer.normalize(string9))\n", "print(normalizer.normalize(string10))\n", "print(normalizer.normalize(string11))\n", "print(normalizer.normalize(string12))\n", "print(normalizer.normalize(string13))\n", "print(normalizer.normalize(string14))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use translator\n", "\n", "To use a translator, pass a callable to the `translator` parameter,\n", "\n", "```python\n", "print(normalizer.normalize(string1, translator = translator))\n", "```" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya_boilerplate.huggingface:downloading frozen mesolitica/word-en-ms/dictionary.json\n" ] } ], "source": [ "en_ms_vocab = malaya.translation.word(model = 'mesolitica/word-en-ms')" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], 
"source": [ "translator = lambda x: en_ms_vocab.get(x, x)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('sakit', 'aduh')" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "translator('pain'), translator('aduh')" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'normalize': 'tak jadi ke , kenapa awak tak suka makan HUSEIN kt situ tmpt , saya benci ia . peliklah , pada', 'date': {}, 'money': {}}\n" ] } ], "source": [ "print(normalizer.normalize(string1, translator = translator))" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'normalize': 'saya memang-memang tak suka makan HUSEIN kampung tempat , saya cinta mereka . peliklah saya', 'date': {}, 'money': {}}\n" ] } ], "source": [ "print(normalizer.normalize(string2, translator = translator))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Use Neural Machine Translation\n", "\n", "A dictionary-based translator only works for words that exist in the dictionary; any other word is returned untranslated. A neural machine translation model can be passed as `translator` instead." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('cinta', 'mereka', 'sakit')" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "translator('love'), translator('them'), translator('pain')" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`, it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again. 
You will see the new `added_tokens_decoder` attribute that will store the relevant information.\n", "You are using the default legacy behaviour of the . If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565\n", "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n" ] } ], "source": [ "nmt = malaya.translation.huggingface()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "nmt_func = lambda x: nmt.generate([x], to_lang = 'ms', max_length = 256)[0]" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'normalize': 'tak jadi ke , kenapa awak tak suka makan HUSEIN kt situ tmpt , saya benci ia . peliklah , pada', 'date': {}, 'money': {}}\n" ] } ], "source": [ "print(normalizer.normalize(string1, translator = nmt_func))" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'normalize': 'saya memang-memang tak suka makan HUSEIN kampung tempat , saya cinta mereka . 
peliklah Saya', 'date': {}, 'money': {}}\n" ] } ], "source": [ "print(normalizer.normalize(string2, translator = nmt_func))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use segmenter" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'normalize': 'saya taksuka ayam , tapi saya sukaikan', 'date': {}, 'money': {}}\n" ] } ], "source": [ "print(normalizer.normalize('saya taksuka ayam, tapi saya sukaikan'))" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "segmenter = malaya.segmentation.huggingface()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "segmenter_func = lambda x: segmenter.generate([x], max_length = 128)[0]" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'normalize': 'saya tidak suka ayam , tapi saya suka ikan', 'date': {}, 'money': {}}\n" ] } ], "source": [ "print(normalizer.normalize('saya taksuka ayam, tapi saya sukaikan', segmenter = segmenter_func))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use stemmer\n", "\n", "By default normalizer will ignore kata imbuhan akhir, so to stem kata imbuhan akhir, provide `stemmer` parameter." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "normalizer_without_stem = malaya.normalize.normalizer(corrector, check_malay_func = None)\n", "normalizer_stem = malaya.normalize.normalizer(corrector, stemmer = stemmer, check_malay_func = None)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.normalizer.rules:caching malaya.preprocessing.demoji inside normalizer\n" ] }, { "data": { "text/plain": [ "{'normalize': 'berehatlah najib . . sudah-sudah lh tu . . 
jgn buat rakyat hilang kepercayaan tu pda system kehakiman negara . . klu betul tak slh kenapa tak dibuktikan semasa sblm rayuan . . sudah lah tu kami dh letih dengan drama korang . ok',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer_without_stem.normalize(string12)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.normalizer.rules:caching malaya.preprocessing.demoji inside normalizer\n" ] }, { "data": { "text/plain": [ "{'normalize': 'berehatlah najib . . sudah-sudah lh tu . . jgn buat rakyat hilang kepercayaan tu pda system kehakiman negara . . klu betul tak slh kenapa tak dibuktikan semasa sblm rayuan . . sudah lah tu kami dh letih dengan drama korang . ok',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer_stem.normalize(string12)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'DATUK SERI NAJIB RAZAK satu satunya legasi kpd negara penyambung perjuangan bangsa melayu . . jatuhnya beliau dek kerana fitnah dan dengkinya manusia . semoga Allah lindungi Najib Bin Razak dunia dan akhirat . . Aamiin',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize(string13)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'DATUK SERI NAJIB RAZAK satu satunya legasi kpd negara penyambung perjuangan bangsa melayu . . jatuhnya beliau dek kerana fitnah dan dengkinya manusia . semoga Allah lindungi Najib Bin Razak dunia dan akhirat . . 
Aamiin',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer_without_stem.normalize(string13)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'DATUK SERI NAJIB RAZAK satu satunya legasi kpd negara penyambung perjuangan bangsa melayu . . jatuhnya beliau dek kerana fitnah dan dengkinya manusia . semoga Allah lindungi Najib Bin Razak dunia dan akhirat . . Aamiin',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer_stem.normalize(string13)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'seadilnya', 'date': {}, 'money': {}}" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer_without_stem.normalize('seadil2nya')" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'seadil-adilnya', 'date': {}, 'money': {}}" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer_stem.normalize('seadil2nya')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Validate uppercase\n", "\n", "In social media text, people sometimes capitalize common nouns (kata nama am), and the normalizer skips spelling correction for capitalized words. To fix that, pass the `text_scorer` parameter: if the lowercase form scores at least as high as the capitalized form, the lowercase form is used."
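, "\n",
"\n",
"One way to picture what `text_scorer` is used for: score the word in its context with and without the capital letter, and keep the lowercase form when it scores at least as high. The sketch below is only an illustration under that assumption. `prefer_lowercase` and `toy_scorer` are hypothetical names, not Malaya's implementation, and the toy scorer merely stands in for a real score such as `lm.score`.\n",
"\n",
"```python\n",
"def prefer_lowercase(tokens, index, scorer, window=2):\n",
"    # Hypothetical helper: score the word inside its context window,\n",
"    # once as written and once lowercased, then keep the likelier form.\n",
"    left = tokens[max(0, index - window):index]\n",
"    right = tokens[index + 1:index + 1 + window]\n",
"    word = tokens[index]\n",
"    as_written = scorer(' '.join(left + [word] + right))\n",
"    lowered = scorer(' '.join(left + [word.lower()] + right))\n",
"    return word.lower() if lowered >= as_written else word\n",
"\n",
"# Toy scorer (assumption): counts words found in a tiny vocabulary.\n",
"known = {'skali', 'kedai', 'bukak'}\n",
"toy_scorer = lambda text: sum(w in known for w in text.split())\n",
"\n",
"tokens = ['skali', 'Kedai', 'x', 'bukak']\n",
"print(prefer_lowercase(tokens, 1, toy_scorer))\n",
"```\n",
"\n",
"With this toy scorer the capitalized `Kedai` is lowered to `kedai`, because the lowercase context scores higher."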
] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.00012796330028274245" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import math\n", "math.exp(lm.score('hi'))" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "text_scorer = lambda x: lm.score(x)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'Konon nak beat the crowd , skali kedai tak bukak haha @ Chef Ammar Xpress Souk Cafe https://t.co/QrcBlq6ftV',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t = 'Konon nak beat the crowd, skali Kedai x bukak ahaha @ Chef Ammar Xpress Souk Cafe https://t.co/QrcBlq6ftV'\n", "normalizer.normalize(t, text_scorer = text_scorer)" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'lapan emiten cum dividen Pekan Ini , jangan ketinggalan https://t.co/9BV9OqqJUG',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t = '8 Emiten Cum Dividen Pekan Ini, Jangan Ketinggalan https://t.co/9BV9OqqJUG'\n", "normalizer.normalize(t, text_scorer = text_scorer)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Validate non-human words\n", "\n", "A non-human word such as `kasdsahdas` or `kasweadsa` can be a laughing pattern or a cursing pattern. To detect it, pass any text scoring function as `text_scorer`. If the score of a word is lower than `not_a_word_threshold`, spelling correction is skipped for that word."
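, "\n",
"\n",
"A minimal sketch of such a threshold check, assuming the scorer returns a log-probability (as the `math.exp(lm.score('hi'))` cell above suggests). `is_human_word`, `toy_scores` and `toy_scorer` are hypothetical names for illustration, not Malaya's code.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"def is_human_word(word, scorer, threshold=1e-4):\n",
"    # Hypothetical helper: convert the log score to a probability\n",
"    # and compare it with the threshold.\n",
"    return math.exp(scorer(word)) >= threshold\n",
"\n",
"# Toy log scores (assumption) standing in for a language model.\n",
"toy_scores = {'bodo': math.log(1e-3), 'hasdsadwq': math.log(1e-9)}\n",
"toy_scorer = lambda w: toy_scores.get(w, math.log(1e-12))\n",
"\n",
"print(is_human_word('bodo', toy_scorer))\n",
"print(is_human_word('hasdsadwq', toy_scorer))\n",
"```\n",
"\n",
"Here `bodo` passes the check while `hasdsadwq` falls below the threshold, so only the former would be a candidate for spelling correction."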
] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'bodo la sial hasdsadwq', 'date': {}, 'money': {}}" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('bodo la siallll hasdsadwq', text_scorer = text_scorer)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Skip spelling correction\n", "\n", "To skip spelling correction, pass `None` as the `speller` argument when creating the normalizer, e.g. `malaya.normalize.normalizer(None)`. `None` is also the default." ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "normalizer = malaya.normalize.normalizer(corrector)\n", "without_corrector_normalizer = malaya.normalize.normalizer(None)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.normalizer.rules:caching malaya.preprocessing.demoji inside normalizer\n" ] }, { "data": { "text/plain": [ "{'normalize': 'saya memang-memang tak suka makanan HUSEIN kampung tempat , saya love them . pelikla saya',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize(string2, normalize_elongated = False)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.normalizer.rules:caching malaya.preprocessing.demoji inside normalizer\n" ] }, { "data": { "text/plain": [ "{'normalize': 'saya memang-memang tak suka mknn HUSEIN kampng tmpat , saya love them . 
pelikla saya',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "without_corrector_normalizer.normalize(string2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pass preprocessing kwargs\n", "\n", "Suppose you want to skip normalizing date patterns. You can pass tokenizer kwargs when creating the normalizer; the available word tokenizer kwargs are listed at https://malaya.readthedocs.io/en/latest/load-tokenizer-word.html" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 2558\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n", "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3088\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n" ] } ], "source": [ "normalizer = malaya.normalize.normalizer(corrector)\n", "skip_date_normalizer = malaya.normalize.normalizer(corrector, date = False)" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.normalizer.rules:caching malaya.preprocessing.demoji inside normalizer\n" ] }, { "data": { "text/plain": [ "{'normalize': 'tarikh program tersebut empat belas Mei dua ribu dua puluh tiga',\n", " 'date': {'14 mei': datetime.datetime(2023, 5, 14, 0, 0)},\n", " 'money': {}}" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('tarikh program tersebut 14 mei')" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.normalizer.rules:caching malaya.preprocessing.demoji inside normalizer\n" ] }, { "data": { "text/plain": [ "{'normalize': 'tarikh program tersebut empat belas mei',\n", " 
'date': {'14 mei': datetime.datetime(2023, 5, 14, 0, 0)},\n", " 'money': {}}" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "skip_date_normalizer.normalize('tarikh program tersebut 14 mei')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalize text\n", "\n", "If True,\n", "\n", "1. replace `xkisah` -> `tak kisah`.\n", "2. replace `berehatlh` -> `berehatlah`.\n", "3. replace `seadil2nya` -> `seadil-adilnya`.\n", "4. apply spelling correction if passed `speller` parameter.\n", "5. standardize laughing pattern.\n", "6. standardize mengeluh pattern.\n", "7. normalize title,\n", "\n", "```python\n", "{\n", " 'dr': 'Doktor',\n", " 'yb': 'Yang Berhormat',\n", " 'hj': 'Haji',\n", " 'ybm': 'Yang Berhormat Mulia',\n", " 'tyt': 'Tuan Yang Terutama',\n", " 'yab': 'Yang Berhormat',\n", " 'ybm': 'Yang Berhormat Mulia',\n", " 'yabhg': 'Yang Amat Berbahagia',\n", " 'ybhg': 'Yang Berbahagia',\n", " 'miss': 'Cik',\n", "}\n", "```\n", "\n", "Simply `normalizer.normalize(string, normalize_text = True)`, default is `True`." 
] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "normalizer = malaya.normalize.normalizer(corrector, stemmer)" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.normalizer.rules:caching malaya.preprocessing.demoji inside normalizer\n" ] }, { "data": { "text/plain": [ "{'normalize': 'tak kisah', 'date': {}, 'money': {}}" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('xkisah')" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'berehatlah', 'date': {}, 'money': {}}" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('berehatlh')" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'seadil-adilnya', 'date': {}, 'money': {}}" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('seadil2nya')" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'bukan-bukan', 'date': {}, 'money': {}}" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('bukan2')" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'bukan-bukan haha', 'date': {}, 'money': {}}" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('bukan2 wkwkwkw')" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'bukan-bukan aduh', 'date': {}, 'money': {}}" ] }, "execution_count": 64, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "normalizer.normalize('bukan2 haih')" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'dia sakai haha', 'date': {}, 'money': {}}" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('dia sakai hhihihu')" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'aduh maaflah', 'date': {}, 'money': {}}" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('hais sorrylah')" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'Doktor yahaya', 'date': {}, 'money': {}}" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('Dr yahaya')" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'mulakan slh org boleh , bila geng tuh kena salahkan jgk tak boleh trima',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('mulakn slh org boleh ,bila geng tuh kena slhkn jgk xboleh trima')" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'betul la , bodo btul', 'date': {}, 'money': {}}" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('aah la, bodo btul')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalize url\n", "\n", "Let say you have an `url` word, example, `https://huseinhouse.com`, this parameter going to,\n", "\n", "If True,\n", "\n", "1. replace `://` with empty string.\n", "2. 
replace `.` with ` dot `.\n", "3. replace digits with string representation.\n", "4. capitalize `https`, `http`, and `www`.\n", "\n", "Simply `normalizer.normalize(string, normalize_url = True)`, default is `False`." ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [], "source": [ "normalizer = malaya.normalize.normalizer()" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.normalizer.rules:caching malaya.preprocessing.demoji inside normalizer\n" ] }, { "data": { "text/plain": [ "{'normalize': 'web saya ialah https://huseinhouse.com',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('web saya ialah https://huseinhouse.com')" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'web saya ialah HTTPS huseinhouse dot com',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('web saya ialah https://huseinhouse.com', normalize_url = True)" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'web saya ialah HTTPS huseinhouse kosong dua sembilan tiga empat dot com',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('web saya ialah https://huseinhouse02934.com', normalize_url = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalize email\n", "\n", "Let's say the text contains an email address, for example `husein.zol05@gmail.com`.\n", "\n", "If True,\n", "\n", "1. replace `://` with an empty string.\n", "2. replace `.` with ` dot `.\n", "3. replace `@` with ` di `.\n", "4. 
replace digits with string representation.\n", "\n", "Simply `normalizer.normalize(string, normalize_email = True)`, default is `False`." ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [], "source": [ "normalizer = malaya.normalize.normalizer()" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.normalizer.rules:caching malaya.preprocessing.demoji inside normalizer\n" ] }, { "data": { "text/plain": [ "{'normalize': 'email saya ialah husein.zol05@gmail.com',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('email saya ialah husein.zol05@gmail.com')" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'email saya ialah husein dot zol kosong lima di gmail dot com',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('email saya ialah husein.zol05@gmail.com', normalize_email = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalize year\n", "\n", "1. if True, `tahun 1987` -> `tahun sembilan belas lapan puluh tujuh`.\n", "2. if True, `1970-an` -> `sembilan belas tujuh puluhan`.\n", "3. if False, `tahun 1987` -> `tahun seribu sembilan ratus lapan puluh tujuh`.\n", "\n", "Simply `normalizer.normalize(string, normalize_year = True)`, default is `True`." 
] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [], "source": [ "normalizer = malaya.normalize.normalizer()" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.normalizer.rules:caching malaya.preprocessing.demoji inside normalizer\n" ] }, { "data": { "text/plain": [ "{'normalize': 'empat ratus dollar pada tahun sembilan belas sembilan puluh lapan berbanding lebih seribu dollar',\n", " 'date': {},\n", " 'money': {'$400 ': '$400', '$1000': '$1000'}}" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('$400 pada tahun 1998 berbanding lebih $1000')" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'empat ratus dollar pada sembilan belas tujuh puluhan berbanding lebih seribu dollar',\n", " 'date': {},\n", " 'money': {'$400 ': '$400', '$1000': '$1000'}}" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('$400 pada 1970-an berbanding lebih $1000')" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'empat ratus dollar pada tahun sembilan belas tujuh puluhan berbanding lebih seribu dollar',\n", " 'date': {},\n", " 'money': {'$400 ': '$400', '$1000': '$1000'}}" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('$400 pada tahun 1970-an berbanding lebih $1000')" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'empat ratus dollar pada tahun seribu sembilan ratus sembilan puluh lapan berbanding lebih seribu dollar',\n", " 'date': {},\n", " 'money': {'$400 ': '$400', '$1000': '$1000'}}" ] }, "execution_count": 81, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "normalizer.normalize('$400 pada tahun 1998 berbanding lebih $1000', normalize_year = False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalize telephone\n", "\n", "1. if True, `no 012-1234567` -> `no kosong satu dua, satu dua tiga empat lima enam tujuh`.\n", "\n", "Simply `normalizer.normalize(string, normalize_telephone = True)`, default is `True`." ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [], "source": [ "normalizer = malaya.normalize.normalizer()" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.normalizer.rules:caching malaya.preprocessing.demoji inside normalizer\n" ] }, { "data": { "text/plain": [ "{'normalize': 'no saya kosong satu dua, satu dua tiga empat lima enam tujuh',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('no saya 012-1234567')" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'no saya 012-1234567', 'date': {}, 'money': {}}" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('no saya 012-1234567', normalize_telephone = False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalize date\n", "\n", "1. if True, `01/12/2001` -> `satu disember dua ribu satu`.\n", "2. if False, normalize date string to `%d/%m/%y`.\n", "\n", "Simply `normalizer.normalize(string, normalize_date = True)`, default is `True`." 
] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [], "source": [ "normalizer = malaya.normalize.normalizer()" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.normalizer.rules:caching malaya.preprocessing.demoji inside normalizer\n" ] }, { "data": { "text/plain": [ "{'normalize': 'saya akan gerak pada sebelas Januari dua ribu dua puluh satu',\n", " 'date': {'1/11/2021': datetime.datetime(2021, 1, 11, 0, 0)},\n", " 'money': {}}" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('saya akan gerak pada 1/11/2021')" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'saya akan gerak pada 11/01/2021',\n", " 'date': {'1/11/2021': datetime.datetime(2021, 1, 11, 0, 0)},\n", " 'money': {}}" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('saya akan gerak pada 1/11/2021', normalize_date = False)" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'satu November dua ribu sembilan belas',\n", " 'date': {'1 nov 2019': datetime.datetime(2019, 11, 1, 0, 0)},\n", " 'money': {}}" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('1 nov 2019')" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': '01/11/2019',\n", " 'date': {'1 nov 2019': datetime.datetime(2019, 11, 1, 0, 0)},\n", " 'money': {}}" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('1 nov 2019', normalize_date = False)" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "data": { "text/plain": [ 
"{'normalize': 'satu Januari seribu sembilan ratus sembilan puluh enam',\n", " 'date': {'januari 1 1996': datetime.datetime(1996, 1, 1, 0, 0)},\n", " 'money': {}}" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('januari 1 1996')" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': '01/01/1996',\n", " 'date': {'januari 1 1996': datetime.datetime(1996, 1, 1, 0, 0)},\n", " 'money': {}}" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('januari 1 1996', normalize_date = False)" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'tiga belas Januari dua ribu sembilan belas',\n", " 'date': {'januari 2019': datetime.datetime(2019, 1, 13, 0, 0)},\n", " 'money': {}}" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('januari 2019')" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': '13/01/2019',\n", " 'date': {'januari 2019': datetime.datetime(2019, 1, 13, 0, 0)},\n", " 'money': {}}" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('januari 2019', normalize_date = False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalize time\n", "\n", "1. if True, `pukul 2.30` -> `pukul dua tiga puluh minit`.\n", "2. if False `2:01pm` -> `pukul 14.01`.\n", "\n", "Simply `normalizer.normalize(string, normalize_time = True)`, default is `True`." 
] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [], "source": [ "normalizer = malaya.normalize.normalizer()" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.normalizer.rules:caching malaya.preprocessing.demoji inside normalizer\n" ] }, { "data": { "text/plain": [ "{'normalize': 'Operasi tamat sepenuhnya pada pukul satu tiga puluh minit tengah hari',\n", " 'date': {'pukul 1:30': datetime.datetime(2023, 10, 13, 1, 30)},\n", " 'money': {}}" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = 'Operasi tamat sepenuhnya pada pukul 1.30 tengah hari'\n", "normalizer.normalize(s, normalize_time = True)" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'Operasi tamat sepenuhnya pada pukul 01.30 tengah hari',\n", " 'date': {'pukul 1:30': datetime.datetime(2023, 10, 13, 1, 30)},\n", " 'money': {}}" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = 'Operasi tamat sepenuhnya pada pukul 1.30 tengah hari'\n", "normalizer.normalize(s, normalize_time = False)" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'Operasi tamat sepenuhnya pada pukul satu tiga puluh minit lima puluh saat tengah hari',\n", " 'date': {'pukul 1:30:50': datetime.datetime(2023, 10, 13, 1, 30, 50)},\n", " 'money': {}}" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = 'Operasi tamat sepenuhnya pada pukul 1:30:50 tengah hari'\n", "normalizer.normalize(s, normalize_time = True)" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'Operasi tamat sepenuhnya pada pukul 01.30:50 tengah hari',\n", " 'date': {'pukul 1:30:50': 
datetime.datetime(2023, 10, 13, 1, 30, 50)},\n", " 'money': {}}" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = 'Operasi tamat sepenuhnya pada pukul 1:30:50 tengah hari'\n", "normalizer.normalize(s, normalize_time = False)" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'pukul empat belas satu minit',\n", " 'date': {'2:01pm': datetime.datetime(2023, 10, 13, 14, 1)},\n", " 'money': {}}" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('2:01pm')" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'pukul 14.01',\n", " 'date': {'2:01pm': datetime.datetime(2023, 10, 13, 14, 1)},\n", " 'money': {}}" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('2:01pm', normalize_time = False)" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'pukul dua',\n", " 'date': {'2am': datetime.datetime(2023, 10, 13, 2, 0)},\n", " 'money': {}}" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('2AM')" ] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'pukul 02',\n", " 'date': {'2am': datetime.datetime(2023, 10, 13, 2, 0)},\n", " 'money': {}}" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('2AM', normalize_time = False)" ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'pukul empat belas',\n", " 'date': {'2pm': datetime.datetime(2023, 10, 13, 14, 0)},\n", " 'money': {}}" ] }, "execution_count": 103, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "normalizer.normalize('2pm')" ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'pukul 14',\n", " 'date': {'2pm': datetime.datetime(2023, 10, 13, 14, 0)},\n", " 'money': {}}" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('2pm', normalize_time = False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalize emoji\n", "\n", "1. if True, `🔥` -> `emoji api`\n", "\n", "Simply `normalizer.normalize(string, normalize_emoji = True)`, default is `True`." ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [], "source": [ "normalizer = malaya.normalize.normalizer()" ] }, { "cell_type": "code", "execution_count": 109, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'awak adalah betul-betul sial panas , emoji api',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 109, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = 'u are really damn hot 🔥'\n", "normalizer.normalize(s, translator = nmt_func)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalize elongated\n", "\n", "Any typical elongated word, eg, `pppeeddaaaasss` - > `pedas`, but this elongated normalization required to pass `speller` parameter to perform the best.\n", "\n", "Simply `normalizer.normalize(string, normalize_elongated = True)`, default is `True`." 
] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [], "source": [ "normalizer = malaya.normalize.normalizer(corrector, stemmer)" ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.normalizer.rules:caching malaya.preprocessing.demoji inside normalizer\n" ] }, { "data": { "text/plain": [ "{'normalize': 'saya tak suka makan pedas', 'date': {}, 'money': {}}" ] }, "execution_count": 111, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('saayyyyaa ttttaaak ssssukaaa makaan pedas')" ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'saayyyyaa ttttaaak ssssukaaa makaan pedas',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('saayyyyaa ttttaaak ssssukaaa makaan pedas', normalize_elongated = False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalize hingga\n", "\n", "If True,\n", "\n", "1. `2011 - 2019` -> `dua ribu sebelas hingga dua ribu sembilan belas`.\n", "2. `2011.01-2019` - > `dua ribu sebelas perpuluhan kosong satu hingga dua ribu sembilan belas`.\n", "\n", "Simply `normalizer.normalize(string, normalize_hingga = True)`, default is `True`." 
] }, { "cell_type": "code", "execution_count": 113, "metadata": {}, "outputs": [], "source": [ "normalizer = malaya.normalize.normalizer()" ] }, { "cell_type": "code", "execution_count": 114, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.normalizer.rules:caching malaya.preprocessing.demoji inside normalizer\n" ] }, { "data": { "text/plain": [ "{'normalize': 'dua ribu sebelas hingga dua ribu sembilan belas',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 114, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('2011 - 2019', normalize_hingga = True)" ] }, { "cell_type": "code", "execution_count": 115, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'dua ribu sebelas - dua ribu sembilan belas',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('2011 - 2019', normalize_hingga = False)" ] }, { "cell_type": "code", "execution_count": 116, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'normalize': '2011 - 2019', 'date': {}, 'money': {}}" ] }, "execution_count": 116, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('2011 - 2019', normalize_hingga = False, normalize_cardinal = False, normalize_ordinal = False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalize pada hari bulan\n", "\n", "If True,\n", "\n", "1. `pada 10/4` -> `pada sepuluh hari bulan empat`.\n", "\n", "Simply `normalizer.normalize(string, normalize_pada_hari_bulan = True)`, default is `True`." 
] }, { "cell_type": "code", "execution_count": 117, "metadata": {}, "outputs": [], "source": [ "normalizer = malaya.normalize.normalizer()" ] }, { "cell_type": "code", "execution_count": 118, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.normalizer.rules:caching malaya.preprocessing.demoji inside normalizer\n" ] }, { "data": { "text/plain": [ "{'normalize': 'pada sepuluh hari bulan empat', 'date': {}, 'money': {}}" ] }, "execution_count": 118, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('pada 10/ 4', normalize_pada_hari_bulan = True)" ] }, { "cell_type": "code", "execution_count": 119, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'pada sepuluh per empat', 'date': {}, 'money': {}}" ] }, "execution_count": 119, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('pada 10/4', normalize_pada_hari_bulan = False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalize fraction\n", "\n", "If True,\n", "\n", "1. `10/4` -> `sepuluh per empat`.\n", "\n", "Simply `normalizer.normalize(string, normalize_fraction = True)`, default is `True`." 
] }, { "cell_type": "code", "execution_count": 120, "metadata": {}, "outputs": [], "source": [ "normalizer = malaya.normalize.normalizer()" ] }, { "cell_type": "code", "execution_count": 121, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.normalizer.rules:caching malaya.preprocessing.demoji inside normalizer\n" ] }, { "data": { "text/plain": [ "{'normalize': 'sepuluh per empat', 'date': {}, 'money': {}}" ] }, "execution_count": 121, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('10/4', normalize_fraction = True)" ] }, { "cell_type": "code", "execution_count": 122, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'dua ratus satu ribu dua ratus tiga puluh satu perpuluhan satu per empat',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 122, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('201231.1 / 4', normalize_fraction = True)" ] }, { "cell_type": "code", "execution_count": 123, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'dua ratus satu ribu dua ratus tiga puluh satu perpuluhan satu / empat',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 123, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('201231.1 / 4', normalize_fraction = False)" ] }, { "cell_type": "code", "execution_count": 124, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': '201231.1 / 4', 'date': {}, 'money': {}}" ] }, "execution_count": 124, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('201231.1 / 4', normalize_fraction = False, normalize_cardinal = False,\n", " normalize_ordinal = False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalize money\n", "\n", "If True,\n", "\n", "1. `RM10.5` -> `sepuluh ringgit lima puluh sen`.\n", "2. 
`rm 10.5 sen` -> `sepuluh ringgit lima puluh sen`.\n", "3. `20.2m ringgit` -> `dua puluh juta dua ratus ribu ringgit`.\n", "\n", "And so much more!\n", "\n", "Simply `normalizer.normalize(string, normalize_money = True)`, default is `True`." ] }, { "cell_type": "code", "execution_count": 125, "metadata": {}, "outputs": [], "source": [ "normalizer = malaya.normalize.normalizer()" ] }, { "cell_type": "code", "execution_count": 126, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.normalizer.rules:caching malaya.preprocessing.demoji inside normalizer\n" ] }, { "data": { "text/plain": [ "{'normalize': 'sepuluh ringgit lima puluh sen',\n", " 'date': {},\n", " 'money': {'rm10.5': 'RM10.5'}}" ] }, "execution_count": 126, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('RM10.5')" ] }, { "cell_type": "code", "execution_count": 127, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'sepuluh ringgit lima puluh sen',\n", " 'date': {},\n", " 'money': {'rm 10.5': 'RM10.5'}}" ] }, "execution_count": 127, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('rm 10.5 sen')" ] }, { "cell_type": "code", "execution_count": 128, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'sepuluh ringgit lima belas sen',\n", " 'date': {},\n", " 'money': {'1015 sen': 'RM10.15'}}" ] }, "execution_count": 128, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('1015 sen')" ] }, { "cell_type": "code", "execution_count": 129, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'sepuluh juta empat ratus ribu ringgit',\n", " 'date': {},\n", " 'money': {'rm10.4m': 'RM10400000.0'}}" ] }, "execution_count": 129, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('rm10.4m')" ] }, { "cell_type": "code", "execution_count": 130, 
"metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'sepuluh ribu empat ratus dollar',\n", " 'date': {},\n", " 'money': {'$10.4k': '$10400.0'}}" ] }, "execution_count": 130, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('$10.4K')" ] }, { "cell_type": "code", "execution_count": 131, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'dua puluh dua ribu lima ratus dua belas ringgit tiga ribu tiga ratus tiga puluh empat sen',\n", " 'date': {},\n", " 'money': {'22.5123334k ringgit': 'RM22512.3334'}}" ] }, "execution_count": 131, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('22.5123334k ringgit')" ] }, { "cell_type": "code", "execution_count": 132, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'saya ada dua puluh juta dua ratus ribu ringgit',\n", " 'date': {},\n", " 'money': {'20.2m ringgit': 'RM20200000.0'}}" ] }, "execution_count": 132, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('saya ada 20.2m ringgit')" ] }, { "cell_type": "code", "execution_count": 133, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': '22.5123334k ringgit',\n", " 'date': {},\n", " 'money': {'22.5123334k ringgit': 'RM22512.3334'}}" ] }, "execution_count": 133, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('22.5123334k ringgit', normalize_money = False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalize units\n", "\n", "Able to normalize temperature, distance, volume, duration and weight units.\n", "\n", "If True,\n", "\n", "1. `61.2 kg` -> `enam puluh satu perpuluhan dua kilogram`.\n", "2. `61.2km` -> `sepuluh ringgit lima puluh sen`.\n", "\n", "And so much more!\n", "\n", "Simply `normalizer.normalize(string, normalize_units = True)`, default is `True`." 
] }, { "cell_type": "code", "execution_count": 134, "metadata": {}, "outputs": [], "source": [ "normalizer = malaya.normalize.normalizer()" ] }, { "cell_type": "code", "execution_count": 135, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.normalizer.rules:caching malaya.preprocessing.demoji inside normalizer\n" ] }, { "data": { "text/plain": [ "{'normalize': 'enam puluh satu perpuluhan dua kilogram',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 135, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('61.2 KG')" ] }, { "cell_type": "code", "execution_count": 136, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'enam puluh satu perpuluhan dua kilometer',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 136, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('61.2km')" ] }, { "cell_type": "code", "execution_count": 137, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'enam puluh satu perpuluhan dua celsius',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 137, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('61.2c')" ] }, { "cell_type": "code", "execution_count": 138, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'enam puluh satu perpuluhan dua milliliter',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 138, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('61.2 ml')" ] }, { "cell_type": "code", "execution_count": 139, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'enam puluh satu perpuluhan dua liter', 'date': {}, 'money': {}}" ] }, "execution_count": 139, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('61.2 l')" ] }, { "cell_type": "code", "execution_count": 140, "metadata": 
{}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'enam puluh satu perpuluhan dua jam',\n", " 'date': {'61:2 jam': datetime.datetime(2023, 10, 13, 12, 9, 48, 124543)},\n", " 'money': {}}" ] }, "execution_count": 140, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('61.2 jam')" ] }, { "cell_type": "code", "execution_count": 141, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'enam puluh satu perpuluhan dua hari', 'date': {}, 'money': {}}" ] }, "execution_count": 141, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('61.2 hari')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalize percents\n", "\n", "1. If True, `61.2%` -> `enam puluh satu perpuluhan dua peratus`.\n", "\n", "Simply `normalizer.normalize(string, normalize_percent = True)`, default is `True`." ] }, { "cell_type": "code", "execution_count": 142, "metadata": {}, "outputs": [], "source": [ "normalizer = malaya.normalize.normalizer()" ] }, { "cell_type": "code", "execution_count": 143, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.normalizer.rules:caching malaya.preprocessing.demoji inside normalizer\n" ] }, { "data": { "text/plain": [ "{'normalize': 'enam puluh satu perpuluhan dua peratus',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 143, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('61.2%')" ] }, { "cell_type": "code", "execution_count": 144, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': '61.2%', 'date': {}, 'money': {}}" ] }, "execution_count": 144, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('61.2%', normalize_percent = False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalize IC\n", "\n", "1. 
If True, `911111-01-1111` -> `sembilan satu satu satu satu satu sempang kosong satu sempang satu satu satu satu`.\n", "\n", "Simply `normalizer.normalize(string, normalize_ic = True)`, default is `True`." ] }, { "cell_type": "code", "execution_count": 145, "metadata": {}, "outputs": [], "source": [ "normalizer = malaya.normalize.normalizer()" ] }, { "cell_type": "code", "execution_count": 146, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.normalizer.rules:caching malaya.preprocessing.demoji inside normalizer\n" ] }, { "data": { "text/plain": [ "{'normalize': 'sembilan satu satu satu satu satu sempang kosong satu sempang satu satu satu satu',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 146, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('911111-01-1111')" ] }, { "cell_type": "code", "execution_count": 147, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': '911111-01-1111', 'date': {}, 'money': {}}" ] }, "execution_count": 147, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('911111-01-1111', normalize_ic = False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalize Numbers\n", "\n", "If a number starts with `0`, it is read digit by digit instead of as a cardinal.\n", "\n", "1. If True, `0123` -> `kosong satu dua tiga`.\n", "\n", "Simply `normalizer.normalize(string, normalize_number = True)`, default is `True`." 
] }, { "cell_type": "code", "execution_count": 148, "metadata": {}, "outputs": [], "source": [ "normalizer = malaya.normalize.normalizer()" ] }, { "cell_type": "code", "execution_count": 149, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.normalizer.rules:caching malaya.preprocessing.demoji inside normalizer\n" ] }, { "data": { "text/plain": [ "{'normalize': 'kosong satu dua tiga empat', 'date': {}, 'money': {}}" ] }, "execution_count": 149, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('01234')" ] }, { "cell_type": "code", "execution_count": 150, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': '01234', 'date': {}, 'money': {}}" ] }, "execution_count": 150, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('01234', normalize_number = False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalize x kali\n", "\n", "If a word is a number followed by `x`, the `x` is read as `kali`; this flag controls whether the number is also spelled out.\n", "\n", "1. If True, `10x` -> `sepuluh kali`.\n", "2. If False, `10x` -> `10 kali`.\n", "\n", "Simply `normalizer.normalize(string, normalize_x_kali = True)`, default is `True`." 
] }, { "cell_type": "code", "execution_count": 151, "metadata": {}, "outputs": [], "source": [ "normalizer = malaya.normalize.normalizer()" ] }, { "cell_type": "code", "execution_count": 152, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.normalizer.rules:caching malaya.preprocessing.demoji inside normalizer\n" ] }, { "data": { "text/plain": [ "{'normalize': 'saya sokong sepuluh kali', 'date': {}, 'money': {}}" ] }, "execution_count": 152, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('saya sokong 10x')" ] }, { "cell_type": "code", "execution_count": 153, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'saya sokong 10 kali', 'date': {}, 'money': {}}" ] }, "execution_count": 153, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('saya sokong 10x', normalize_x_kali = False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalize Cardinals\n", "\n", "Numbers are converted to words using `malaya.num2word.to_cardinal`.\n", "\n", "1. If True, `123` -> `seratus dua puluh tiga`.\n", "\n", "Simply `normalizer.normalize(string, normalize_cardinal = True)`, default is `True`." 
] }, { "cell_type": "code", "execution_count": 154, "metadata": {}, "outputs": [], "source": [ "normalizer = malaya.normalize.normalizer()" ] }, { "cell_type": "code", "execution_count": 155, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.normalizer.rules:caching malaya.preprocessing.demoji inside normalizer\n" ] }, { "data": { "text/plain": [ "{'normalize': 'seratus dua puluh tiga', 'date': {}, 'money': {}}" ] }, "execution_count": 155, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('123')" ] }, { "cell_type": "code", "execution_count": 156, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'seratus dua puluh tiga perpuluhan satu dua tiga empat dua satu dua tiga satu',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 156, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('123.123421231')" ] }, { "cell_type": "code", "execution_count": 157, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': '123.123421231', 'date': {}, 'money': {}}" ] }, "execution_count": 157, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('123.123421231', normalize_cardinal = False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalize Ordinals\n", "\n", "Numbers are converted to ordinal words using `malaya.num2word.to_ordinal`.\n", "\n", "1. If True, `123` -> `keseratus dua puluh tiga`. In the examples below, plain digits take the ordinal form only when `normalize_cardinal = False`.\n", "2. Roman numerals are also normalized, `ke-XXI` -> `kedua puluh satu`.\n", "\n", "Simply `normalizer.normalize(string, normalize_ordinal = True)`, default is `True`." 
] }, { "cell_type": "code", "execution_count": 158, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'keseratus dua puluh tiga', 'date': {}, 'money': {}}" ] }, "execution_count": 158, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('123', normalize_cardinal = False)" ] }, { "cell_type": "code", "execution_count": 159, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': '123', 'date': {}, 'money': {}}" ] }, "execution_count": 159, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('123', normalize_cardinal = False, normalize_ordinal = False)" ] }, { "cell_type": "code", "execution_count": 160, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'kedua puluh satu', 'date': {}, 'money': {}}" ] }, "execution_count": 160, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize('ke-XXI')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalize entity\n", "\n", "Extract entities into the `date` and `money` fields of the result. Only strings matching `date`, `datetime`, `time` and `money` patterns are affected.\n", "\n", "Simply `normalizer.normalize(string, normalize_entity = True)`, default is `True`." ] }, { "cell_type": "code", "execution_count": 161, "metadata": {}, "outputs": [], "source": [ "string = 'boleh dtg 8pagi esok tak atau minggu depan? 2 oktober 2019 2pm, tlong bayar rm 3.2k sekali tau'" ] }, { "cell_type": "code", "execution_count": 162, "metadata": {}, "outputs": [], "source": [ "normalizer = malaya.normalize.normalizer(corrector, stemmer)" ] }, { "cell_type": "code", "execution_count": 163, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya.normalizer.rules:caching malaya.preprocessing.demoji inside normalizer\n" ] }, { "data": { "text/plain": [ "{'normalize': 'boleh dtg pukul lapan esok tak atau minggu depan ? 
dua Oktober dua ribu sembilan belas pukul empat belas , tolong bayar tiga ribu dua ratus ringgit sekali tau',\n", " 'date': {'minggu depan': datetime.datetime(2023, 10, 20, 14, 10, 18, 111175),\n", " '8AM esok': datetime.datetime(2023, 10, 14, 8, 0),\n", " '2 oktober 2019 2pm': datetime.datetime(2019, 10, 2, 14, 0)},\n", " 'money': {'rm 3.2k': 'RM3200.0'}}" ] }, "execution_count": 163, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize(string)" ] }, { "cell_type": "code", "execution_count": 164, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'boleh dtg pukul lapan esok tak atau minggu depan ? dua Oktober dua ribu sembilan belas pukul empat belas , tolong bayar tiga ribu dua ratus ringgit sekali tau',\n", " 'date': {},\n", " 'money': {}}" ] }, "execution_count": 164, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize(string, normalize_entity = False)" ] }, { "cell_type": "code", "execution_count": 165, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'normalize': 'boleh dtg pukul 08 esok tak atau minggu depan ? 
02/10/2019 pukul 14 , tolong bayar rm 3.2k sekali tau',\n", " 'date': {'minggu depan': datetime.datetime(2023, 10, 20, 14, 10, 18, 796023),\n", " '8AM esok': datetime.datetime(2023, 10, 14, 8, 0),\n", " '2 oktober 2019 2pm': datetime.datetime(2019, 10, 2, 14, 0)},\n", " 'money': {'rm 3.2k': 'RM3200.0'}}" ] }, "execution_count": 165, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalizer.normalize(string, normalize_date = False, normalize_time = False, normalize_money = False,\n", " normalize_cardinal = False, normalize_ordinal = False)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }