Extract general Malaya entities#

This tutorial is available as an IPython notebook at Malaya/example/general-malaya-entities.

This module uses only regular expressions (regex) to extract entities.

[1]:
import logging

logging.basicConfig(level=logging.INFO)
[2]:
%%time
import malaya
INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmppjdv8tfx
INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmppjdv8tfx/_remote_module_non_scriptable.py
CPU times: user 2.91 s, sys: 3.7 s, total: 6.61 s
Wall time: 2.13 s
/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))

Load general Malaya entity model#

This model can classify the following entity types:

  1. date

  2. money

  3. temperature

  4. distance

  5. volume

  6. duration

  7. phone

  8. email

  9. url

  10. time

  11. datetime

  12. local and generic foods; the available rules can be checked in malaya.texts.entity.food

  13. local and generic drinks; the available rules can be checked in malaya.texts.entity.food

We can plug in BERT or any other deep learning model by calling malaya.entity.general_entity(model = model), as long as the model has a predict method that returns [(string, label), (string, label)]. This is optional.
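As a rough illustration of the interface described above, here is a minimal sketch of an object that would satisfy it. `KeywordTagger` and its lookup table are hypothetical and not part of Malaya. The assumption that `predict` receives the raw string is also ours, so check the Malaya source before relying on it.

```python
from typing import Dict, List, Tuple


class KeywordTagger:
    """Hypothetical stand-in for a deep learning tagger (not part of Malaya)."""

    def __init__(self, lookup: Dict[str, str]):
        # Map lower-cased tokens to labels, e.g. {'husein': 'person'}.
        self.lookup = {token.lower(): label for token, label in lookup.items()}

    def predict(self, string: str) -> List[Tuple[str, str]]:
        # One (token, label) pair per whitespace-separated token;
        # tokens missing from the lookup get the label 'OTHER'.
        return [
            (token, self.lookup.get(token.lower(), 'OTHER'))
            for token in string.split()
        ]


tagger = KeywordTagger({'husein': 'person', 'kfc': 'organization'})
print(tagger.predict('Husein makan dekat kfc'))

# Assumed usage, following the text above:
# entity = malaya.entity.general_entity(model = tagger)
```

Any object with a compatible `predict` method should work in the same way; the class name and the labels here are placeholders.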

[3]:
entity = malaya.entity.general_entity()

Examples#

[4]:
entity.predict('Husein baca buku Perlembagaan yang berharga 3k ringgit dekat kfc sungai petani minggu lepas, 2 ptg 2 oktober 2019 , suhu 32 celcius, sambil makan ayam goreng dan milo o ais')
[4]:
{'date': {'2 oktober 2019': datetime.datetime(2019, 10, 2, 0, 0),
  'minggu lalu': datetime.datetime(2023, 10, 5, 15, 43, 46, 99837)},
 'money': {'3k ringgit': 'RM3000.0'},
 'temperature': ['32 celcius'],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'time': {'2 PM': datetime.datetime(2023, 10, 12, 14, 0)},
 'datetime': {'2 ptg 2 oktober 2019': datetime.datetime(2019, 10, 2, 14, 0)},
 'food': ['ayam goreng'],
 'drink': ['milo o ais'],
 'weight': []}
[5]:
entity.predict('contact Husein at husein.zol05@gmail.com')
[5]:
{'date': {},
 'money': {},
 'temperature': [],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': ['husein.zol05@gmail.com'],
 'url': [],
 'time': {},
 'datetime': {},
 'food': [],
 'drink': [],
 'weight': []}
[6]:
entity.predict('tolong tempahkan meja makan makan nasi dagang dan jus apple, milo tarik esok dekat Restoran Sebulek')
[6]:
{'date': {'esok': datetime.datetime(2023, 10, 13, 15, 43, 46, 144429)},
 'money': {},
 'temperature': [],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'time': {},
 'datetime': {},
 'food': ['nasi dagang'],
 'drink': ['milo tarik', 'jus apple'],
 'weight': []}
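
Every call returns all labels, most of them empty. A small helper like the sketch below keeps only the labels that matched. `non_empty` is a hypothetical name, and the dictionary is copied from the last output above so the snippet runs without Malaya installed.

```python
import datetime

# Result copied from the 'nasi dagang' example above.
result = {
    'date': {'esok': datetime.datetime(2023, 10, 13, 15, 43, 46, 144429)},
    'money': {},
    'temperature': [],
    'distance': [],
    'volume': [],
    'duration': [],
    'phone': [],
    'email': [],
    'url': [],
    'time': {},
    'datetime': {},
    'food': ['nasi dagang'],
    'drink': ['milo tarik', 'jus apple'],
    'weight': [],
}


def non_empty(entities):
    """Drop labels whose value is an empty dict or an empty list."""
    return {label: found for label, found in entities.items() if found}


print(sorted(non_empty(result)))
# ['date', 'drink', 'food']
```

Note that date, money, time and datetime map the matched text to a parsed value, while the other labels are plain lists of matched text.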

date#

[7]:
entity.predict('husein balik rumah pada 2/12/2022')
[7]:
{'date': {'2/12/2022': datetime.datetime(2022, 2, 12, 0, 0)},
 'money': {},
 'temperature': [],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'time': {},
 'datetime': {},
 'food': [],
 'drink': [],
 'weight': []}
[8]:
entity.predict('husein balik rumah pada 2 jan 2022')
[8]:
{'date': {'2 jan 2022': datetime.datetime(2022, 1, 2, 0, 0)},
 'money': {},
 'temperature': [],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'time': {},
 'datetime': {},
 'food': [],
 'drink': [],
 'weight': []}
[9]:
entity.predict('husein balik rumah pada 2022 mac 2')
[9]:
{'date': {'2022 mac 2': datetime.datetime(2022, 3, 2, 0, 0)},
 'money': {},
 'temperature': [],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'time': {},
 'datetime': {},
 'food': [],
 'drink': [],
 'weight': []}
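
In the first example of this section, '2/12/2022' resolves to 12 February 2022, so the numeric form appears to be read month first. If your data is written day first, it may be worth parsing such strings yourself. The sketch below uses only the standard library to show the two readings; it does not reproduce Malaya's internal parsing.

```python
from datetime import datetime

raw = '2/12/2022'

# Month-first reading, matching the output shown above.
month_first = datetime.strptime(raw, '%m/%d/%Y')

# Day-first reading, common in Malaysian usage.
day_first = datetime.strptime(raw, '%d/%m/%Y')

print(month_first)  # 2022-02-12 00:00:00
print(day_first)    # 2022-12-02 00:00:00
```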

money#

[10]:
entity.predict('harga buku 2 ringgit')
[10]:
{'date': {},
 'money': {'2 ringgit': 'RM2'},
 'temperature': [],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'time': {},
 'datetime': {},
 'food': [],
 'drink': [],
 'weight': []}
[11]:
entity.predict('harga buku rm2.50 sen')
[11]:
{'date': {},
 'money': {'rm2.50 ': 'RM2.50'},
 'temperature': [],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'time': {},
 'datetime': {},
 'food': [],
 'drink': [],
 'weight': []}
[12]:
entity.predict('harga buku 5.34k ringgit')
[12]:
{'date': {},
 'money': {'5.34k ringgit': 'RM5340.0'},
 'temperature': [],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'time': {},
 'datetime': {},
 'food': [],
 'drink': [],
 'weight': []}
[13]:
entity.predict('harga buku 5.34m ringgit')
[13]:
{'date': {},
 'money': {'5.34m ringgit': 'RM5340000.0'},
 'temperature': [],
 'distance': ['5.34m'],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'time': {},
 'datetime': {},
 'food': [],
 'drink': [],
 'weight': []}
[14]:
entity.predict('harga buku 5.34b ringgit')
[14]:
{'date': {},
 'money': {'5.34b ringgit': 'RM5340000000.0'},
 'temperature': [],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'time': {},
 'datetime': {},
 'food': [],
 'drink': [],
 'weight': []}
[15]:
entity.predict('harga buku rm 5.2')
[15]:
{'date': {},
 'money': {'rm 5.2': 'RM5.2'},
 'temperature': [],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'time': {},
 'datetime': {},
 'food': [],
 'drink': [],
 'weight': []}
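
The outputs above suggest that the suffixes k, m and b are expanded to thousand, million and billion. The sketch below shows one way such a rule could be written with a regex. It is not Malaya's actual pattern, it only handles amounts followed by the word 'ringgit', and `parse_ringgit` is a hypothetical name.

```python
import re

# Assumed suffix multipliers, inferred from the outputs above.
MULTIPLIERS = {'k': 1_000, 'm': 1_000_000, 'b': 1_000_000_000}

# A number, an optional k/m/b suffix, then the word 'ringgit'.
RINGGIT = re.compile(r'(\d+(?:\.\d+)?)\s*([kmb])?\s*ringgit', re.IGNORECASE)


def parse_ringgit(text):
    """Return the amount in ringgit as a float, or None if nothing matches."""
    match = RINGGIT.search(text)
    if match is None:
        return None
    amount = float(match.group(1))
    suffix = match.group(2)
    if suffix:
        amount *= MULTIPLIERS[suffix.lower()]
    return amount


for sentence in ('harga buku 2 ringgit',
                 'harga buku 5.34k ringgit',
                 'harga buku 5.34m ringgit'):
    print(sentence, '->', parse_ringgit(sentence))
```

Because the rules are independent regexes, one span can match more than one label: in the '5.34m ringgit' output above, '5.34m' is also reported under distance.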

temperature#

[16]:
entity.predict('suhu harini 21.3c')
[16]:
{'date': {},
 'money': {},
 'temperature': ['21.3c'],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'time': {},
 'datetime': {},
 'food': [],
 'drink': [],
 'weight': []}
[17]:
entity.predict('suhu harini 21.3    c')
[17]:
{'date': {},
 'money': {},
 'temperature': ['21.3 c'],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'time': {},
 'datetime': {},
 'food': [],
 'drink': [],
 'weight': []}

distance#

[18]:
entity.predict('sejauh 10 batu')
[18]:
{'date': {},
 'money': {},
 'temperature': [],
 'distance': ['10 batu'],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'time': {},
 'datetime': {},
 'food': [],
 'drink': [],
 'weight': []}
[19]:
entity.predict('sejauh 10.234    km')
[19]:
{'date': {},
 'money': {},
 'temperature': [],
 'distance': ['10.234 km'],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'time': {},
 'datetime': {},
 'food': [],
 'drink': [],
 'weight': []}
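
In the outputs above, repeated spaces between the number and the unit are collapsed to one ('10.234    km' becomes '10.234 km'). A minimal sketch of that behaviour, assuming whitespace is normalized before matching; the pattern and the unit list are illustrative, not Malaya's own rules.

```python
import re

# Illustrative unit list; 'km' is tried before 'm' so the longer unit wins.
DISTANCE = re.compile(r'\d+(?:\.\d+)?\s?(?:km|batu|m)\b', re.IGNORECASE)


def find_distances(text):
    """Collapse runs of whitespace, then return every number + unit match."""
    collapsed = re.sub(r'\s+', ' ', text).strip()
    return [match.group(0) for match in DISTANCE.finditer(collapsed)]


print(find_distances('sejauh 10 batu'))         # ['10 batu']
print(find_distances('sejauh 10.234    km'))    # ['10.234 km']
```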

volume#

[20]:
entity.predict('volume 21.2ml')
[20]:
{'date': {},
 'money': {},
 'temperature': [],
 'distance': [],
 'volume': ['21.2ml'],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'time': {},
 'datetime': {},
 'food': [],
 'drink': [],
 'weight': []}

duration#

[21]:
entity.predict('duration 2jam')
[21]:
{'date': {},
 'money': {},
 'temperature': [],
 'distance': [],
 'volume': [],
 'duration': ['2jam'],
 'phone': [],
 'email': [],
 'url': [],
 'time': {'2jam': datetime.datetime(2023, 10, 12, 13, 43, 49, 942979)},
 'datetime': {},
 'food': [],
 'drink': [],
 'weight': []}
[22]:
entity.predict('duration sejam')
[22]:
{'date': {},
 'money': {},
 'temperature': [],
 'distance': [],
 'volume': [],
 'duration': ['sejam'],
 'phone': [],
 'email': [],
 'url': [],
 'time': {},
 'datetime': {},
 'food': [],
 'drink': [],
 'weight': []}
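
The duration label returns the matched text only. If a numeric value is needed, it can be derived afterwards. The sketch below converts hour expressions such as '2jam' and 'sejam' (the prefix 'se' means one) into a `timedelta`. `jam_to_timedelta` is a hypothetical helper and covers only the unit 'jam'.

```python
import re
from datetime import timedelta

# Either the prefix 'se' (one) or a number, followed by 'jam' (hour).
HOURS = re.compile(r'^(se|\d+(?:\.\d+)?)\s*jam$')


def jam_to_timedelta(text):
    """Convert '2jam', '2 jam' or 'sejam' to a timedelta, else None."""
    match = HOURS.match(text.strip().lower())
    if match is None:
        return None
    quantity = match.group(1)
    hours = 1.0 if quantity == 'se' else float(quantity)
    return timedelta(hours=hours)


print(jam_to_timedelta('2jam'))   # 2:00:00
print(jam_to_timedelta('sejam'))  # 1:00:00
```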

phone#

[23]:
entity.predict('no telepon 013-1111111')
[23]:
{'date': {},
 'money': {},
 'temperature': [],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': ['013-1111111'],
 'email': [],
 'url': [],
 'time': {},
 'datetime': {},
 'food': [],
 'drink': [],
 'weight': []}

email#

[24]:
entity.predict('email at husein@email.com')
[24]:
{'date': {},
 'money': {},
 'temperature': [],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': ['husein@email.com'],
 'url': [],
 'time': {},
 'datetime': {},
 'food': [],
 'drink': [],
 'weight': []}

url#

[25]:
entity.predict('website di https://huseinhouse.com')
[25]:
{'date': {},
 'money': {},
 'temperature': [],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': ['https://huseinhouse.com'],
 'time': {},
 'datetime': {},
 'food': [],
 'drink': [],
 'weight': []}

time#

[26]:
entity.predict('pada pkul 2')
[26]:
{'date': {},
 'money': {},
 'temperature': [],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'time': {'pukul 2': datetime.datetime(2023, 10, 2, 0, 0)},
 'datetime': {},
 'food': [],
 'drink': [],
 'weight': []}
[27]:
entity.predict('pada pkul 2.14')
[27]:
{'date': {},
 'money': {},
 'temperature': [],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'time': {'pukul 2.14': datetime.datetime(2023, 10, 12, 2, 14)},
 'datetime': {},
 'food': [],
 'drink': [],
 'weight': []}
[28]:
entity.predict('pada pkul 2:58:59')
[28]:
{'date': {},
 'money': {},
 'temperature': [],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'time': {'2:58:59': datetime.datetime(2023, 10, 12, 2, 58, 59),
  'pukul 2:58:59': datetime.datetime(2023, 10, 12, 2, 58, 59)},
 'datetime': {},
 'food': [],
 'drink': [],
 'weight': []}
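
In the last output above, the same time is reported twice, once as '2:58:59' and once as 'pukul 2:58:59'. If only one entry per parsed value is wanted, a post-processing step like the sketch below keeps the longest matched text. `longest_span_per_value` is a hypothetical helper, not a Malaya function.

```python
from datetime import datetime

# 'time' entry copied from the output above.
times = {
    '2:58:59': datetime(2023, 10, 12, 2, 58, 59),
    'pukul 2:58:59': datetime(2023, 10, 12, 2, 58, 59),
}


def longest_span_per_value(found):
    """For each distinct parsed value, keep only the longest matched text."""
    best = {}
    for span, value in found.items():
        if value not in best or len(span) > len(best[value]):
            best[value] = span
    return {span: value for value, span in best.items()}


print(longest_span_per_value(times))
```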

datetime#

[29]:
entity.predict('saya gerak 12/02/2022 14:23:21')
[29]:
{'date': {'12/02/2022': datetime.datetime(2022, 12, 2, 0, 0)},
 'money': {},
 'temperature': [],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'time': {'14:23:21': datetime.datetime(2023, 10, 12, 14, 23, 21)},
 'datetime': {'12/02/2022 14:23:21': datetime.datetime(2022, 12, 2, 14, 23, 21)},
 'food': [],
 'drink': [],
 'weight': []}
[30]:
entity.predict('saya gerak 12/02/2022 2pm')
[30]:
{'date': {'12/02/2022': datetime.datetime(2022, 12, 2, 0, 0)},
 'money': {},
 'temperature': [],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'time': {'2pm': datetime.datetime(2023, 10, 12, 14, 0)},
 'datetime': {'12/02/2022 2pm': datetime.datetime(2022, 12, 2, 14, 0)},
 'food': [],
 'drink': [],
 'weight': []}
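
In these outputs, the values under time carry the date on which the notebook was run (12 October 2023), so only their time of day is meaningful, while the datetime entry combines the parsed date with that time of day. The same combination can be reproduced with the standard library, as in this sketch using the values from the first example in this section.

```python
from datetime import datetime

# Values copied from the '12/02/2022 14:23:21' output above.
date_part = datetime(2022, 12, 2, 0, 0)
time_part = datetime(2023, 10, 12, 14, 23, 21)

# Take the calendar date from one value and the time of day from the other.
combined = datetime.combine(date_part.date(), time_part.time())

print(combined)  # 2022-12-02 14:23:21
```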

local and generic foods#

[31]:
entity.predict('nasi goreng pattaya 1')
[31]:
{'date': {},
 'money': {},
 'temperature': [],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'time': {},
 'datetime': {},
 'food': ['nasi goreng'],
 'drink': [],
 'weight': []}
[32]:
entity.predict('ayam penyet 1')
[32]:
{'date': {},
 'money': {},
 'temperature': [],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'time': {},
 'datetime': {},
 'food': ['ayam penyet'],
 'drink': [],
 'weight': []}

local and generic drinks#

[33]:
entity.predict('teh o ais 1')
[33]:
{'date': {},
 'money': {},
 'temperature': [],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'time': {},
 'datetime': {},
 'food': [],
 'drink': ['teh o ais'],
 'weight': []}
[34]:
entity.predict('teh ice 1')
[34]:
{'date': {},
 'money': {},
 'temperature': [],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'time': {},
 'datetime': {},
 'food': [],
 'drink': ['teh ice'],
 'weight': []}
[35]:
entity.predict('nescafe beng 1')
[35]:
{'date': {},
 'money': {},
 'temperature': [],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'time': {},
 'datetime': {},
 'food': [],
 'drink': ['nescafe beng'],
 'weight': []}
[36]:
entity.predict('jus rambutan 1')
[36]:
{'date': {},
 'money': {},
 'temperature': [],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'time': {},
 'datetime': {},
 'food': [],
 'drink': ['jus rambutan'],
 'weight': []}