Entities Recognition

Note

This tutorial is available as an IPython notebook here.

%%time
import malaya
CPU times: user 6.53 s, sys: 1.66 s, total: 8.19 s
Wall time: 13.1 s

BERT model

BERT is the best NER model in terms of accuracy; you can check the NER accuracy figures at https://malaya.readthedocs.io/en/latest/Accuracy.html#entities-recognition. The question is, why BERT?

  1. A Transformer model learns the context of a word from all of its surroundings in the live input string, bidirectionally, so it understands left-hand and right-hand relationships much better.
  2. Because the Transformer can leverage context from the live input string, we do not need to store every word that exists. Instead, we capture substrings (subword units) and build attention over them, so BERT does not suffer from the out-of-vocabulary problem.
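To make the substring idea in point 2 concrete, here is a minimal sketch of greedy longest-match subword splitting, in the spirit of WordPiece. The function name `subword_tokenize`, the toy vocabulary and the `##` continuation prefix are assumptions chosen for illustration; this is not Malaya's tokenizer or its real vocabulary.

```python
def subword_tokenize(word, vocab, unk="[UNK]"):
    """Greedily split a word into the longest pieces found in vocab.

    Pieces after the first carry a '##' prefix to mark a continuation.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # Nothing matched at this position: fall back to the unknown token.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary; a real one holds tens of thousands of pieces.
toy_vocab = {"meng", "##antuk", "mem", "##andu"}

print(subword_tokenize("mengantuk", toy_vocab))
print(subword_tokenize("memandu", toy_vocab))
```

Neither "mengantuk" nor "memandu" is in the toy vocabulary as a whole word, yet both are still represented by known pieces. Real subword vocabularies typically also include single-character pieces, so the unknown-token fallback is rarely reached.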

List available BERT NER models

malaya.entity.available_bert_model()
['multilanguage', 'base', 'small']

Describe supported entities

malaya.describe_entities()
OTHER - Other
law - law, regulation, related law documents, documents, etc
location - location, place
organization - organization, company, government, facilities, etc
person - person, group of people, believes, etc
quantity - numbers, quantity
time - date, day, time, etc
event - unique event happened, etc
string = 'KUALA LUMPUR: Sempena sambutan Aidilfitri minggu depan, Perdana Menteri Tun Dr Mahathir Mohamad dan Menteri Pengangkutan Anthony Loke Siew Fook menitipkan pesanan khas kepada orang ramai yang mahu pulang ke kampung halaman masing-masing. Dalam video pendek terbitan Jabatan Keselamatan Jalan Raya (JKJR) itu, Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar  sekiranya mengantuk ketika memandu.'

Load BERT models

model = malaya.entity.bert(model = 'base')
model.predict(string)
[('Kuala', 'location'),
 ('Lumpur', 'location'),
 ('Sempena', 'OTHER'),
 ('sambutan', 'OTHER'),
 ('Aidilfitri', 'OTHER'),
 ('minggu', 'OTHER'),
 ('depan', 'OTHER'),
 ('Perdana', 'person'),
 ('Menteri', 'person'),
 ('Tun', 'person'),
 ('Dr', 'person'),
 ('Mahathir', 'person'),
 ('Mohamad', 'person'),
 ('dan', 'OTHER'),
 ('Menteri', 'person'),
 ('Pengangkutan', 'person'),
 ('Anthony', 'person'),
 ('Loke', 'person'),
 ('Siew', 'person'),
 ('Fook', 'person'),
 ('menitipkan', 'OTHER'),
 ('pesanan', 'OTHER'),
 ('khas', 'OTHER'),
 ('kepada', 'OTHER'),
 ('orang', 'OTHER'),
 ('ramai', 'OTHER'),
 ('yang', 'OTHER'),
 ('mahu', 'OTHER'),
 ('pulang', 'OTHER'),
 ('ke', 'OTHER'),
 ('kampung', 'OTHER'),
 ('halaman', 'location'),
 ('masing-masing', 'OTHER'),
 ('Dalam', 'OTHER'),
 ('video', 'OTHER'),
 ('pendek', 'OTHER'),
 ('terbitan', 'OTHER'),
 ('Jabatan', 'organization'),
 ('Keselamatan', 'organization'),
 ('Jalan', 'organization'),
 ('Raya', 'organization'),
 ('(Jkjr)', 'organization'),
 ('itu', 'OTHER'),
 ('Dr', 'person'),
 ('Mahathir', 'person'),
 ('menasihati', 'OTHER'),
 ('mereka', 'OTHER'),
 ('supaya', 'OTHER'),
 ('berhenti', 'OTHER'),
 ('berehat', 'OTHER'),
 ('dan', 'OTHER'),
 ('tidur', 'OTHER'),
 ('sebentar', 'OTHER'),
 ('sekiranya', 'OTHER'),
 ('mengantuk', 'OTHER'),
 ('ketika', 'OTHER'),
 ('memandu', 'OTHER')]
model.analyze(string)
{'words': ['Kuala',
  'Lumpur',
  'Sempena',
  'sambutan',
  'Aidilfitri',
  'minggu',
  'depan',
  'Perdana',
  'Menteri',
  'Tun',
  'Dr',
  'Mahathir',
  'Mohamad',
  'dan',
  'Menteri',
  'Pengangkutan',
  'Anthony',
  'Loke',
  'Siew',
  'Fook',
  'menitipkan',
  'pesanan',
  'khas',
  'kepada',
  'orang',
  'ramai',
  'yang',
  'mahu',
  'pulang',
  'ke',
  'kampung',
  'halaman',
  'masing-masing',
  'Dalam',
  'video',
  'pendek',
  'terbitan',
  'Jabatan',
  'Keselamatan',
  'Jalan',
  'Raya',
  '(Jkjr)',
  'itu',
  'Dr',
  'Mahathir',
  'menasihati',
  'mereka',
  'supaya',
  'berhenti',
  'berehat',
  'dan',
  'tidur',
  'sebentar',
  'sekiranya',
  'mengantuk',
  'ketika',
  'memandu'],
 'tags': [{'text': 'Kuala Lumpur',
   'type': 'location',
   'score': 1.0,
   'beginOffset': 0,
   'endOffset': 1},
  {'text': 'Sempena sambutan Aidilfitri minggu depan',
   'type': 'OTHER',
   'score': 1.0,
   'beginOffset': 2,
   'endOffset': 6},
  {'text': 'Perdana Menteri Tun Dr Mahathir Mohamad',
   'type': 'person',
   'score': 1.0,
   'beginOffset': 7,
   'endOffset': 12},
  {'text': 'dan',
   'type': 'OTHER',
   'score': 1.0,
   'beginOffset': 13,
   'endOffset': 13},
  {'text': 'Menteri Pengangkutan Anthony Loke Siew Fook',
   'type': 'person',
   'score': 1.0,
   'beginOffset': 14,
   'endOffset': 19},
  {'text': 'menitipkan pesanan khas kepada orang ramai yang mahu pulang ke kampung halaman masing-masing Dalam video pendek terbitan',
   'type': 'OTHER',
   'score': 1.0,
   'beginOffset': 20,
   'endOffset': 36},
  {'text': 'Jabatan Keselamatan Jalan Raya (Jkjr)',
   'type': 'organization',
   'score': 1.0,
   'beginOffset': 37,
   'endOffset': 41},
  {'text': 'itu',
   'type': 'OTHER',
   'score': 1.0,
   'beginOffset': 42,
   'endOffset': 42},
  {'text': 'Dr Mahathir',
   'type': 'person',
   'score': 1.0,
   'beginOffset': 43,
   'endOffset': 44}]}
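The relationship between the predict output (one label per word) and the analyze output (one entry per run of words) can be sketched as follows. This is an assumption about the general idea only: the helper name `group_entities` is hypothetical, it omits the score field, and it is not how Malaya implements analyze. Offsets here are word indices, matching the beginOffset and endOffset values shown above.

```python
from itertools import groupby


def group_entities(tagged):
    """Merge consecutive (word, label) pairs that share a label into spans."""
    words = [word for word, _ in tagged]
    tags = []
    offset = 0
    # groupby yields one group per run of identical consecutive labels.
    for label, run in groupby(tagged, key=lambda pair: pair[1]):
        run_words = [word for word, _ in run]
        tags.append({
            "text": " ".join(run_words),
            "type": label,
            "beginOffset": offset,                      # index of first word
            "endOffset": offset + len(run_words) - 1,   # index of last word
        })
        offset += len(run_words)
    return {"words": words, "tags": tags}


sample = [
    ("Kuala", "location"),
    ("Lumpur", "location"),
    ("Sempena", "OTHER"),
    ("Dr", "person"),
    ("Mahathir", "person"),
]
result = group_entities(sample)
print(result["tags"])
```

This sketch emits every run, including a trailing one, so its output may not match the library's output entry for entry.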

Load general Malaya entity model

This model is able to classify:

  1. date
  2. money
  3. temperature
  4. distance
  5. volume
  6. duration
  7. phone
  8. email
  9. url
  10. time
  11. datetime
  12. local and generic foods; the available rules are in malaya.texts._food
  13. local and generic drinks; the available rules are in malaya.texts._food

We can plug in BERT or any deep learning model by calling malaya.entity.general_entity(model = model), as long as the model has a predict method that returns [(string, label), (string, label)]. This is optional.
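The only contract stated above is a predict method that returns a list of (string, label) pairs. As a minimal sketch of an object that satisfies it, here is a hypothetical dictionary-based tagger; the class name `KeywordTagger` and its lexicon are illustrative assumptions, and whether general_entity needs anything beyond this contract is not confirmed here.

```python
class KeywordTagger:
    """Toy model exposing predict(string) -> [(word, label), ...]."""

    def __init__(self, lexicon, default="OTHER"):
        # Lower-case the keys so lookups are case-insensitive.
        self._lexicon = {key.lower(): label for key, label in lexicon.items()}
        self._default = default

    def predict(self, string):
        # Split on whitespace and label each word from the lexicon.
        return [
            (word, self._lexicon.get(word.lower(), self._default))
            for word in string.split()
        ]


tagger = KeywordTagger({"husein": "person", "kfc": "organization"})
tagged = tagger.predict("Husein makan dekat kfc")
print(tagged)

# Per the description above, such an object would then be passed as:
# entity = malaya.entity.general_entity(model = tagger)
```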

model = malaya.entity.bert(model = 'small')
entity = malaya.entity.general_entity(model = model)
entity.predict('Husein baca buku Perlembagaan yang berharga 3k ringgit dekat kfc sungai petani minggu lepas, 2 ptg 2 oktober 2019 , suhu 32 celcius, sambil makan ayam goreng dan milo o ais')
{'person': ['Husein', 'Perlembagaan'],
 'OTHER': ['baca buku',
  'yang berharga 3k ringgit dekat',
  'minggu lepas',
  'suhu 32 celcius sambil makan ayam goreng dan milo o ais'],
 'organization': ['kfc'],
 'location': ['sungai petani'],
 'time': {'2 PM 2 oktober 2019': datetime.datetime(2019, 10, 2, 14, 0),
  '2 PM': datetime.datetime(2019, 9, 18, 14, 0)},
 'date': {'2 oktober 2019': datetime.datetime(2019, 10, 2, 0, 0),
  'minggu lalu': datetime.datetime(2019, 9, 11, 23, 50, 9, 945045)},
 'money': {'3k ringgit': 'RM3000.0'},
 'temperature': ['32 celcius'],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'datetime': {'2 ptg 2 oktober 2019': datetime.datetime(2019, 10, 2, 14, 0)},
 'food': ['ayam goreng'],
 'drink': ['milo o ais']}
entity.predict('contact Husein at husein.zol05@gmail.com')
{'OTHER': ['contact', 'at', 'zol05 gmail com'],
 'person': ['Husein', 'husein'],
 'date': {},
 'money': {},
 'temperature': [],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': ['husein.zol05@gmail.com'],
 'url': [],
 'time': {},
 'datetime': {},
 'food': [],
 'drink': []}
entity.predict('tolong tempahkan meja makan makan nasi dagang dan jus apple, milo tarik esok dekat Restoran Sebulek')
{'OTHER': ['tolong tempahkan meja makan makan nasi dagang dan jus',
  'milo tarik esok dekat'],
 'organization': ['apple'],
 'location': ['Restoran Sebulek'],
 'date': {'esok': datetime.datetime(2019, 9, 19, 23, 51, 38, 859305)},
 'money': {},
 'temperature': [],
 'distance': [],
 'volume': [],
 'duration': [],
 'phone': [],
 'email': [],
 'url': [],
 'time': {},
 'datetime': {},
 'food': ['nasi dagang'],
 'drink': ['milo tarik', 'jus apple']}
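Fields such as email and money in the outputs above look rule-based rather than learned. The sketch below shows how rules of that kind could be written with regular expressions. The patterns and the helper name `extract_rules` are assumptions for illustration, not Malaya's actual rules; the 'RM3000.0' formatting simply mirrors the output shown above.

```python
import re

# Simplified patterns, assumed for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
MONEY_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(k)?\s*ringgit", re.IGNORECASE)


def extract_rules(text):
    """Return emails and normalised ringgit amounts found in text."""
    emails = EMAIL_RE.findall(text)
    money = {}
    for match in MONEY_RE.finditer(text):
        # A trailing 'k' multiplies the amount by one thousand.
        amount = float(match.group(1)) * (1000 if match.group(2) else 1)
        money[match.group(0)] = "RM" + str(amount)
    return {"email": emails, "money": money}


print(extract_rules("buku yang berharga 3k ringgit"))
print(extract_rules("contact Husein at husein.zol05@gmail.com"))
```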

List available deep learning models

malaya.entity.available_deep_model()
['concat', 'bahdanau', 'luong']

Load deep learning models

for i in malaya.entity.available_deep_model():
    print('Testing %s model'%(i))
    model = malaya.entity.deep_model(i)
    print(model.predict(string))
    print()
Testing concat model
downloading frozen /Users/huseinzol/Malaya/entity/concat model
19.0MB [00:03, 5.98MB/s]
[('KUALA', 'location'), ('LUMPUR', 'location'), ('Sempena', 'OTHER'), ('sambutan', 'OTHER'), ('Aidilfitri', 'time'), ('minggu', 'time'), ('depan', 'time'), ('Perdana', 'person'), ('Menteri', 'person'), ('Tun', 'person'), ('Dr', 'person'), ('Mahathir', 'person'), ('Mohamad', 'person'), ('dan', 'OTHER'), ('Menteri', 'person'), ('Pengangkutan', 'person'), ('Anthony', 'person'), ('Loke', 'person'), ('Siew', 'person'), ('Fook', 'person'), ('menitipkan', 'person'), ('pesanan', 'OTHER'), ('khas', 'OTHER'), ('kepada', 'OTHER'), ('orang', 'OTHER'), ('ramai', 'OTHER'), ('yang', 'OTHER'), ('mahu', 'OTHER'), ('pulang', 'OTHER'), ('ke', 'OTHER'), ('kampung', 'location'), ('halaman', 'location'), ('masing-masing', 'OTHER'), ('Dalam', 'OTHER'), ('video', 'OTHER'), ('pendek', 'OTHER'), ('terbitan', 'OTHER'), ('Jabatan', 'organization'), ('Keselamatan', 'organization'), ('Jalan', 'organization'), ('Raya', 'organization'), ('(JKJR)', 'location'), ('itu', 'OTHER'), ('Dr', 'person'), ('Mahathir', 'person'), ('menasihati', 'OTHER'), ('mereka', 'OTHER'), ('supaya', 'OTHER'), ('berhenti', 'OTHER'), ('berehat', 'person'), ('dan', 'OTHER'), ('tidur', 'OTHER'), ('sebentar', 'OTHER'), ('sekiranya', 'OTHER'), ('mengantuk', 'OTHER'), ('ketika', 'OTHER'), ('memandu', 'OTHER')]

Testing bahdanau model
[('KUALA', 'location'), ('LUMPUR', 'location'), ('Sempena', 'OTHER'), ('sambutan', 'OTHER'), ('Aidilfitri', 'location'), ('minggu', 'time'), ('depan', 'time'), ('Perdana', 'location'), ('Menteri', 'person'), ('Tun', 'person'), ('Dr', 'person'), ('Mahathir', 'person'), ('Mohamad', 'person'), ('dan', 'OTHER'), ('Menteri', 'person'), ('Pengangkutan', 'person'), ('Anthony', 'person'), ('Loke', 'person'), ('Siew', 'person'), ('Fook', 'person'), ('menitipkan', 'person'), ('pesanan', 'OTHER'), ('khas', 'OTHER'), ('kepada', 'OTHER'), ('orang', 'OTHER'), ('ramai', 'OTHER'), ('yang', 'OTHER'), ('mahu', 'OTHER'), ('pulang', 'OTHER'), ('ke', 'OTHER'), ('kampung', 'location'), ('halaman', 'OTHER'), ('masing-masing', 'OTHER'), ('Dalam', 'OTHER'), ('video', 'OTHER'), ('pendek', 'OTHER'), ('terbitan', 'OTHER'), ('Jabatan', 'organization'), ('Keselamatan', 'organization'), ('Jalan', 'organization'), ('Raya', 'organization'), ('(JKJR)', 'OTHER'), ('itu', 'OTHER'), ('Dr', 'person'), ('Mahathir', 'person'), ('menasihati', 'OTHER'), ('mereka', 'OTHER'), ('supaya', 'OTHER'), ('berhenti', 'OTHER'), ('berehat', 'OTHER'), ('dan', 'OTHER'), ('tidur', 'OTHER'), ('sebentar', 'OTHER'), ('sekiranya', 'OTHER'), ('mengantuk', 'location'), ('ketika', 'OTHER'), ('memandu', 'OTHER')]

Testing luong model
[('KUALA', 'location'), ('LUMPUR', 'location'), ('Sempena', 'OTHER'), ('sambutan', 'OTHER'), ('Aidilfitri', 'organization'), ('minggu', 'time'), ('depan', 'time'), ('Perdana', 'person'), ('Menteri', 'person'), ('Tun', 'person'), ('Dr', 'person'), ('Mahathir', 'person'), ('Mohamad', 'person'), ('dan', 'OTHER'), ('Menteri', 'person'), ('Pengangkutan', 'person'), ('Anthony', 'person'), ('Loke', 'person'), ('Siew', 'person'), ('Fook', 'person'), ('menitipkan', 'OTHER'), ('pesanan', 'OTHER'), ('khas', 'OTHER'), ('kepada', 'OTHER'), ('orang', 'OTHER'), ('ramai', 'OTHER'), ('yang', 'OTHER'), ('mahu', 'OTHER'), ('pulang', 'OTHER'), ('ke', 'OTHER'), ('kampung', 'location'), ('halaman', 'location'), ('masing-masing', 'OTHER'), ('Dalam', 'OTHER'), ('video', 'OTHER'), ('pendek', 'OTHER'), ('terbitan', 'OTHER'), ('Jabatan', 'organization'), ('Keselamatan', 'organization'), ('Jalan', 'organization'), ('Raya', 'organization'), ('(JKJR)', 'location'), ('itu', 'OTHER'), ('Dr', 'person'), ('Mahathir', 'person'), ('menasihati', 'OTHER'), ('mereka', 'OTHER'), ('supaya', 'OTHER'), ('berhenti', 'OTHER'), ('berehat', 'OTHER'), ('dan', 'OTHER'), ('tidur', 'OTHER'), ('sebentar', 'OTHER'), ('sekiranya', 'OTHER'), ('mengantuk', 'organization'), ('ketika', 'OTHER'), ('memandu', 'OTHER')]
bahdanau = malaya.entity.deep_model('bahdanau')
bahdanau.analyze(string)
{'words': ['KUALA',
  'LUMPUR',
  'Sempena',
  'sambutan',
  'Aidilfitri',
  'minggu',
  'depan',
  'Perdana',
  'Menteri',
  'Tun',
  'Dr',
  'Mahathir',
  'Mohamad',
  'dan',
  'Menteri',
  'Pengangkutan',
  'Anthony',
  'Loke',
  'Siew',
  'Fook',
  'menitipkan',
  'pesanan',
  'khas',
  'kepada',
  'orang',
  'ramai',
  'yang',
  'mahu',
  'pulang',
  'ke',
  'kampung',
  'halaman',
  'masing-masing',
  'Dalam',
  'video',
  'pendek',
  'terbitan',
  'Jabatan',
  'Keselamatan',
  'Jalan',
  'Raya',
  '(JKJR)',
  'itu',
  'Dr',
  'Mahathir',
  'menasihati',
  'mereka',
  'supaya',
  'berhenti',
  'berehat',
  'dan',
  'tidur',
  'sebentar',
  'sekiranya',
  'mengantuk',
  'ketika',
  'memandu'],
 'tags': [{'text': 'KUALA LUMPUR',
   'type': 'location',
   'score': 1.0,
   'beginOffset': 0,
   'endOffset': 1},
  {'text': 'Sempena sambutan',
   'type': 'OTHER',
   'score': 1.0,
   'beginOffset': 2,
   'endOffset': 3},
  {'text': 'Aidilfitri',
   'type': 'event',
   'score': 1.0,
   'beginOffset': 4,
   'endOffset': 4},
  {'text': 'minggu depan',
   'type': 'time',
   'score': 1.0,
   'beginOffset': 5,
   'endOffset': 6},
  {'text': 'Perdana Menteri Tun Dr Mahathir Mohamad',
   'type': 'person',
   'score': 1.0,
   'beginOffset': 7,
   'endOffset': 12},
  {'text': 'dan',
   'type': 'OTHER',
   'score': 1.0,
   'beginOffset': 13,
   'endOffset': 13},
  {'text': 'Menteri Pengangkutan Anthony Loke Siew Fook',
   'type': 'person',
   'score': 1.0,
   'beginOffset': 14,
   'endOffset': 19},
  {'text': 'menitipkan pesanan khas kepada orang ramai yang mahu pulang ke',
   'type': 'OTHER',
   'score': 1.0,
   'beginOffset': 20,
   'endOffset': 29},
  {'text': 'kampung',
   'type': 'location',
   'score': 1.0,
   'beginOffset': 30,
   'endOffset': 30},
  {'text': 'halaman masing-masing Dalam video pendek terbitan',
   'type': 'OTHER',
   'score': 1.0,
   'beginOffset': 31,
   'endOffset': 36},
  {'text': 'Jabatan Keselamatan Jalan Raya',
   'type': 'organization',
   'score': 1.0,
   'beginOffset': 37,
   'endOffset': 40},
  {'text': '(JKJR)',
   'type': 'person',
   'score': 1.0,
   'beginOffset': 41,
   'endOffset': 41},
  {'text': 'itu',
   'type': 'OTHER',
   'score': 1.0,
   'beginOffset': 42,
   'endOffset': 42},
  {'text': 'Dr Mahathir',
   'type': 'person',
   'score': 1.0,
   'beginOffset': 43,
   'endOffset': 44},
  {'text': 'menasihati mereka supaya',
   'type': 'OTHER',
   'score': 1.0,
   'beginOffset': 45,
   'endOffset': 47},
  {'text': 'berhenti berehat',
   'type': 'person',
   'score': 1.0,
   'beginOffset': 48,
   'endOffset': 49}]}

Voting stack model

bahdanau = malaya.entity.deep_model('bahdanau')
luong = malaya.entity.deep_model('luong')
bert = malaya.entity.bert('base')
malaya.stack.voting_stack([bert, bahdanau, luong], string)
[('KUALA', 'location'),
 ('LUMPUR', 'location'),
 ('Sempena', 'OTHER'),
 ('sambutan', 'OTHER'),
 ('Aidilfitri', 'organization'),
 ('minggu', 'time'),
 ('depan', 'time'),
 ('Perdana', 'person'),
 ('Menteri', 'person'),
 ('Tun', 'person'),
 ('Dr', 'person'),
 ('Mahathir', 'person'),
 ('Mohamad', 'person'),
 ('dan', 'OTHER'),
 ('Menteri', 'person'),
 ('Pengangkutan', 'person'),
 ('Anthony', 'person'),
 ('Loke', 'person'),
 ('Siew', 'person'),
 ('Fook', 'person'),
 ('menitipkan', 'OTHER'),
 ('pesanan', 'OTHER'),
 ('khas', 'OTHER'),
 ('kepada', 'OTHER'),
 ('orang', 'OTHER'),
 ('ramai', 'OTHER'),
 ('yang', 'OTHER'),
 ('mahu', 'OTHER'),
 ('pulang', 'OTHER'),
 ('ke', 'OTHER'),
 ('kampung', 'location'),
 ('halaman', 'location'),
 ('masing-masing', 'OTHER'),
 ('Dalam', 'OTHER'),
 ('video', 'OTHER'),
 ('pendek', 'OTHER'),
 ('terbitan', 'OTHER'),
 ('Jabatan', 'organization'),
 ('Keselamatan', 'organization'),
 ('Jalan', 'organization'),
 ('Raya', 'organization'),
 ('(JKJR)', 'person'),
 ('itu', 'OTHER'),
 ('Dr', 'person'),
 ('Mahathir', 'person'),
 ('menasihati', 'OTHER'),
 ('mereka', 'OTHER'),
 ('supaya', 'OTHER'),
 ('berhenti', 'OTHER'),
 ('berehat', 'OTHER'),
 ('dan', 'OTHER'),
 ('tidur', 'OTHER'),
 ('sebentar', 'OTHER'),
 ('sekiranya', 'OTHER'),
 ('mengantuk', 'OTHER'),
 ('ketika', 'OTHER'),
 ('memandu', 'OTHER')]
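The idea behind a voting stack can be sketched as a per-word majority vote over the predictions of several models. The function name `majority_vote` is hypothetical, and the tie-breaking rule (the earliest model in the list wins) is an assumption of this sketch; malaya.stack.voting_stack may resolve ties differently, so results need not match the output above.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine several [(word, label), ...] lists by per-word majority."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all models must label the same number of words")
    voted = []
    for per_word in zip(*predictions):
        word = per_word[0][0]  # take the surface form from the first model
        counts = Counter(label for _, label in per_word)
        # most_common keeps first-seen order among equal counts,
        # so a tie goes to the earliest model in the list.
        voted.append((word, counts.most_common(1)[0][0]))
    return voted


model_a = [("Aidilfitri", "OTHER"), ("minggu", "OTHER")]
model_b = [("Aidilfitri", "location"), ("minggu", "time")]
model_c = [("Aidilfitri", "organization"), ("minggu", "time")]

voted = majority_vote([model_a, model_b, model_c])
print(voted)
```

In this toy input, "minggu" gets "time" by two votes to one, while "Aidilfitri" is a three-way tie and falls back to the first model's label.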