{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Entities Recognition" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/entities](https://github.com/huseinzol05/Malaya/tree/master/example/entities).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module was trained only on standard language structure, so it is not safe to use it on local language structure.\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import logging\n", "\n", "logging.basicConfig(level=logging.INFO)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmp0npqw77q\n", "INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmp0npqw77q/_remote_module_non_scriptable.py\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.89 s, sys: 3.49 s, total: 6.38 s\n", "Wall time: 2.3 s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n", "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n" ] } ], "source": [ "%%time\n", "import malaya" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Describe supported entities" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[{'Tag': 'OTHER', 'Description': 'other'},\n", " {'Tag': 'law',\n", " 'Description': 'law, regulation, related law documents, documents, etc'},\n", " {'Tag': 'location', 'Description': 'location, place'},\n", " {'Tag': 'organization',\n", " 'Description': 'organization, company, government, facilities, etc'},\n", " {'Tag': 'person',\n", " 'Description': 'person, group of people, believes, unique arts (eg; food, drink), etc'},\n", " {'Tag': 'quantity', 'Description': 'numbers, quantity'},\n", " {'Tag': 'time', 'Description': 'date, day, time, etc'},\n", " {'Tag': 'event', 'Description': 'unique event happened, etc'}]" ] }, "execution_count": 3, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "malaya.entity.describe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available HuggingFace NER models" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'mesolitica/ner-t5-tiny-standard-bahasa-cased': {'Size (MB)': 84.7,\n", " 'law': {'precision': 0.9642625081221572,\n", " 'recall': 0.9598965071151359,\n", " 'f1': 0.9620745542949757,\n", " 'number': 1546},\n", " 'person': {'precision': 0.9673319980661648,\n", " 'recall': 0.971424608128728,\n", " 'f1': 0.9693739834584906,\n", " 'number': 14418},\n", " 'time': {'precision': 0.9796992481203007,\n", " 'recall': 0.983148893360161,\n", " 'f1': 0.9814210394175245,\n", " 'number': 3976},\n", " 'location': {'precision': 0.966455899689208,\n", " 'recall': 0.9753406878650227,\n", " 'f1': 0.970877967379017,\n", " 'number': 9246},\n", " 'organization': {'precision': 0.9308265342319971,\n", " 'recall': 0.9475204622051036,\n", " 'f1': 0.9390993140471219,\n", " 'number': 8308},\n", " 'quantity': {'precision': 0.9824689554419284,\n", " 'recall': 0.9853479853479854,\n", " 'f1': 0.9839063643013899,\n", " 'number': 2730},\n", " 'event': {'precision': 0.8535980148883374,\n", " 'recall': 0.8973913043478261,\n", " 'f1': 0.8749470114455278,\n", " 'number': 1150},\n", " 'overall_precision': 0.9585080133195985,\n", " 'overall_recall': 0.9670566055977183,\n", " 'overall_f1': 0.9627633336140621,\n", " 'overall_accuracy': 0.9951433495221682},\n", " 'mesolitica/ner-t5-small-standard-bahasa-cased': {'Size (MB)': 141,\n", " 'law': {'precision': 0.9320327249842668,\n", " 'recall': 0.9579560155239327,\n", " 'f1': 0.9448165869218501,\n", " 'number': 1546},\n", " 'person': {'precision': 0.9745341614906833,\n", " 'recall': 0.9794007490636704,\n", " 'f1': 0.976961394769614,\n", " 'number': 14418},\n", " 'time': {'precision': 0.9583539910758553,\n", " 'recall': 0.9723340040241448,\n", " 'f1': 0.9652933832709114,\n", " 
'number': 3976},\n", " 'location': {'precision': 0.9709677419354839,\n", " 'recall': 0.9766385463984426,\n", " 'f1': 0.9737948883856357,\n", " 'number': 9246},\n", " 'organization': {'precision': 0.9493625210488333,\n", " 'recall': 0.9500481463649495,\n", " 'f1': 0.9497052099627,\n", " 'number': 8308},\n", " 'quantity': {'precision': 0.9823008849557522,\n", " 'recall': 0.9758241758241758,\n", " 'f1': 0.9790518191841234,\n", " 'number': 2730},\n", " 'event': {'precision': 0.8669991687448046,\n", " 'recall': 0.9069565217391304,\n", " 'f1': 0.88652783680408,\n", " 'number': 1150},\n", " 'overall_precision': 0.9629220498535133,\n", " 'overall_recall': 0.9691593754531832,\n", " 'overall_f1': 0.9660306446949986,\n", " 'overall_accuracy': 0.9953954840983863}}" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.entity.available_huggingface" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "string = 'KUALA LUMPUR: Sempena sambutan Aidilfitri minggu depan, Perdana Menteri Tun Dr Mahathir Mohamad dan Menteri Pengangkutan Anthony Loke Siew Fook menitipkan pesanan khas kepada orang ramai yang mahu pulang ke kampung halaman masing-masing. 
Dalam video pendek terbitan Jabatan Keselamatan Jalan Raya (JKJR) itu, Dr Mahathir menasihati mereka supaya berhenti berehat dan tidur sebentar sekiranya mengantuk ketika memandu.'\n", "string1 = 'memperkenalkan Husein, dia sangat comel, berumur 25 tahun, bangsa melayu, agama islam, tinggal di cyberjaya malaysia, bercakap bahasa melayu, semua membaca buku undang-undang kewangan, dengar laju Siti Nurhaliza - Seluruh Cinta sambil makan ayam goreng KFC'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load HuggingFace model\n", "\n", "```python\n", "def huggingface(\n", " model: str = 'mesolitica/ner-t5-small-standard-bahasa-cased',\n", " force_check: bool = True,\n", " **kwargs,\n", "):\n", " \"\"\"\n", " Load a HuggingFace model for Entity Recognition.\n", "\n", " Parameters\n", " ----------\n", " model: str, optional (default='mesolitica/ner-t5-small-standard-bahasa-cased')\n", " Check available models at `malaya.entity.available_huggingface`.\n", " force_check: bool, optional (default=True)\n", " Check that the model is one of the Malaya models.\n", " Set to False if you use your own HuggingFace model.\n", "\n", " Returns\n", " -------\n", " result: malaya.torch_model.huggingface.Tagging\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "d4d4a60292184f869008b4924ed4757c", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading (…)okenizer_config.json: 0%| | 0.00/21.2k [00:00