{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# True Case" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/true-case](https://github.com/huseinzol05/Malaya/tree/master/example/true-case).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module trained on both standard and local (included social media) language structures, so it is save to use for both.\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "os.environ['CUDA_VISIBLE_DEVICES'] = ''" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.\n", " warn(\"The installed version of bitsandbytes was compiled without GPU support. \"\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "/home/husein/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32\n", "CPU times: user 2.73 s, sys: 2.3 s, total: 5.03 s\n", "Wall time: 2.54 s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n", "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n" ] } ], "source": [ "%%time\n", "\n", "import malaya" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Explanation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Common third party NLP services like Google Speech to Text or PDF to Text will returned unsensitive case and no punctuations or mistake punctuations and cases. So True Case can help you.\n", "\n", "1. jom makan di us makanan di sana sedap -> jom makan di US, makanan di sana sedap.\n", "2. kuala lumpur menteri di jabatan perdana menteri datuk seri dr mujahid yusof rawa hari ini mengakhiri lawatan kerja lapan hari ke jordan turki dan bosnia herzegovina lawatan yang bertujuan mengeratkan lagi hubungan dua hala dengan ketiga tiga negara berkenaan -> KUALA LUMPUR - Menteri di Jabatan Perdana Menteri, Datuk Seri Dr Mujahid Yusof Rawa hari ini mengakhiri lawatan kerja lapan hari ke Jordan, Turki dan Bosnia Herzegovina, lawatan yang bertujuan mengeratkan lagi hubungan dua hala dengan ketiga-tiga negara berkenaan." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "True case only,\n", "\n", "1. Solve mistake / no punctuations.\n", "2. Solve mistake / unsensitive case.\n", "3. Not correcting any grammar." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available HuggingFace model" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'mesolitica/finetune-true-case-t5-super-tiny-standard-bahasa-cased': {'Size (MB)': 51,\n", " 'WER': 0.105094863,\n", " 'CER': 0.02163576,\n", " 'Suggested length': 256},\n", " 'mesolitica/finetune-true-case-t5-tiny-standard-bahasa-cased': {'Size (MB)': 139,\n", " 'WER': 0.0967551738,\n", " 'CER': 0.0201099683,\n", " 'Suggested length': 256},\n", " 'mesolitica/finetune-true-case-t5-small-standard-bahasa-cased': {'Size (MB)': 242,\n", " 'WER': 0.081104625471,\n", " 'CER': 0.016383823,\n", " 'Suggested length': 256}}" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.true_case.available_huggingface" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tested on generated dataset at https://f000.backblazeb2.com/file/malay-dataset/true-case/test-set-true-case.json\n" ] } ], "source": [ "print(malaya.true_case.info)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load HuggingFace model\n", "\n", "```python\n", "def huggingface(\n", " model: str = 'mesolitica/finetune-true-case-t5-tiny-standard-bahasa-cased',\n", " force_check: bool = True,\n", " **kwargs,\n", "):\n", " \"\"\"\n", " Load HuggingFace model to true case.\n", "\n", " Parameters\n", " ----------\n", " model: str, optional (default='mesolitica/finetune-true-case-t5-tiny-standard-bahasa-cased')\n", " Check available models at `malaya.true_case.available_huggingface`.\n", " force_check: bool, optional (default=True)\n", " Force check model one of malaya model.\n", " Set to False if you have your own huggingface model.\n", "\n", " Returns\n", " -------\n", " result: malaya.torch_model.huggingface.Generator\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`, it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again. You will see the new `added_tokens_decoder` attribute that will store the relevant information.\n", "You are using the default legacy behaviour of the . If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565\n" ] } ], "source": [ "model = malaya.true_case.huggingface()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "string1 = 'jom makan di us makanan di sana sedap'\n", "string2 = 'kuala lumpur menteri di jabatan perdana menteri datuk seri dr mujahid yusof rawa hari ini mengakhiri lawatan kerja lapan hari ke jordan turki dan bosnia herzegovina lawatan yang bertujuan mengeratkan lagi hubungan dua hala dengan ketiga tiga negara berkenaan'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Predict\n", "\n", "```python\n", "def generate(self, strings: List[str], **kwargs):\n", " \"\"\"\n", " Generate texts from the input.\n", "\n", " Parameters\n", " ----------\n", " strings : List[str]\n", " **kwargs: vector arguments pass to huggingface `generate` method.\n", " Read more at https://huggingface.co/docs/transformers/main_classes/text_generation\n", "\n", " Returns\n", " -------\n", " result: List[str]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "spaces_between_special_tokens is deprecated and will be removed in transformers v5. It was adding spaces between `added_tokens`, not special tokens, and does not exist in our fast implementation. Future tokenizers will handle the decoding process on a per-model rule.\n" ] }, { "data": { "text/plain": [ "['Jom makan di US makanan di sana sedap',\n", " 'KUALA LUMPUR: Menteri di Jabatan Perdana Menteri, Datuk Seri Dr Mujahid Yusof Rawa hari ini mengakhiri lawatan kerja lapan hari ke Jordan Turki dan Bosnia Herzegovina, lawatan yang bertujuan mengeratkan lagi hubungan dua hala dengan ketiga-tiga negara berkenaan.']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.generate([string1, string2], max_length = 256)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "import random\n", "\n", "def random_uppercase(string):\n", " string = [c.upper() if random.randint(0,1) else c for c in string]\n", " return ''.join(string)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'KuAlA lUmPUr MeNtERI Di JabAtan PerdANA menterI DatuK Seri dR mUjaHId yUsOF rAwA HArI Ini MeNgAkHIrI LawaTAN KeRJa lAPAN HARi KE JORDAn TUrki DAn BoSNIA herZEGoVINA LaWatan yANG bErtujUAN meNgEratKAn laGI HuBUnGAN DUA HAlA DENgAN kETiGa tigA NEgAra bERKeNAAn'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r = random_uppercase(string2)\n", "r" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Kuala Lumpur Menteri di Jabatan Perdana Menteri Datuk Seri Dr Mujahid Yusof Rawa hari ini mengakhiri lawatan kerja lapan hari ke Jordan Turki dan Bosnia, Herzegovina. Lawatan yang bertujuan mengeratkan lagi hubungan dua hala dengan ketiga tiga negara berkenaan.']" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.generate([r], max_length = 256)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### able to infer mixed MS and EN" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'TUN DR MahAtHir MOhAmad and PERIKAtaN NASiOnAl (PN) INfoRMAtion cHIef DaTuk SERi AzmiN ALi MAy haVe difFErENCes, but BoTH mEn are on THe Same paGE ONE thIng – THE beLIeF thaT PaKataN HarAPaN (PH) IS baD nEWs fOr ThE EConOMY.'" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string3 = 'i hate chicken but i like fish'\n", "string4 = 'Tun Dr Mahathir Mohamad and Perikatan Nasional (PN) Information chief Datuk Seri Azmin Ali may have differences, but both men are on the same page one thing – the belief that Pakatan Harapan (PH) is bad news for the economy.'\n", "string4 = random_uppercase(string4)\n", "string4" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['I hate chicken but I like fish.',\n", " 'Tun Dr Mahathir Mohamad and Perikatan Nasional (PN) information chief Datuk Seri Azmin Ali may have differences, but both men are on the same page one thing – the belief that Pakatan Harapan (PH) is bad news for the economy.']" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.generate([string3, string4], max_length = 256)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }