{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# EN-MS alignment using HuggingFace" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
<div class=\"alert alert-info\">\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/alignment-en-ms-huggingface](https://github.com/huseinzol05/Malaya/tree/master/example/alignment-en-ms-huggingface).\n", " \n", "</div>
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
<div class=\"alert alert-warning\">\n", "\n", "Tensorflow >= 2.0 is required for the HuggingFace interface.\n", " \n", "</div>
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 5.72 s, sys: 1.08 s, total: 6.8 s\n", "Wall time: 8.64 s\n" ] } ], "source": [ "%%time\n", "import malaya" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install Transformers\n", "\n", "Make sure you have already installed `transformers`:\n", "\n", "```bash\n", "pip3 install transformers\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## List available HuggingFace models" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Size (MB)</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>mesolitica/finetuned-bert-base-multilingual-cased-noisy-en-ms</th>\n", " <td>599</td>\n", " </tr>\n", " <tr>\n", " <th>bert-base-multilingual-cased</th>\n", " <td>714</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Size (MB)\n", "mesolitica/finetuned-bert-base-multilingual-cas... 599\n", "bert-base-multilingual-cased 714" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.alignment.en_ms.available_huggingface()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load HuggingFace model\n", "\n", "```python\n", "def huggingface(model: str = 'mesolitica/finetuned-bert-base-multilingual-cased-noisy-en-ms', **kwargs):\n", " \"\"\"\n", " Load a HuggingFace BERT word-alignment model for EN-MS. Requires Tensorflow >= 2.0.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='mesolitica/finetuned-bert-base-multilingual-cased-noisy-en-ms')\n", " Model architecture supported. Allowed values:\n", "\n", " * ``'mesolitica/finetuned-bert-base-multilingual-cased-noisy-en-ms'`` - multilingual BERT fine-tuned on noisy EN-MS.\n", " * ``'bert-base-multilingual-cased'`` - pretrained multilingual BERT.\n", "\n", " Returns\n", " -------\n", " result: malaya.model.alignment.HuggingFace\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Some layers from the model checkpoint at mesolitica/finetuned-bert-base-multilingual-cased-noisy-en-ms were not used when initializing TFBertModel: ['mlm___cls']\n", "- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. 
initializing a BertForSequenceClassification model from a BertForPreTraining model).\n", "- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n", "Some layers of TFBertModel were not initialized from the model checkpoint at mesolitica/finetuned-bert-base-multilingual-cased-noisy-en-ms and are newly initialized: ['bert/pooler/dense/kernel:0', 'bert/pooler/dense/bias:0']\n", "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n" ] } ], "source": [ "model = malaya.alignment.en_ms.huggingface()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Align\n", "\n", "```python\n", "def align(\n", " self,\n", " source: List[str],\n", " target: List[str],\n", " align_layer: int = 8,\n", " threshold: float = 1e-3,\n", "):\n", " \"\"\"\n", " Align source and target texts using softmax output layers.\n", "\n", " Parameters\n", " ----------\n", " source: List[str]\n", " target: List[str]\n", " align_layer: int, optional (default=8)\n", " transformer layer-k to choose for embedding output.\n", " threshold: float, optional (default=1e-3)\n", " minimum probability to assume as alignment.\n", "\n", " Returns\n", " -------\n", " result: List[List[Tuple]]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "right = ['Terminal 1 KKIA dilengkapi kemudahan 64 kaunter daftar masuk, 12 aero bridge selain mampu menampung 3,200 penumpang dalam satu masa.']\n", "left = ['Terminal 1 KKIA is equipped with 64 check-in counters, 12 aero bridges and can accommodate 3,200 passengers at a time.']" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[(0, 0),\n", " (1, 1),\n", " (2, 2),\n", " (3, 4),\n", " (4, 5),\n", " (5, 6),\n", " (6, 7),\n", " (6, 8),\n", 
" (8, 8),\n", " (9, 9),\n", " (10, 10),\n", " (11, 11),\n", " (12, 12),\n", " (13, 13),\n", " (14, 14),\n", " (15, 15),\n", " (16, 16),\n", " (17, 17),\n", " (18, 18),\n", " (19, 19)]]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results = model.align(left, right, align_layer = 7)\n", "results" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 Terminal Terminal\n", "0 1 1\n", "0 KKIA KKIA\n", "0 dilengkapi is\n", "0 kemudahan equipped\n", "0 64 with\n", "0 kaunter 64\n", "0 kaunter 64\n", "0 masuk, counters,\n", "0 12 12\n", "0 aero aero\n", "0 bridge bridges\n", "0 selain and\n", "0 mampu can\n", "0 menampung accommodate\n", "0 3,200 3,200\n", "0 penumpang passengers\n", "0 dalam at\n", "0 satu a\n", "0 masa. time.\n" ] } ], "source": [ "for i in range(len(left)):\n", " left_splitted = left[i].split()\n", " right_splitted = right[i].split()\n", " for k in results[i]:\n", " print(i, right_splitted[k[0]], left_splitted[k[0]])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false 
}, "nbformat": 4, "nbformat_minor": 4 }
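Appendix. The aligner above follows the common embedding-matching recipe (as in tools like awesome-align): encode both sentences with a multilingual model, take the token-similarity matrix at a chosen layer, softmax it in both directions, and keep index pairs whose probability clears the threshold in both. The snippet below is a minimal NumPy sketch of that extraction step only, not Malaya's actual implementation; `extract_alignment` and its inputs are illustrative names.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def extract_alignment(src_emb, tgt_emb, threshold=1e-3):
    """Given token embeddings src_emb (n_src x d) and tgt_emb (n_tgt x d),
    return sorted (src_idx, tgt_idx) pairs whose forward AND backward
    softmax probabilities both exceed `threshold`."""
    sim = src_emb @ tgt_emb.T            # raw similarity matrix
    fwd = softmax(sim, axis=1)           # source -> target probabilities
    bwd = softmax(sim, axis=0)           # target -> source probabilities
    keep = (fwd > threshold) & (bwd > threshold)
    return sorted(zip(*np.nonzero(keep)))

# Toy check: identical, well-separated embeddings align on the diagonal.
pairs = extract_alignment(10 * np.eye(3), 10 * np.eye(3))
```

With a real model, `src_emb`/`tgt_emb` would be hidden states from a chosen transformer layer (cf. the `align_layer` parameter above), pooled from subword tokens back to words.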