{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Grapheme-to-Phoneme DBP" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/phoneme](https://github.com/huseinzol05/Malaya/tree/master/example/phoneme).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module was trained only on standard language structure, so it is not safe to use it for local language structure.\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module was trained on only 600 samples from the https://prpm.dbp.gov.my/ Glosari Dialek.\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Explanation\n", "\n", "A phonemizer performs grapheme-to-phoneme (G2P) conversion, the process of generating the pronunciation of a word from its written form. For example, from [https://prpm.dbp.gov.my/Cari1?keyword=acaq&d=150348&#LIHATSINI](https://prpm.dbp.gov.my/Cari1?keyword=acaq&d=150348&#LIHATSINI),\n", "\n", "`acaq` -> `[A.tSAâÖ]`" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 6.15 s, sys: 1.52 s, total: 7.66 s\n", "Wall time: 11.3 s\n" ] } ], "source": [ "%%time\n", "import malaya" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use deep learning model\n", "\n", "Load the LSTM + Bahdanau Attention phonemizer model.\n", "\n", "If you are using TensorFlow 2, make sure TensorFlow Addons is already installed:\n", "\n", "```bash\n", "pip install tensorflow-addons -U\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```python\n", "def deep_model(quantized=False, **kwargs):\n", "    \"\"\"\n", "    Load LSTM + Bahdanau Attention phonetic model,\n", "    originally from https://prpm.dbp.gov.my/ Glosari Dialek.\n", "\n", "    Original size 10.4MB, quantized size 2.77MB.\n", "\n", "    Parameters\n", "    ----------\n", "    quantized : bool, optional (default=False)\n", "        if True, will load 8-bit quantized model.\n", "        Quantized model is not necessarily faster, it depends on the machine.\n", "\n", "    Returns\n", "    -------\n", "    result: malaya.model.tf.Seq2SeqLSTM class\n", "    \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "model = malaya.phoneme.deep_model()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Quantized model\n", "\n", "To load the 8-bit quantized model, simply pass `quantized = True`; the default is `False`.\n", "\n", "Expect a slight accuracy drop from the quantized model. It is not necessarily faster than the normal 32-bit float model; that depends entirely on the machine." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Load quantized model will cause accuracy drop.\n" ] } ], "source": [ "quantized_model = malaya.phoneme.deep_model(quantized = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Predict\n", "\n", "```python\n", "def predict(self, strings: List[str], beam_search: bool = False):\n", "    \"\"\"\n", "    Convert to target string.\n", "\n", "    Parameters\n", "    ----------\n", "    strings : List[str]\n", "    beam_search : bool, optional (default=False)\n", "        If True, use beam search decoder, else use greedy decoder.\n", "\n", "    Returns\n", "    -------\n", "    result: List[str]\n", "    \"\"\"\n", "```\n", "\n", "To speed up inference, keep `beam_search = False` (the default), which uses the greedy decoder." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['sA.jA su.kA mA.kAn A.jAm', 'A.jAâÖ A.tSAâÖ kot.S\\x8d)Ö']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.predict(['saya suka makan ayam', 'ayaq acaq kotoq'])" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['sA.jA su.kA mA.kAn A.jAm', 'A.jAâÖ A.tSAâÖ kot.S\\x8d)Ö']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "quantized_model.predict(['saya suka makan ayam', 'ayaq acaq kotoq'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Limitation\n", "\n", "The model cannot convert numbers to phonemes."
] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "['A']" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.predict(['123'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You have to normalize numbers into words first, for example with https://malaya.readthedocs.io/en/latest/load-num2word.html" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['s«.ÒAt du.wA pu.luh ti.gA']" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.predict([malaya.num2word.to_cardinal(123)])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }