{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Sentence tokenizer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/tokenizer-sentence](https://github.com/huseinzol05/Malaya/tree/master/example/tokenizer-sentence).\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 6.15 s, sys: 1.22 s, total: 7.36 s\n", "Wall time: 8.81 s\n" ] } ], "source": [ "%%time\n", "import malaya" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sentence tokenizer\n", "\n", "We considered prefixes, suffixes, starters, acronyms, websites, emails, digits, before digits, titles, time and month to split a sentence into multiple sentences.\n", "\n", "```python\n", "class SentenceTokenizer:\n", " def __init__(self):\n", " pass\n", "```" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "s_tokenizer = malaya.tokenizer.SentenceTokenizer()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tokenize\n", "\n", "```python\n", "def tokenize(self, string, minimum_length=5):\n", " \"\"\"\n", " Tokenize string into multiple strings.\n", "\n", " Parameters\n", " ----------\n", " string : str\n", " minimum_length: int, optional (default=5)\n", " minimum length to assume a string is a string, default 5 characters.\n", "\n", " Returns\n", " -------\n", " result: List[str]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['no.1 polis bertemu dengan suspek di ladang getah.',\n", " 'polis tembak pui pui pui bertubi tubi.']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = \"\"\"\n", "no. 1 polis bertemu dengan suspek di ladang getah. polis tembak pui pui pui bertubi tubi\n", "\"\"\"\n", "s_tokenizer.tokenize(s)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['email saya di husein.zol01@gmail.com, nanti jom berkopi.']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = \"\"\"\n", "email saya di husein.zol01@gmail.com, nanti jom berkopi\n", "\"\"\"\n", "s_tokenizer.tokenize(s)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['ke.2 cerita nya begini.',\n", " 'saya berjalan jalan ditepi muara jumpa anak dara.']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = \"\"\"\n", "ke. 2 cerita nya begini. saya berjalan jalan ditepi muara jumpa anak dara.\n", "\"\"\"\n", "s_tokenizer.tokenize(s)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['ke.2 cerita nya begini.', 'aku jumpa ybhg. dr. syed tadi, sakai gila.']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = \"\"\"\n", "ke. 2 cerita nya begini. aku jumpa ybhg. dr. 
], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }