"This tutorial is available as an IPython notebook at [Malaya/example/spelling-correction-probability](https://github.com/huseinzol05/Malaya/tree/master/example/spelling-correction-probability)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This spelling corrector extends Peter Norvig's spell corrector, http://norvig.com/spell-correct.html\n",
"\n",
"It improves on it using some algorithms from Normalization of noisy texts in Malaysian online reviews,\n",
"https://www.researchgate.net/publication/287050449_Normalization_of_noisy_texts_in_Malaysian_online_reviews\n",
"\n",
"It also adds custom vowel augmentation."
]
},
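{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of the Norvig-style starting point, the sketch below generates every string that is one edit away from a word (a delete, transpose, replace, or insert). The name `edits1` follows Norvig's article; this is a simplified sketch, not Malaya's internal code, and it does not include the vowel augmentation.\n",
"\n",
"```python\n",
"import string\n",
"\n",
"\n",
"def edits1(word, letters=string.ascii_lowercase):\n",
"    # All strings that are exactly one edit away from `word`.\n",
"    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]\n",
"    deletes = [left + right[1:] for left, right in splits if right]\n",
"    transposes = [\n",
"        left + right[1] + right[0] + right[2:]\n",
"        for left, right in splits\n",
"        if len(right) > 1\n",
"    ]\n",
"    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]\n",
"    inserts = [left + c + right for left, right in splits for c in letters]\n",
"    return set(deletes + transposes + replaces + inserts)\n",
"\n",
"\n",
"# 'suka' is reachable from 'suke' by replacing the final letter.\n",
"print('suka' in edits1('suke'))  # True\n",
"```\n",
"\n",
"A real corrector then keeps only the generated strings that appear in a known vocabulary."
]
},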
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"os.environ['CUDA_VISIBLE_DEVICES'] = ''\n",
"os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import logging\n",
"\n",
"logging.basicConfig(level=logging.INFO)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/ubuntu/dev/malaya/malaya/tokenizer.py:202: FutureWarning: Possible nested set at position 3361\n",
" self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n",
"/home/ubuntu/dev/malaya/malaya/tokenizer.py:202: FutureWarning: Possible nested set at position 3879\n",
" self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n"
]
}
],
"source": [
"import malaya"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# some text examples copied from Twitter\n",
"\n",
"string1 = 'krajaan patut bagi pencen awal skt kpd warga emas supaya emosi'\n",
"string2 = 'Husein ska mkn aym dkat kampng Jawa'\n",
"string3 = 'Melayu malas ni narration dia sama je macam men are trash. True to some, false to some.'\n",
"string4 = 'Tapi tak pikir ke bahaya perpetuate myths camtu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tau pulak marah. Your kids will be victims of that too.'\n",
"string5 = 'DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as i am edging towards retirement in 4-5 years time after a career of being an Engineer, Project Manager, General Manager'\n",
"string6 = 'blh bntg dlm kls nlp sy, nnti intch'\n",
"string7 = 'mulakn slh org boleh ,bila geng tuh kena slhkn jgk xboleh trima .. pelik'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load probability model\n",
"\n",
"```python\n",
"def load(\n",
"    language_model=None,\n",
"    sentence_piece: bool = False,\n",
"    stemmer=None,\n",
"    **kwargs,\n",
"):\n",
"    \"\"\"\n",
"    Load a Probability Spell Corrector.\n",
"\n",
"    Parameters\n",
"    ----------\n",
"    language_model: Callable, optional (default=None)\n",
"        If not None, must be an instance of kenlm.Model.\n",
"    sentence_piece: bool, optional (default=False)\n",
"        If True, reduce possible augmentation states using sentence piece.\n",
"    stemmer: Callable, optional (default=None)\n",
"        A Callable object; must have a `stem_word` method.\n",
"\n",
"    Returns\n",
"    -------\n",
"    result: model\n",
"        List of model classes:\n",
"\n",
"        * if passed `language_model` will return `malaya.spelling_correction.probability.ProbabilityLM`.\n",
"        * else will return `malaya.spelling_correction.probability.Probability`.\n",
"    \"\"\"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:malaya_boilerplate.huggingface:downloading frozen huseinzol05/v27-preprocessing/bm_1grams.json\n"
]
}
],
"source": [
"model = malaya.spelling_correction.probability.load()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### List the generated pool of candidate words\n",
"\n",
"```python\n",
"def edit_candidates(self, word):\n",
"    \"\"\"\n",
"    Generate candidates given a word.\n",
"\n",
"    Parameters\n",
"    ----------\n",
"    word: str\n",
"\n",
"    Returns\n",
"    -------\n",
"    result: List[str]\n",
"    \"\"\"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['mahathir']"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.edit_candidates('mhthir')"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['sembung',\n",
" 'sumbang',\n",
" 'sambong',\n",
" 'sambung',\n",
" 'sambang',\n",
" 'sembong',\n",
" 'sumbing',\n",
" 'sembing',\n",
" 'simbang',\n",
" 'sembang',\n",
" 'sombong']"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.edit_candidates('smbng')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, `edit_candidates` suggests quite a lot of candidates, and some of them, like `sambang`, are not actual words. To reduce them, we can use [sentencepiece](https://github.com/google/sentencepiece) to check whether a candidate is a legitimate word in the Malaysian context."
]
},
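{
"cell_type": "markdown",
"metadata": {},
"source": [
"The generate-then-filter idea can be sketched in plain Python. This is a toy illustration only: `vowel_candidates` and the tiny `vocab` set are hypothetical stand-ins, not Malaya's implementation or its sentencepiece vocabulary. It fills each gap in a shortened word with at most one vowel and keeps only the candidates found in the vocabulary.\n",
"\n",
"```python\n",
"from itertools import product\n",
"\n",
"VOWELS = 'aeiou'\n",
"\n",
"\n",
"def vowel_candidates(word):\n",
"    # Fill each gap between adjacent letters with nothing or one vowel.\n",
"    if not word:\n",
"        return set()\n",
"    options = [''] + list(VOWELS)\n",
"    results = set()\n",
"    for fill in product(options, repeat=len(word) - 1):\n",
"        pieces = [word[0]]\n",
"        for vowel, letter in zip(fill, word[1:]):\n",
"            pieces.append(vowel + letter)\n",
"        results.add(''.join(pieces))\n",
"    return results\n",
"\n",
"\n",
"# Hypothetical vocabulary, used only for this example.\n",
"vocab = {'sambung', 'sombong', 'sumbang', 'sembang'}\n",
"\n",
"kept = sorted(vowel_candidates('smbng') & vocab)\n",
"print(kept)  # ['sambung', 'sembang', 'sombong', 'sumbang']\n",
"```\n",
"\n",
"The unfiltered pool grows quickly with word length, which is why a vocabulary check of some kind is useful."
]
},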
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:malaya_boilerplate.huggingface:downloading frozen huseinzol05/v27-preprocessing/sp10m.cased.v4.vocab\n",
"INFO:malaya_boilerplate.huggingface:downloading frozen huseinzol05/v27-preprocessing/sp10m.cased.v4.model\n",
"INFO:malaya_boilerplate.huggingface:downloading frozen huseinzol05/v27-preprocessing/bm_1grams.json\n"
]
},
{
"data": {
"text/plain": [
"['sembung',\n",
" 'sumbang',\n",
" 'sambong',\n",
" 'sambung',\n",
" 'sembong',\n",
" 'sumbing',\n",
" 'sembing',\n",
" 'sembang',\n",
" 'sombong']"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model_sp = malaya.spelling_correction.probability.load(sentence_piece = True)\n",
"model_sp.edit_candidates('smbng')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**So how does the model know which word to pick? It picks the candidate with the highest count in the corpus!**"
]
},
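{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of that selection rule, assuming a unigram count table. The `counts` values and the helper `pick` are invented for illustration; they are not taken from the `bm_1grams.json` file that Malaya downloads.\n",
"\n",
"```python\n",
"# Hypothetical unigram counts, invented for this example.\n",
"counts = {'sambung': 120, 'sombong': 45, 'sumbang': 80, 'sembang': 60}\n",
"\n",
"\n",
"def pick(candidates, counts):\n",
"    # Return the candidate with the highest corpus count; unseen words count as 0.\n",
"    return max(candidates, key=lambda w: counts.get(w, 0))\n",
"\n",
"\n",
"print(pick(['sombong', 'sambung', 'sambang'], counts))  # sambung\n",
"```\n",
"\n",
"Because `sambang` is missing from the table it scores 0, so a frequent real word wins over it."
]
},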
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### To correct a word\n",
"\n",
"```python\n",
"def correct(self, word: str, **kwargs):\n",
"    \"\"\"\n",
"    Most probable spelling correction for word.\n",
"\n",
"    Parameters\n",
"    ----------\n",
"    word: str\n",
"\n",
"    Returns\n",
"    -------\n",
"    result: str\n",
"    \"\"\"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'suka'"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.correct('suke')"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'kepada'"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.correct('kpd')"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'kerajaan'"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.correct('krajaan')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### To correct a sentence\n",
"\n",
"```python\n",
"def correct_text(self, text: str):\n",
"    \"\"\"\n",
"    Correct all the words within a text, returning the corrected text.\n",
"\n",
"    Parameters\n",
"    ----------\n",
"    text: str\n",
"\n",
"    Returns\n",
"    -------\n",
"    result: str\n",
"    \"\"\"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'kerajaan patut bagi pencen awal sakit kepada warga emas supaya emosi'"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.correct_text(string1)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"tokenizer = malaya.tokenizer.Tokenizer()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Husein ska mkn aym dkat kampng Jawa'"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"string2"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Hussein suka makan ayam dekat kampung Jawa'"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tokenized = tokenizer.tokenize(string2)\n",
"model.correct_text(' '.join(tokenized))"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Melayu malas ini narration dia sama sahaja macam men are trash . True to some , false to some .'"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tokenized = tokenizer.tokenize(string3)\n",
"model.correct_text(' '.join(tokenized))"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'DrM cerita Melayu malas semenjak saya kat University ( early 1980s ) and now as i am edging towards retirement in 4 - 5 years time after a career of being an Engineer , Project Manager , General Manager'"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tokenized = tokenizer.tokenize(string5)\n",
"model.correct_text(' '.join(tokenized))"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'boleh bintang dalam kelas nlp saya , nanti intch'"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tokenized = tokenizer.tokenize(string6)\n",
"model.correct_text(' '.join(tokenized))"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'mulakan salah orang boleh , bila geng itu kena salahkan juga xboleh terima . . pelik'"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tokenized = tokenizer.tokenize(string7)\n",
"model.correct_text(' '.join(tokenized))"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['suluhkan', 'silahkan', 'salahakan', 'salahkan']"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.edit_candidates('slhkn')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load stemmer for probability model\n",
"\n",
"By default, kata imbuhan (affixes) are captured using a naive regex pattern that does not understand word structure. The problem with this approach is that many rules need to be hardcoded, so we can use a better stemmer model such as `malaya.stem.deep_model(model = 'noisy')`."
]
},
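{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see why hand-written rules get unwieldy, here is a deliberately naive regex sketch. The pattern and the helper `split_suffix` are hypothetical and much simpler than anything in Malaya; they only split off a few common suffixes so that the root could be corrected separately.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"# Hypothetical pattern covering only a handful of Malay suffixes.\n",
"SUFFIX = re.compile('^(.+?)(kan|kn|nya|lah)$')\n",
"\n",
"\n",
"def split_suffix(word):\n",
"    # Return (root, suffix); the suffix is empty when no rule matches.\n",
"    match = SUFFIX.match(word)\n",
"    if match:\n",
"        return match.group(1), match.group(2)\n",
"    return word, ''\n",
"\n",
"\n",
"print(split_suffix('slhkn'))  # ('slh', 'kn')\n",
"print(split_suffix('pelik'))  # ('pelik', '')\n",
"# The root word 'makan' is wrongly split, which shows the limits of blind rules.\n",
"print(split_suffix('makan'))  # ('ma', 'kan')\n",
"```\n",
"\n",
"Each such false split would need another hardcoded exception, which is the motivation for passing a learned stemmer instead."
]
},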
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:malaya_boilerplate.frozen_graph:running home/ubuntu/.cache/huggingface/hub using device /device:CPU:0\n",
"2022-09-01 21:46:11.257398: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA\n",
"To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\n",
"2022-09-01 21:46:11.261443: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected\n",
"2022-09-01 21:46:11.261459: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: huseincomel-desktop\n",
"2022-09-01 21:46:11.261462: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: huseincomel-desktop\n",
"2022-09-01 21:46:11.261512: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program\n",
"2022-09-01 21:46:11.261529: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 470.141.3\n"
]
}
],
"source": [
"stemmer = malaya.stem.deep_model(model = 'noisy')"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:malaya_boilerplate.huggingface:downloading frozen huseinzol05/v27-preprocessing/bm_1grams.json\n"
]
}
],
"source": [
"model_stemmer = malaya.spelling_correction.probability.load(stemmer = stemmer)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['suluhkan', 'silahkan', 'salahakan', 'salahkan']"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model_stemmer.edit_candidates('slhkn')"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'mulakan salah orang boleh , bila geng itu kena salahkan juga xboleh terima . . pelik'"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tokenized = tokenizer.tokenize(string7)\n",
"model_stemmer.correct_text(' '.join(tokenized))"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"s = 'mulakn slh org boleh ,bila geng tuh kena slhkn jgk xboleh trima .. pelik , dia slhkn org bole hri2 crta sakau then bila kna bls balik xdpt jwb ,kata mcm biasa slh (parti sampah) 🤣🤣🤣 jgn mulakn dlu slhkn org kalau xboleh trima bila kna bls balik 🤣🤣🤣'"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'mulakan salah orang boleh , bila geng itu kena salahkan juga xboleh terima . . pelik , dia salahkan orang bole hari2 cerita sakau then bila kena balas balik xdpt jawab , kata macam biasa salah ( parti sampah ) 🤣 🤣 🤣 jangan mulakan dahulu salahkan orang kalau xboleh terima bila kena balas balik 🤣 🤣 🤣'"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tokenized = tokenizer.tokenize(s)\n",
"model_stemmer.correct_text(' '.join(tokenized))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
},
"varInspector": {
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"delete_cmd_postfix": "",
"delete_cmd_prefix": "del ",
"library": "var_list.py",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"delete_cmd_postfix": ") ",
"delete_cmd_prefix": "rm(",
"library": "var_list.r",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}