{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# NSFW Detection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya/example/nsfw](https://github.com/huseinzol05/Malaya/tree/master/example/nsfw).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A simple module that detects whether a text is NSFW (not safe for work)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.8 s, sys: 3.64 s, total: 6.44 s\n", "Wall time: 1.97 s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n", "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n" ] } ], "source": [ "%%time\n", "import malaya" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get label" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['sex', 'gambling', 'negative']" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya.nsfw.label" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load lexicon model\n", "\n", "A naive but effective lexicon-based approach; the lexicon is gathered at [Malay-Dataset/corpus/nsfw](https://github.com/huseinzol05/Malay-Dataset/tree/master/corpus/nsfw).\n", "\n", "```python\n", "def lexicon(**kwargs):\n", " \"\"\"\n", " Load Lexicon NSFW model.\n", "\n", " Returns\n", " -------\n", " result : malaya.text.lexicon.nsfw.Lexicon class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "lexicon_model = malaya.nsfw.lexicon()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "string1 = 'xxx sgt panas, best weh'\n", "string2 = 'jmpa dekat kl sentral'\n", "string3 = 'Rolet Dengan Wang Sebenar'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 
Predict batch of strings" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['sex', 'negative', 'gambling']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lexicon_model.predict([string1, string2, string3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load multinomial model\n", "\n", "Starting from v3.4, all model interfaces follow the sklearn interface:\n", "\n", "```python\n", "def multinomial(**kwargs):\n", " \"\"\"\n", " Load multinomial NSFW model.\n", "\n", " Returns\n", " -------\n", " result : malaya.model.ml.BAYES class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/.local/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator ComplementNB from version 0.22.1 when using version 1.1.2. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:\n", "https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations\n", " warnings.warn(\n", "/home/husein/.local/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator TfidfTransformer from version 0.22.1 when using version 1.1.2. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:\n", "https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations\n", " warnings.warn(\n", "/home/husein/.local/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator TfidfVectorizer from version 0.22.1 when using version 1.1.2. This might lead to breaking code or invalid results. Use at your own risk. 
For more info please refer to:\n", "https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations\n", " warnings.warn(\n" ] } ], "source": [ "model = malaya.nsfw.multinomial()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Predict batch of strings" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/dev/malaya/malaya/model/stem.py:28: FutureWarning: Possible nested set at position 3\n", " or re.findall(_expressions['ic'], word.lower())\n" ] }, { "data": { "text/plain": [ "['sex', 'negative', 'gambling']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.predict([string1, string2, string3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Predict batch of strings with probability" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'sex': 0.9357058034930408,\n", " 'gambling': 0.02616353532998711,\n", " 'negative': 0.03813066117697173},\n", " {'sex': 0.027541900360621846,\n", " 'gambling': 0.03522626245360637,\n", " 'negative': 0.9372318371857732},\n", " {'sex': 0.01865380888750343,\n", " 'gambling': 0.9765340760395791,\n", " 'negative': 0.004812115072918792}]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.predict_proba([string1, string2, string3])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }