NSFW Detection#

This tutorial is available as an IPython notebook at Malaya/example/nsfw.

A simple, straightforward module: it detects whether a text is NSFW (not safe for work) or not.

[1]:
%%time
import malaya
CPU times: user 2.8 s, sys: 3.64 s, total: 6.44 s
Wall time: 1.97 s

Get label#

[2]:
malaya.nsfw.label
[2]:
['sex', 'gambling', 'negative']

Load lexicon model#

A naive but effective approach. The lexicon was gathered at Malay-Dataset/corpus/nsfw.

def lexicon(**kwargs):
    """
    Load Lexicon NSFW model.

    Returns
    -------
    result : malaya.text.lexicon.nsfw.Lexicon class
    """
[3]:
lexicon_model = malaya.nsfw.lexicon()
[4]:
string1 = 'xxx sgt panas, best weh'
string2 = 'jmpa dekat kl sentral'
string3 = 'Rolet Dengan Wang Sebenar'

Predict batch of strings#

[5]:
lexicon_model.predict([string1, string2, string3])
[5]:
['sex', 'negative', 'gambling']
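
To illustrate the idea behind a lexicon-based classifier, here is a minimal sketch. It is not Malaya's implementation: the tiny `TOY_LEXICON`, the `predict_one` and `predict` helpers, and the tokenisation rule are all assumptions made up for this example. Each text gets the label whose word list matches the most tokens, and falls back to `negative` when nothing matches.

```python
import re
from collections import Counter

# Hypothetical toy lexicon for illustration only; the real word lists are
# the ones gathered at Malay-Dataset/corpus/nsfw.
TOY_LEXICON = {
    'sex': {'xxx'},
    'gambling': {'rolet', 'kasino'},
}


def predict_one(text, lexicon=TOY_LEXICON, default='negative'):
    """Return the label with the most lexicon hits, or `default` if none."""
    tokens = re.findall(r'\w+', text.lower())
    hits = Counter()
    for label, words in lexicon.items():
        hits[label] = sum(token in words for token in tokens)
    label, count = hits.most_common(1)[0]
    return label if count > 0 else default


def predict(texts, lexicon=TOY_LEXICON, default='negative'):
    """Apply `predict_one` to a batch of strings."""
    return [predict_one(text, lexicon, default) for text in texts]


print(predict([
    'xxx sgt panas, best weh',
    'jmpa dekat kl sentral',
    'Rolet Dengan Wang Sebenar',
]))
# ['sex', 'negative', 'gambling']
```

A lookup like this is fast and easy to audit, but it only catches words that are already in the list, which is why it is described as naive.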

Load multinomial model#

All model interfaces follow the sklearn interface starting from v3.4.

def multinomial(**kwargs):
    """
    Load multinomial NSFW model.

    Returns
    -------
    result : malaya.model.ml.BAYES class
    """
[6]:
model = malaya.nsfw.multinomial()

Predict batch of strings#

[7]:
model.predict([string1, string2, string3])
[7]:
['sex', 'negative', 'gambling']

Predict batch of strings with probability#

[8]:
model.predict_proba([string1, string2, string3])
[8]:
[{'sex': 0.9357058034930408,
  'gambling': 0.02616353532998711,
  'negative': 0.03813066117697173},
 {'sex': 0.027541900360621846,
  'gambling': 0.03522626245360637,
  'negative': 0.9372318371857732},
 {'sex': 0.01865380888750343,
  'gambling': 0.9765340760395791,
  'negative': 0.004812115072918792}]
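
The probability dictionaries can be post-processed in plain Python, for example to flag a text only when the model is confident enough. The sketch below is one possible way to do it. The `top_label` helper, the `threshold` value and the `fallback` label are assumptions for illustration and not part of Malaya. The scores are the ones shown above, rounded to four decimals.

```python
# Scores copied from the predict_proba output above, rounded to 4 decimals.
probabilities = [
    {'sex': 0.9357, 'gambling': 0.0262, 'negative': 0.0381},
    {'sex': 0.0275, 'gambling': 0.0352, 'negative': 0.9372},
    {'sex': 0.0187, 'gambling': 0.9765, 'negative': 0.0048},
]


def top_label(scores, threshold=0.5, fallback='negative'):
    """Return the highest-scoring label, or `fallback` if it is below `threshold`."""
    label = max(scores, key=scores.get)
    return label if scores[label] >= threshold else fallback


print([top_label(scores) for scores in probabilities])
# ['sex', 'negative', 'gambling']

# A stricter, hypothetical threshold keeps only very confident predictions.
print([top_label(scores, threshold=0.95) for scores in probabilities])
# ['negative', 'negative', 'gambling']
```

Raising the threshold trades recall for precision: fewer texts are flagged, but those that are flagged carry higher model confidence.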