Keyword Extraction

This tutorial is available as an IPython notebook at Malaya/example/keyword-extraction.

[1]:
import malaya
[2]:
# https://www.bharian.com.my/berita/nasional/2020/06/698386/isu-bersatu-tun-m-6-yang-lain-saman-muhyiddin

string = """
Dalam saman itu, plaintif memohon perisytiharan, antaranya mereka adalah ahli BERSATU yang sah, masih lagi memegang jawatan dalam parti (bagi pemegang jawatan) dan layak untuk bertanding pada pemilihan parti.

Mereka memohon perisytiharan bahawa semua surat pemberhentian yang ditandatangani Muhammad Suhaimi bertarikh 28 Mei lalu dan pengesahan melalui mesyuarat Majlis Pimpinan Tertinggi (MPT) parti bertarikh 4 Jun lalu adalah tidak sah dan terbatal.

Plaintif juga memohon perisytiharan bahawa keahlian Muhyiddin, Hamzah dan Muhammad Suhaimi di dalam BERSATU adalah terlucut, berkuat kuasa pada 28 Februari 2020 dan/atau 29 Februari 2020, menurut Fasal 10.2.3 perlembagaan parti.

Yang turut dipohon, perisytiharan bahawa Seksyen 18C Akta Pertubuhan 1966 adalah tidak terpakai untuk menghalang pelupusan pertikaian berkenaan oleh mahkamah.

Perisytiharan lain ialah Fasal 10.2.6 Perlembagaan BERSATU tidak terpakai di atas hal melucutkan/ memberhentikan keahlian semua plaintif.
"""
[3]:
import re

# minimal cleaning: replace newlines, then strip any character
# outside letters, hyphens, parentheses and spaces.
def cleaning(string):
    string = string.replace('\n', ' ')
    string = re.sub(r'[^A-Za-z\-() ]+', ' ', string).strip()
    string = re.sub(r'[ ]+', ' ', string).strip()
    return string

string = cleaning(string)

Use RAKE algorithm

Original implementation from https://github.com/aneesha/RAKE. Malaya adds an attention mechanism on top of the RAKE algorithm.

def rake(
    string: str,
    model = None,
    vectorizer = None,
    top_k: int = 5,
    atleast: int = 1,
    stopwords = get_stopwords,
    **kwargs
):
    """
    Extract keywords using Rake algorithm.

    Parameters
    ----------
    string: str
    model: Object, optional (default=None)
        Transformer model or any model that has an `attention` method.
    vectorizer: Object, optional (default=None)
        Prefer `sklearn.feature_extraction.text.CountVectorizer` or,
        `malaya.text.vectorizer.SkipGramCountVectorizer`.
        If None, will generate ngrams automatically based on `stopwords`.
    top_k: int, optional (default=5)
        return top-k results.
    ngram: tuple, optional (default=(1,1))
        n-gram sizes.
    atleast: int, optional (default=1)
        minimum count in the string to accept a word as a candidate.
    stopwords: List[str], (default=malaya.texts.function.get_stopwords)
        A callable that returns a List[str], or a List[str], or a Tuple[str],
        for the automatic ngram generator.

    Returns
    -------
    result: Tuple[float, str]
    """

auto-ngram

This will automatically generate n-gram candidates of varying sizes.

[4]:
malaya.keyword_extraction.rake(string)
[4]:
[(0.11666666666666665, 'ditandatangani Muhammad Suhaimi bertarikh Mei'),
 (0.08888888888888888, 'mesyuarat Majlis Pimpinan Tertinggi'),
 (0.08888888888888888, 'Seksyen C Akta Pertubuhan'),
 (0.05138888888888888, 'parti bertarikh Jun'),
 (0.04999999999999999, 'keahlian Muhyiddin Hamzah')]

auto-ngram with Attention

This will use the attention mechanism from the model as the scores. I will use small-electra in this example.

[5]:
electra = malaya.transformer.load(model = 'small-electra')
INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/electra-model/small/electra-small/model.ckpt
[6]:
malaya.keyword_extraction.rake(string, model = electra)
[6]:
[(0.2113546236771915, 'ditandatangani Muhammad Suhaimi bertarikh Mei'),
 (0.1707678455680971, 'terlucut berkuat kuasa'),
 (0.16650756665229807, 'Muhammad Suhaimi'),
 (0.1620429894692799, 'mesyuarat Majlis Pimpinan Tertinggi'),
 (0.08333952583953884, 'Seksyen C Akta Pertubuhan')]

using vectorizer

[7]:
from malaya.text.vectorizer import SkipGramCountVectorizer

stopwords = malaya.text.function.get_stopwords()
vectorizer = SkipGramCountVectorizer(
    token_pattern = r'[\S]+',
    ngram_range = (1, 3),
    stop_words = stopwords,
    lowercase = False,
    skip = 2
)
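For intuition, a skip-gram of size n allows up to `skip` tokens to be skipped between gram members while preserving token order. A minimal sketch, assuming this interpretation of the `skip` parameter (the `skipgrams` helper is hypothetical, not Malaya's implementation):

```python
from itertools import combinations

def skipgrams(tokens, n, skip):
    # Keep ordered n-token combinations whose total span fits
    # within n + skip consecutive tokens.
    grams = set()
    for idxs in combinations(range(len(tokens)), n):
        if idxs[-1] - idxs[0] <= (n - 1) + skip:
            grams.add(' '.join(tokens[i] for i in idxs))
    return grams
```

With `ngram_range = (1, 3)` and `skip = 2` as above, the vectorizer therefore produces far more candidates than contiguous n-grams alone.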
[8]:
malaya.keyword_extraction.rake(string, vectorizer = vectorizer)
[8]:
[(0.0017052987393271276, 'parti memohon perisytiharan'),
 (0.0017036368782590756, 'memohon perisytiharan BERSATU'),
 (0.0017012023597074357, 'memohon perisytiharan sah'),
 (0.0017012023597074357, 'sah memohon perisytiharan'),
 (0.0016992809994779549, 'perisytiharan BERSATU sah')]

fixed-ngram with Attention

[9]:
malaya.keyword_extraction.rake(string, model = electra, vectorizer = vectorizer)
[9]:
[(0.011575972342905336, 'Suhaimi terlucut kuasa'),
 (0.011181842074981322, 'Suhaimi terlucut berkuat'),
 (0.011115820862501402, 'Hamzah Suhaimi terlucut'),
 (0.011088260762034929, 'Muhammad Suhaimi terlucut'),
 (0.010932737717462946, 'Suhaimi BERSATU terlucut')]

Use Textrank algorithm

Malaya simply uses the TextRank algorithm.

def textrank(
    string: str,
    model = None,
    vectorizer = None,
    top_k: int = 5,
    atleast: int = 1,
    stopwords = get_stopwords,
    **kwargs
):
    """
    Extract keywords using Textrank algorithm.

    Parameters
    ----------
    string: str
    model: Object, optional (default=None)
        Any model that has a `fit_transform` or `vectorize` method.
    vectorizer: Object, optional (default=None)
        Prefer `sklearn.feature_extraction.text.CountVectorizer` or,
        `malaya.text.vectorizer.SkipGramCountVectorizer`.
        If None, will generate ngrams automatically based on `stopwords`.
    top_k: int, optional (default=5)
        return top-k results.
    atleast: int, optional (default=1)
        minimum count in the string to accept a word as a candidate.
    stopwords: List[str], (default=malaya.texts.function.get_stopwords)
        A callable that returns a List[str], or a List[str], or a Tuple[str]

    Returns
    -------
    result: Tuple[float, str]
    """
[10]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()

auto-ngram with TFIDF

This will automatically generate n-gram candidates of varying sizes.

[11]:
malaya.keyword_extraction.textrank(string, model = tfidf)
[11]:
[(0.00015733542072521276, 'plaintif memohon perisytiharan'),
 (0.00012558967703709954, 'Fasal perlembagaan parti'),
 (0.00011514137183023093, 'Fasal Perlembagaan BERSATU'),
 (0.00011505528232050447, 'parti'),
 (0.00010763519022276223, 'memohon perisytiharan')]

auto-ngram with Attention

This will automatically generate n-gram candidates of varying sizes.

[12]:
electra = malaya.transformer.load(model = 'small-electra')
albert = malaya.transformer.load(model = 'albert')
INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/electra-model/small/electra-small/model.ckpt
INFO:tensorflow:loading sentence piece model
INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/albert-model/base/albert-base/model.ckpt
[13]:
malaya.keyword_extraction.textrank(string, model = electra)
[13]:
[(6.318266041614872e-05, 'dipohon perisytiharan'),
 (6.316746526248747e-05, 'pemegang jawatan'),
 (6.31611903536171e-05, 'parti bertarikh Jun'),
 (6.31610445866738e-05, 'Februari'),
 (6.315819101361123e-05, 'plaintif')]
[14]:
malaya.keyword_extraction.textrank(string, model = albert)
[14]:
[(7.964653918577322e-05, 'Fasal Perlembagaan BERSATU'),
 (7.746139285912213e-05, 'mesyuarat Majlis Pimpinan Tertinggi'),
 (7.522448051439215e-05, 'Muhammad Suhaimi'),
 (7.520443897301994e-05, 'pengesahan'),
 (7.519602319474711e-05, 'terbatal Plaintif')]

Alternatively, you can use any classification model to find keywords sensitive to a specific domain.

[15]:
sentiment = malaya.sentiment.transformer(model = 'xlnet', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
[16]:
malaya.keyword_extraction.textrank(string, model = sentiment)
[16]:
[(6.698925306115632e-05, 'ahli BERSATU'),
 (6.675329349228935e-05, 'plaintif memohon perisytiharan'),
 (6.483194243100408e-05, 'melucutkan memberhentikan keahlian'),
 (6.471105464624579e-05, 'mesyuarat Majlis Pimpinan Tertinggi'),
 (6.467850486969276e-05, 'ditandatangani Muhammad Suhaimi bertarikh Mei')]

fixed-ngram with Attention

[17]:
stopwords = malaya.text.function.get_stopwords()
vectorizer = SkipGramCountVectorizer(
    token_pattern = r'[\S]+',
    ngram_range = (1, 3),
    stop_words = stopwords,
    lowercase = False,
    skip = 2
)
[18]:
malaya.keyword_extraction.textrank(string, model = electra, vectorizer = vectorizer)
[18]:
[(5.652169287708196e-09, 'plaintif perisytiharan'),
 (5.652075506278682e-09, 'perisytiharan ahli sah'),
 (5.651996154832122e-09, 'Plaintif perisytiharan keahlian'),
 (5.651931921600406e-09, 'Perisytiharan'),
 (5.651703273185467e-09, 'plaintif memohon perisytiharan')]
[19]:
malaya.keyword_extraction.textrank(string, model = albert, vectorizer = vectorizer)
[19]:
[(7.23758580900875e-09, 'Perisytiharan Fasal Perlembagaan'),
 (7.237124467070075e-09, 'Fasal Perlembagaan melucutkan'),
 (7.234613418160024e-09, 'Pimpinan Tertinggi (MPT)'),
 (7.231803194224148e-09, 'Majlis Pimpinan (MPT)'),
 (7.231487343952181e-09, 'Perisytiharan Fasal BERSATU')]

Load Attention mechanism

Use the attention mechanism to extract important keywords.

def attention(
    string: str,
    model,
    vectorizer = None,
    top_k: int = 5,
    atleast: int = 1,
    stopwords = get_stopwords,
    **kwargs
):
    """
    Extract keywords using Attention mechanism.

    Parameters
    ----------
    string: str
    model: Object
        Transformer model or any model that has an `attention` method.
    vectorizer: Object, optional (default=None)
        Prefer `sklearn.feature_extraction.text.CountVectorizer` or,
        `malaya.text.vectorizer.SkipGramCountVectorizer`.
        If None, will generate ngrams automatically based on `stopwords`.
    top_k: int, optional (default=5)
        return top-k results.
    atleast: int, optional (default=1)
        minimum count in the string to accept a word as a candidate.
    stopwords: List[str], (default=malaya.texts.function.get_stopwords)
        A callable that returns a List[str], or a List[str], or a Tuple[str]

    Returns
    -------
    result: Tuple[float, str]
    """

auto-ngram

This will automatically generate n-gram candidates of varying sizes.

[20]:
malaya.keyword_extraction.attention(string, model = electra)
[20]:
[(0.9452064568002397, 'menghalang pelupusan pertikaian'),
 (0.007486688404188947, 'Fasal Perlembagaan BERSATU'),
 (0.005130747276971111, 'ahli BERSATU'),
 (0.005036595631722718, 'melucutkan memberhentikan keahlian'),
 (0.004883706288857347, 'BERSATU')]
[21]:
malaya.keyword_extraction.attention(string, model = albert)
[21]:
[(0.16196368022187793, 'plaintif memohon perisytiharan'),
 (0.09294065744319371, 'memohon perisytiharan'),
 (0.06902302277868422, 'plaintif'),
 (0.05584840295920779, 'ditandatangani Muhammad Suhaimi bertarikh Mei'),
 (0.05206225590337424, 'dipohon perisytiharan')]

fixed-ngram

[22]:
malaya.keyword_extraction.attention(string, model = electra, vectorizer = vectorizer)
[22]:
[(0.037611191435411966, 'pertikaian mahkamah Perlembagaan'),
 (0.037571215711288866, 'pertikaian mahkamah Fasal'),
 (0.0375634142013458, 'terpakai pertikaian mahkamah'),
 (0.03756289802628609, 'menghalang pertikaian mahkamah'),
 (0.03756143645898762, 'pelupusan pertikaian mahkamah')]
[23]:
malaya.keyword_extraction.attention(string, model = albert, vectorizer = vectorizer)
[23]:
[(0.007390033406455312, 'saman plaintif memohon'),
 (0.006895206525865519, 'Dalam plaintif memohon'),
 (0.006638398338567768, 'plaintif memohon BERSATU'),
 (0.006223140839798238, 'Dalam saman memohon'),
 (0.0061965713344477175, 'plaintif memohon perisytiharan')]