Prefix Generator#

Give an initial sentence, and the model will continue generating the text.

This tutorial is available as an IPython notebook at Malaya/example/prefix-generator.

[1]:
%%time
import malaya
from pprint import pprint
CPU times: user 4.89 s, sys: 684 ms, total: 5.57 s
Wall time: 4.64 s

Load GPT2#

Malaya provides a pretrained GPT2 model specific to Malay; we call it GPT2-Bahasa. This interface does not allow custom training.

GPT2-Bahasa was pretrained on ~1.2 billion words.

If you want to download the pretrained GPT2-Bahasa model and use it for custom transfer learning, you can get it at https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/gpt2, which includes some notebooks to help you get started.

List available GPT2#

[2]:
malaya.generator.available_gpt2()
INFO:root:calculate perplexity on never seen malay karangan.
[2]:
      Size (MB)  Quantized Size (MB)  Perplexity
117M      499.0                126.0    6.232461
345M     1420.0                357.0    6.104012

Lower perplexity indicates a better language model.

Load model#

def gpt2(model: str = '345M', quantized: bool = False, **kwargs):
    """
    Load GPT2 model to generate a string given a prefix string.

    Parameters
    ----------
    model : str, optional (default='345M')
        Model architecture supported. Allowed values:

        * ``'117M'`` - GPT2 117M parameters.
        * ``'345M'`` - GPT2 345M parameters.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result: malaya.model.tf.GPT2 class
    """
[3]:
model = malaya.generator.gpt2(model = '117M')
INFO:root:running gpt2/117M using device /device:CPU:0
[4]:
model_quantized = malaya.generator.gpt2(model = '117M', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
INFO:root:running gpt2/117M-quantized using device /device:CPU:0
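The larger ``'345M'`` checkpoint from the table above loads through the same interface; a minimal sketch, where ``model_345M`` is just an illustrative name:

# quantized 345M keeps the download to ~357 MB instead of ~1.4 GB (see the table above)
model_345M = malaya.generator.gpt2(model = '345M', quantized = True)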
[5]:
string = 'ceritanya sebegini, aku bangun pagi baca surat khabar berita harian, tetiba aku nampak cerita seram, '

generate#

"""
    generate text given an initial string.

    Parameters
    ----------
    string : str
    maxlen : int, optional (default=256)
        length of sentence to generate.
    n_samples : int, optional (default=1)
        size of output.
    temperature : float, optional (default=1.0)
        temperature value, value should be between 0 and 1.
    top_k : int, optional (default=0)
        top-k in nucleus sampling selection.
    top_p : float, optional (default=0.0)
        top-p in nucleus sampling selection, value should be between 0 and 1.
        if top_p == 0, will use top_k.
        if top_p == 0 and top_k == 0, use greedy decoder.

    Returns
    -------
    result: List[str]
    """
[6]:
print(model.generate(string, temperature = 0.1))
["ceritanya sebegini, aku bangun pagi baca surat khabar berita harian, tetiba aku nampak cerita seram, ia adalah rancangan anak-anak yang aku bawa balik sampanye.\nSekali lagi aku hanya akan meminta dia jadi pembenci, dan memanggil supaya aku boleh berkata.\nKata-kata itu hanya diberikan pada adik aku; selepas itu pada aku terus terang, saudara-saudara aku hanya masu melihat betapa serngah aku mencipta naluri berkenaan.\nDia memakai kacamata untuk berjaga dan berwuduk dan menggesel dirinya hampir untuk berjalan.\nDia mempersembahkan rambutnya, daun berasal dari keduanya dan berselerak gila.\n'Tahi' seorang lelaki dunia pun tak berapa malah pakaiannya hanya menyapu ke bibir dan seluruh kulit aku menjadi panas.\nDengar, sop"]
[7]:
print(model.generate(string, temperature = 0.1, top_p = 0.8))
['ceritanya sebegini, aku bangun pagi baca surat khabar berita harian, tetiba aku nampak cerita seram, iaitu cerita pertama yang menjadi viral, walaupun nama saya tidak kedengaran keras.\n"Aku pun memang tak rasa duduk di sini sambil mencari makanan, tengok kami tak boleh pergi mana-mana-mana ke sana (yang aku tidak mahu pergi).\nKalau nak buka bersama, aku tak boleh.\n"Tapi sebab aku tak nampak jelas, ramai orang menyebut perkataan yang kita tidak suka, mereka suka mengecam," katanya ketika ditemui pada majlis rasmi sebuah televisyen yang disiarkan secara langsung oleh Malaysiakini di Media, baru-baru ini.\nPada masa sama, Perdana Menteri, Tun Dr Mahathir Mohamad mengucapkan takziah atas kematian Adun Aman, Al-Ihsan dan Hajar Tahir, man']
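The docstring above also covers pure top-k sampling, greedy decoding, and multiple samples per call; a minimal sketch (parameter values here are arbitrary, and outputs vary between runs):

# top_p = 0 (the default) falls back to top-k sampling, per the docstring
samples = model.generate(string, maxlen = 128, n_samples = 3, top_k = 50)
for s in samples:
    print(s)

# top_p == 0 and top_k == 0 uses the greedy decoder
print(model.generate(string, top_k = 0, top_p = 0.0))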

Using Babble method#

We can also generate text, GPT2-style, using Transformer-Bahasa. Right now only BERT, ALBERT and ELECTRA are supported.

def babble(
    string: str,
    model,
    generate_length: int = 30,
    leed_out_len: int = 1,
    temperature: float = 1.0,
    top_k: int = 100,
    burnin: int = 15,
    batch_size: int = 5,
):
    """
    Use pretrained transformer models to generate a string given a prefix string.
    https://github.com/nyu-dl/bert-gen, https://arxiv.org/abs/1902.04094

    Parameters
    ----------
    string: str
    model: object
        transformer interface object. Right now only BERT, ALBERT and ELECTRA are supported.
    generate_length : int, optional (default=30)
        length of sentence to generate.
    leed_out_len : int, optional (default=1)
        length of extra masks for each iteration.
    temperature: float, optional (default=1.0)
        logits * temperature.
    top_k: int, optional (default=100)
        k for top-k sampling.
    burnin: int, optional (default=15)
        for the first burnin steps, sample from the entire next word distribution, instead of top_k.
    batch_size: int, optional (default=5)
        number of sentences generated per batch.

    Returns
    -------
    result: List[str]
    """

Make sure you have already installed tensorflow-probability:

pip3 install tensorflow-probability==0.7.0
[10]:
# !pip3 install tensorflow-probability==0.7.0
[3]:
electra = malaya.transformer.load(model = 'electra')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:56: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/modeling.py:242: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.Dense instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/tensorflow_core/python/layers/core.py:187: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:79: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:93: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/sampling.py:26: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:115: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.random.categorical` instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:118: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:119: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:121: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:122: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:128: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/Malaya/malaya/transformers/electra/__init__.py:130: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

INFO:tensorflow:Restoring parameters from /Users/huseinzolkepli/Malaya/electra-model/base/electra-base/model.ckpt
[11]:
malaya.generator.babble(string, electra)
[11]:
['ceritanya sebegini , aku bangun pagi baca surat khabar berita harian , tetiba aku nampak cerita seram , terseksa juga hidup di sekeliling aku . Diorang tak tahu sebab diorang tahu titik hitam yang mana kita tengok dari mana kita sendiri nampak cerita ke . Haih .',
 'ceritanya sebegini , aku bangun pagi baca surat khabar berita harian , tetiba aku nampak cerita seram , tengah baca benda besar pasal bumbung bilik . Rasanya sejuk macam pulau harapan . So aku baca cerita seram pelik . Jadi sedih juga dengar cerita seram seram ni .',
 'ceritanya sebegini , aku bangun pagi baca surat khabar berita harian , tetiba aku nampak cerita seram , lalu ibu ambil pusing bagi buku sejarah . Dah baca marsh pastu aku dah buat thread seram , ada dalam masa terdekat baru bangun . Sedih , hidup lagi',
 'ceritanya sebegini , aku bangun pagi baca surat khabar berita harian , tetiba aku nampak cerita seram , mesti seram sampai aku ikut takdir Allah bagi betul2 aib kita kembali menulis mengenai kisah cinta aku ini malam , aku tersedar selepas ada seorang lelaki tersedar .',
 'ceritanya sebegini , aku bangun pagi baca surat khabar berita harian , tetiba aku nampak cerita seram , sedangkan yang baca pasal negara berpagarism memang patut berterima kasih . Kata ayah , ingatkan boleh mandi atau bilik pun boleh , kena air dan bukannya ikut kemampuan']
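All the babble parameters documented above can be tuned; a minimal sketch (values here are arbitrary, and sampling is nondeterministic):

malaya.generator.babble(
    string,
    electra,
    generate_length = 50,  # longer continuations than the default 30
    temperature = 0.8,     # scale the logits (logits * temperature, per the docstring)
    top_k = 50,            # restrict sampling to the 50 most likely tokens
    burnin = 20,           # sample from the full distribution for the first 20 steps
    batch_size = 3,        # generate 3 sentences
)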

ngrams#

You can generate ngrams pretty easily using this interface:

def ngrams(
    sequence,
    n: int,
    pad_left = False,
    pad_right = False,
    left_pad_symbol = None,
    right_pad_symbol = None,
):
    """
    generate ngrams.

    Parameters
    ----------
    sequence : List[str]
        list of tokenized words.
    n : int
        ngram size
    pad_left : bool, optional (default=False)
        if True, pad the left side of the sequence.
    pad_right : bool, optional (default=False)
        if True, pad the right side of the sequence.
    left_pad_symbol : str, optional (default=None)
        symbol used for left padding, e.g. ``'START'``.
    right_pad_symbol : str, optional (default=None)
        symbol used for right padding, e.g. ``'END'``.

    Returns
    -------
    ngram: list
    """
[6]:
string = 'saya suka makan ayam'

list(malaya.generator.ngrams(string.split(), n = 2))
[6]:
[('saya', 'suka'), ('suka', 'makan'), ('makan', 'ayam')]
[7]:
list(malaya.generator.ngrams(string.split(), n = 2, pad_left = True, pad_right = True))
[7]:
[(None, 'saya'),
 ('saya', 'suka'),
 ('suka', 'makan'),
 ('makan', 'ayam'),
 ('ayam', None)]
[8]:
list(malaya.generator.ngrams(string.split(), n = 2, pad_left = True, pad_right = True,
                            left_pad_symbol = 'START'))
[8]:
[('START', 'saya'),
 ('saya', 'suka'),
 ('suka', 'makan'),
 ('makan', 'ayam'),
 ('ayam', None)]
[9]:
list(malaya.generator.ngrams(string.split(), n = 2, pad_left = True, pad_right = True,
                            left_pad_symbol = 'START', right_pad_symbol = 'END'))
[9]:
[('START', 'saya'),
 ('saya', 'suka'),
 ('suka', 'makan'),
 ('makan', 'ayam'),
 ('ayam', 'END')]
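A common follow-up is counting ngram frequencies; a minimal sketch using only the standard library (the two-sentence corpus is made up for illustration):

from collections import Counter

corpus = ['saya suka makan ayam', 'saya suka makan ikan']
bigram_counts = Counter()
for sentence in corpus:
    # Counter.update with an iterable of tuples counts each bigram tuple
    bigram_counts.update(malaya.generator.ngrams(sentence.split(), n = 2))

bigram_counts.most_common(3)
# e.g. [(('saya', 'suka'), 2), (('suka', 'makan'), 2), (('makan', 'ayam'), 1)]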