Prefix Generator#

Give an initial sentence, and the model will continue generating text from it.

This tutorial is available as an IPython notebook at Malaya/example/prefix-generator.

This interface is deprecated; use the HuggingFace interface instead.

[1]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'
[2]:
%%time
import malaya
from pprint import pprint
CPU times: user 3.08 s, sys: 2.99 s, total: 6.06 s
Wall time: 2.35 s
[3]:
import warnings
warnings.filterwarnings('default')

List available Transformer models#

[4]:
malaya.generator.prefix.available_transformer()
/home/husein/dev/malaya/malaya/generator/prefix.py:116: DeprecationWarning: `malaya.generator.prefix.available_transformer` is deprecated, use `malaya.generator.prefix.available_huggingface` instead
  warnings.warn(
[4]:
Model  Size (MB)  Quantized Size (MB)  Perplexity
117M       499.0                126.0    6.232461
345M      1420.0                357.0    6.104012
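
As a rough reading of the table, each quantized checkpoint is close to a quarter of the size of the full one. That would be consistent with storing weights in 8 bits instead of 32-bit floats, although the 32-bit baseline is an assumption here and is not stated in the table. A minimal sketch that computes the ratios from the figures above (the helper name `compression_ratio` is hypothetical, not part of Malaya):

```python
# Sizes in MB, copied from the table above.
sizes_mb = {
    '117M': {'full': 499.0, 'quantized': 126.0},
    '345M': {'full': 1420.0, 'quantized': 357.0},
}


def compression_ratio(full_mb: float, quantized_mb: float) -> float:
    """Return how many times smaller the quantized checkpoint is."""
    return full_mb / quantized_mb


# Hypothetical helper applied to both models listed in the table.
ratios = {
    name: compression_ratio(size['full'], size['quantized'])
    for name, size in sizes_mb.items()
}

for name, ratio in ratios.items():
    print(f'{name}: about {ratio:.2f}x smaller when quantized')
```

A smaller download does not imply faster inference, so the size ratio says nothing about speed on a given machine.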

If you want to download the pretrained GPT2-Bahasa model and use it for custom transfer learning, you can get it from https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/gpt2, which also includes some notebooks to help you get started.

Load Transformer model#

def transformer(model: str = '345M', quantized: bool = False, **kwargs):
    """
    Load GPT2 model to generate a string given a prefix string.

    Parameters
    ----------
    model: str, optional (default='345M')
        Check available models at `malaya.generator.prefix.available_transformer()`.
    quantized: bool, optional (default=False)
        if True, will load the 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result: malaya.model.tf.GPT2 class
    """
[14]:
model = malaya.generator.prefix.transformer(model = '117M')
[9]:
string = 'ceritanya sebegini, aku bangun pagi baca surat khabar berita harian, tetiba aku nampak cerita seram, '

generate#

"""
    generate a text given an initial string.

    Parameters
    ----------
    string : str
    maxlen : int, optional (default=256)
        length of sentence to generate.
    n_samples : int, optional (default=1)
        size of output.
    temperature : float, optional (default=1.0)
        temperature value, should be between 0 and 1.
    top_k : int, optional (default=0)
        k for top-k sampling.
    top_p : float, optional (default=0.0)
        p for top-p (nucleus) sampling, should be between 0 and 1.
        if top_p == 0, will use top_k.
        if top_p == 0 and top_k == 0, use greedy decoder.

    Returns
    -------
    result: List[str]
    """
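
To make the decoding parameters above more concrete, here is a minimal, self-contained sketch of temperature scaling, top-k filtering, top-p (nucleus) filtering, and the fallback to greedy decoding, applied to a toy logit vector. It is not Malaya's implementation: the function names are hypothetical, and dividing the logits by the temperature is the usual GPT-2 convention, assumed here rather than taken from the docstring. Only the dispatch order (top_p first, then top_k, otherwise greedy) follows the docstring.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def _keep_and_renormalize(probs: List[float], keep: set) -> List[float]:
    """Zero out every token not in `keep`, then rescale to sum to 1."""
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def top_k_filter(probs: List[float], k: int) -> List[float]:
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _keep_and_renormalize(probs, set(order[:k]))


def top_p_filter(probs: List[float], p: float) -> List[float]:
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _keep_and_renormalize(probs, keep)


def pick_next_token(logits, temperature=1.0, top_k=0, top_p=0.0, rng=random):
    """Pick one token id; temperature must be greater than 0."""
    if top_p == 0 and top_k == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Assumption: logits are divided by the temperature before the softmax.
    probs = softmax([x / temperature for x in logits])
    probs = top_p_filter(probs, top_p) if top_p > 0 else top_k_filter(probs, top_k)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]
print(pick_next_token(toy_logits))                      # greedy
print(pick_next_token(toy_logits, top_k=2))             # sampled from the top 2
print(pick_next_token(toy_logits, top_p=0.8))           # sampled from the nucleus
```

In this sketch a lower temperature sharpens the distribution before filtering, so fewer tokens survive the top-p cut and the output becomes more repetitive and predictable.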
[7]:
print(model.generate(string, temperature = 0.1))
2022-11-18 14:35:42.868983: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 154389504 exceeds 10% of free system memory.
2022-11-18 14:35:46.038577: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "CPU" vendor: "GenuineIntel" model: "103" frequency: 2112 num_cores: 20 environment { key: "cpu_instruction_set" value: "AVX SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2" } environment { key: "eigen" value: "3.3.90" } l1_cache_size: 49152 l2_cache_size: 1310720 l3_cache_size: 26214400 memory_size: 268435456 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
['ceritanya sebegini, aku bangun pagi baca surat khabar berita harian, tetiba aku nampak cerita seram, iaitu "kita semua sahaja.\nSungguh aku sikit asyik sangat".\nPun begitu, ujar Lisa, yang bujang beri aku dan layang-layang memikir aku pulang sehari penuh yang bisa kamu lupakan dulu.\nBisa jadi muatan derasmu tetapi juga tersembunyi.\nIngat, aku berantem dan paham, memandang seorang insan tua yang super kuat.\nBaca pungging ia mengunjungi setiap penjuru gendang, jamuan, dan rombongan ikut berhari raya.\nBaca kisah Natasha itu, ia juga pernah jaga semua lima juapun, tapi hal seramnya menjadi sangat kencang disuap.\nSolusinya, aku bisa memastikan lemak lama dan lemung dari tombol dan bentuk tubuh memberikan kesan dan prosesnya bahkan dapat mer']
[8]:
print(model.generate(string, temperature = 0.1, top_p = 0.8))
['ceritanya sebegini, aku bangun pagi baca surat khabar berita harian, tetiba aku nampak cerita seram, iaitu suara gemuruh nafsu yang semakin hampir untuk berbunyi.\nMenurut cerita kemudian, mereka tidak dapat mengingati wajah itu lagi dan ia tidak dapat dilihat oleh mata, walaupun sudah tahu apa wajah itu.\nBagaimana mungkin keadaan mereka?\nMengapa mereka mengingati mimpi yang ditinggalkan oleh orang ramai sebelum itu, sedangkan kesan daripada kegelapan sedemikian?\nMimpi kusam dan rahsia tiba.\nMereka juga mula menangis sekali lagi dan akhirnya tidur berang.\nTiba-tiba seorang lelaki sedang menangis menceritakan kisahnya.\nMimpi jahat yang menimpa Allah SWT ini bukanlah kehidupan seseorang pun dahulu.\nBagaimana m']

Using Babble method#

We can also generate text in the style of GPT2 using Transformer-Bahasa models. Currently only BERT, ALBERT and ELECTRA are supported.

def babble_tf(
    string: str,
    model,
    generate_length: int = 30,
    leed_out_len: int = 1,
    temperature: float = 1.0,
    top_k: int = 100,
    burnin: int = 15,
    batch_size: int = 5,
):
    """
    Use pretrained malaya transformer models to generate a string given a prefix string.
    https://github.com/nyu-dl/bert-gen, https://arxiv.org/abs/1902.04094

    Parameters
    ----------
    string: str
    model: object
        transformer interface object. Currently only BERT, ALBERT and ELECTRA are supported.
    generate_length: int, optional (default=30)
        length of sentence to generate.
    leed_out_len: int, optional (default=1)
        length of extra masks for each iteration.
    temperature: float, optional (default=1.0)
        logits * temperature.
    top_k: int, optional (default=100)
        k for top-k sampling.
    burnin: int, optional (default=15)
        for the first burnin steps, sample from the entire next word distribution, instead of top_k.
    batch_size: int, optional (default=5)
        number of sentences to generate in one batch.

    Returns
    -------
    result: List[str]
    """
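
The interaction between `burnin` and `top_k` described in the docstring above can be sketched on a toy distribution. This is an illustration under stated assumptions, not the code Malaya or bert-gen runs: the function name `sample_position` is hypothetical, the probabilities are made up, and temperature is left out to keep the example short.

```python
import random
from typing import List


def sample_position(
    probs: List[float],
    step: int,
    burnin: int,
    top_k: int,
    rng: random.Random,
) -> int:
    """Sample a token id for one masked position at iteration `step`."""
    if step < burnin:
        # Burn-in phase: every token in the vocabulary is a candidate.
        candidates = list(range(len(probs)))
    else:
        # After burn-in: restrict sampling to the top_k most probable tokens.
        order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        candidates = order[:top_k]
    weights = [probs[i] for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(0)
toy_probs = [0.5, 0.3, 0.15, 0.05]  # made-up distribution over 4 tokens

early = {sample_position(toy_probs, step=0, burnin=15, top_k=1, rng=rng) for _ in range(200)}
late = {sample_position(toy_probs, step=20, burnin=15, top_k=1, rng=rng) for _ in range(200)}
print('token ids seen during burn-in:', sorted(early))
print('token ids seen after burn-in:', sorted(late))
```

Sampling from the full distribution early on lets the sequence explore before the top-k restriction pushes it toward the most likely tokens.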

Make sure you have already installed tensorflow-probability:

pip3 install tensorflow-probability==0.7.0
[10]:
# !pip3 install tensorflow-probability==0.7.0
[11]:
electra = malaya.transformer.load(model = 'electra')
/home/husein/dev/malaya/malaya/transformer.py:132: DeprecationWarning: `malaya.transformer.load` is deprecated, use `malaya.transformer.huggingface` instead
  warnings.warn('`malaya.transformer.load` is deprecated, use `malaya.transformer.huggingface` instead', DeprecationWarning)
Load pretrained transformer electra model will disable eager execution.
/home/husein/.local/lib/python3.8/site-packages/keras/legacy_tf_layers/core.py:236: UserWarning: `tf.layers.dense` is deprecated and will be removed in a future version. Please use `tf.keras.layers.Dense` instead.
  warnings.warn('`tf.layers.dense` is deprecated and '
/home/husein/.local/lib/python3.8/site-packages/keras/engine/base_layer_v1.py:1676: UserWarning: `layer.apply` is deprecated and will be removed in a future version. Please use `layer.__call__` method instead.
  warnings.warn('`layer.apply` is deprecated and '
WARNING:tensorflow:From /home/husein/.local/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:206: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.random.categorical` instead.
INFO:tensorflow:Restoring parameters from /home/husein/Malaya/electra-model/base/electra-base/model.ckpt
[13]:
malaya.generator.prefix.babble_tf(string, electra)
/home/husein/.local/lib/python3.8/site-packages/tensorflow_probability/python/internal/backend/numpy/numpy_array.py:281: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  def _sequence_mask(lengths, maxlen=None, dtype=np.bool, name=None):  # pylint: disable=unused-argument
/home/husein/.local/lib/python3.8/site-packages/tensorflow_probability/python/internal/backend/numpy/dtype.py:82: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  bool = np.bool  # pylint: disable=redefined-builtin
/home/husein/.local/lib/python3.8/site-packages/tensorflow_probability/python/internal/backend/numpy/dtype.py:112: DeprecationWarning: `np.str` is a deprecated alias for the builtin `str`. To silence this warning, use `str` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.str_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  string = getattr(np, 'str', getattr(np, 'string', None))
/home/husein/.local/lib/python3.8/site-packages/tensorflow_probability/python/mcmc/sample_halton_sequence.py:373: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  sieve = np.ones(n // 3 + (n % 6 == 2), dtype=np.bool)
/home/husein/.local/lib/python3.8/site-packages/tensorflow_probability/python/internal/backend/numpy/ops.py:301: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if dtype == np.bool:
/home/husein/.local/lib/python3.8/site-packages/tensorflow/python/client/session.py:1766: UserWarning: An interactive session is already active. This can cause out-of-memory errors in some cases. You must explicitly call `InteractiveSession.close()` to release resources held by the other session(s).
  warnings.warn('An interactive session is already active. This can '
/home/husein/dev/malaya/malaya/text/bpe.py:896: RuntimeWarning: invalid value encountered in true_divide
  weights = weights / np.sum(weights)
[13]:
['ceritanya sebegini , aku bangun pagi baca surat khabar berita harian , tetiba aku nampak cerita seram , sedih , gelakkan orang lain yang lari ke mana2 sedangkan aku sendiri pun tahu semua gambar mesti muncul , kalau kau tak nampak , lagilah aku mimpi tidur',
 'ceritanya sebegini , aku bangun pagi baca surat khabar berita harian , tetiba aku nampak cerita seram , gelak , gelak , menangis , gelak tak diduga , gelak2 . Aku turut terharu , nampak berita tak dijangka , terkejut aku tak diduga . Terima kasih .',
 'ceritanya sebegini , aku bangun pagi baca surat khabar berita harian , tetiba aku nampak cerita seram , rupanya ramai orang buat mereka seperti alien . Tapi peliknya , saditi akhirnya berlaku . Ya , misteri inilah jadi persoalan disebalik rumus , bacalah .',
 'ceritanya sebegini , aku bangun pagi baca surat khabar berita harian , tetiba aku nampak cerita seram , seru seruan penumpang kapal terbang yang telah berangkat menunaikan solat sunat puasa . Kisah benar kisah pasal hidup dan masa dia solat sunat puasa tu mesti ada cerita seram .',
 'ceritanya sebegini , aku bangun pagi baca surat khabar berita harian , tetiba aku nampak cerita seram , cicak ledang loceng belakang rumah , keret , lampu tutup , lampu terung , kucing aku , orang macam ini , tak ada tempat tersembunyi !']