Transformer#

This tutorial is available as an IPython notebook at Malaya/example/transformer.

Below is the list of datasets we pretrained on,

Standard Bahasa dataset,

  1. Malay-dataset/dumping.

  2. Malay-dataset/pure-text.

Bahasa social media,

  1. Malay-dataset/dumping/instagram.

  2. Malay-dataset/dumping/twitter.

Singlish / Manglish,

  1. Malay-dataset/dumping/singlish.

  2. Malay-dataset/dumping/singapore-news.

This interface does not let us do custom training.

If you want to download a pretrained Transformer-Bahasa model and use it for custom transfer learning, you can get it from https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/, which also includes some notebooks to help you get started.

[1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''
[2]:
%%time
import malaya
CPU times: user 3.16 s, sys: 3.4 s, total: 6.56 s
Wall time: 2.25 s
[3]:
import warnings
warnings.filterwarnings('default')

List available Transformers#

[4]:
malaya.transformer.available_transformer()
/home/husein/dev/malaya/malaya/transformer.py:90: DeprecationWarning: `malaya.transformer.available_transformer` is deprecated, use `malaya.transformer.available_huggingface` instead
  warnings.warn(
[4]:
               Size (MB)  Description
bert           425.6      Google BERT BASE parameters
tiny-bert      57.4       Google BERT TINY parameters
albert         48.6       Google ALBERT BASE parameters
tiny-albert    22.4       Google ALBERT TINY parameters
xlnet          446.6      Google XLNET BASE parameters
alxlnet        46.8       Malaya ALXLNET BASE parameters
electra        443        Google ELECTRA BASE parameters
small-electra  55         Google ELECTRA SMALL parameters
[5]:
strings = ['Kerajaan galakkan rakyat naik public transport tapi parking kat lrt ada 15. Reserved utk staff rapid je dah berpuluh. Park kereta tepi jalan kang kene saman dgn majlis perbandaran. Kereta pulak senang kene curi. Cctv pun tak ada. Naik grab dah 5-10 ringgit tiap hari. Gampang juga',
           'Alaa Tun lek ahhh npe muka masam cmni kn agong kata usaha kerajaan terdahulu sejak selepas merdeka',
           "Orang ramai cakap nurse kerajaan garang. So i tell u this. Most of our local ppl will treat us as hamba abdi and they don't respect us as a nurse"]

Load XLNET-Bahasa#

def load(model: str = 'electra', pool_mode: str = 'last', **kwargs):
    """
    Load transformer model.

    Parameters
    ----------
    model: str, optional (default='electra')
        Check available models at `malaya.transformer.available_transformer()`.
    pool_mode: str, optional (default='last')
        Model logits architecture supported. Only usable if model in ['xlnet', 'alxlnet']. Allowed values:

        * ``'last'`` - last of the sequence.
        * ``'first'`` - first of the sequence.
        * ``'mean'`` - mean of the sequence.
        * ``'attn'`` - attention of the sequence.

    Returns
    -------
    result: model
        List of model classes:

        * if `bert` in model, will return `malaya.transformers.bert.Model`.
        * if `xlnet` in model, will return `malaya.transformers.xlnet.Model`.
        * if `albert` in model, will return `malaya.transformers.albert.Model`.
        * if `electra` in model, will return `malaya.transformers.electra.Model`.
    """
[6]:
xlnet = malaya.transformer.load(model = 'xlnet')
/home/husein/dev/malaya/malaya/transformer.py:132: DeprecationWarning: `malaya.transformer.load` is deprecated, use `malaya.transformer.huggingface` instead
  warnings.warn('`malaya.transformer.load` is deprecated, use `malaya.transformer.huggingface` instead', DeprecationWarning)
Load pretrained transformer xlnet model will disable eager execution.
INFO:tensorflow:memory input None
INFO:tensorflow:Use float type <dtype: 'float32'>
2022-11-11 20:24:45.251161: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-11 20:24:45.256505: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-11-11 20:24:45.256524: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: husein-MS-7D31
2022-11-11 20:24:45.256528: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: husein-MS-7D31
2022-11-11 20:24:45.256600: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
2022-11-11 20:24:45.256620: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 470.141.3
/home/husein/.local/lib/python3.8/site-packages/keras/legacy_tf_layers/core.py:393: UserWarning: `tf.layers.dropout` is deprecated and will be removed in a future version. Please use `tf.keras.layers.Dropout` instead.
  warnings.warn('`tf.layers.dropout` is deprecated and '
/home/husein/.local/lib/python3.8/site-packages/keras/engine/base_layer_v1.py:1676: UserWarning: `layer.apply` is deprecated and will be removed in a future version. Please use `layer.__call__` method instead.
  warnings.warn('`layer.apply` is deprecated and '
/home/husein/.local/lib/python3.8/site-packages/keras/legacy_tf_layers/core.py:236: UserWarning: `tf.layers.dense` is deprecated and will be removed in a future version. Please use `tf.keras.layers.Dense` instead.
  warnings.warn('`tf.layers.dense` is deprecated and '
INFO:tensorflow:Restoring parameters from /home/husein/Malaya/xlnet-model/base/xlnet-base/model.ckpt
2022-11-11 20:24:46.265884: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 98304000 exceeds 10% of free system memory.
2022-11-11 20:24:46.452546: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 98304000 exceeds 10% of free system memory.

The sentences in `strings` were copied at random from Twitter, found by searching for the keyword kerajaan.

Vectorization#

Convert a string or a batch of strings into latent-space vector representations.

def vectorize(self, strings: List[str]):

    """
    Vectorize string inputs.

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: np.array
    """
[7]:
v = xlnet.vectorize(strings)
v.shape
[7]:
(3, 768)
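
One common use of such sentence vectors is to compare sentences by cosine similarity. The sketch below is only an illustration under assumptions: `v_demo` is a random placeholder with the same `(3, 768)` shape as `v`, so that the example runs without loading a model, and `cosine_similarity_matrix` is a hypothetical helper, not part of Malaya.

```python
import numpy as np

def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Return the pairwise cosine similarity of the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Clip the norms to avoid dividing by zero for an all-zero row.
    unit = vectors / np.clip(norms, 1e-12, None)
    return unit @ unit.T

# Placeholder standing in for `v = xlnet.vectorize(strings)`.
rng = np.random.default_rng(0)
v_demo = rng.normal(size=(3, 768))

sim = cosine_similarity_matrix(v_demo)
print(sim.shape)
```

With real output, replacing `v_demo` with `v` would give a 3 x 3 matrix whose entry `[i, j]` scores how close sentence `i` is to sentence `j`.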

Attention#

def attention(self, strings: List[str], method: str = 'last', **kwargs):
    """
    Get attention weights for the string inputs.

    Parameters
    ----------
    strings : List[str]
    method : str, optional (default='last')
        Attention layer supported. Allowed values:

        * ``'last'`` - attention from last layer.
        * ``'first'`` - attention from first layer.
        * ``'mean'`` - average attentions from all layers.

    Returns
    -------
    result : List[List[Tuple[str, float]]]
    """

You can pass a list with one or more strings to get the attention; in this documentation, I just use a single string.

[8]:
xlnet.attention([strings[1]], method = 'last')
[8]:
[[('Alaa', 0.062061906),
  ('Tun', 0.051056832),
  ('lek', 0.1311541),
  ('ahhh', 0.08195937),
  ('npe', 0.062106956),
  ('muka', 0.047061794),
  ('masam', 0.058289334),
  ('cmni', 0.026094288),
  ('kn', 0.05614683),
  ('agong', 0.033949904),
  ('kata', 0.05264412),
  ('usaha', 0.07063399),
  ('kerajaan', 0.046773825),
  ('terdahulu', 0.057166424),
  ('sejak', 0.0457128),
  ('selepas', 0.0470482),
  ('merdeka', 0.070139356)]]
[9]:
xlnet.attention([strings[1]], method = 'first')
[9]:
[[('Alaa', 0.045956098),
  ('Tun', 0.04009481),
  ('lek', 0.061107174),
  ('ahhh', 0.07029097),
  ('npe', 0.048513662),
  ('muka', 0.05667023),
  ('masam', 0.040880706),
  ('cmni', 0.08728455),
  ('kn', 0.047778476),
  ('agong', 0.081243195),
  ('kata', 0.038660403),
  ('usaha', 0.058326434),
  ('kerajaan', 0.05544658),
  ('terdahulu', 0.07716211),
  ('sejak', 0.059514306),
  ('selepas', 0.05385497),
  ('merdeka', 0.07721528)]]
[10]:
xlnet.attention([strings[1]], method = 'mean')
[10]:
[[('Alaa', 0.06978633),
  ('Tun', 0.05174421),
  ('lek', 0.059642658),
  ('ahhh', 0.055883665),
  ('npe', 0.053392064),
  ('muka', 0.06806308),
  ('masam', 0.048992105),
  ('cmni', 0.06981932),
  ('kn', 0.057752043),
  ('agong', 0.06556668),
  ('kata', 0.059152912),
  ('usaha', 0.0633051),
  ('kerajaan', 0.05060846),
  ('terdahulu', 0.05888331),
  ('sejak', 0.057429556),
  ('selepas', 0.042058237),
  ('merdeka', 0.06792031)]]
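
Since each result is a list of `(token, weight)` pairs, plain Python is enough to inspect it. As a small sketch, the code below takes the weights printed for `method = 'last'` above, checks that they sum to roughly 1, and picks the three most attended tokens. The helper name `top_tokens` is hypothetical.

```python
# Weights copied from the `method = 'last'` XLNET output above.
attention = [
    ('Alaa', 0.062061906), ('Tun', 0.051056832), ('lek', 0.1311541),
    ('ahhh', 0.08195937), ('npe', 0.062106956), ('muka', 0.047061794),
    ('masam', 0.058289334), ('cmni', 0.026094288), ('kn', 0.05614683),
    ('agong', 0.033949904), ('kata', 0.05264412), ('usaha', 0.07063399),
    ('kerajaan', 0.046773825), ('terdahulu', 0.057166424),
    ('sejak', 0.0457128), ('selepas', 0.0470482), ('merdeka', 0.070139356),
]

def top_tokens(pairs, k=3):
    """Return the k (token, weight) pairs with the largest weights."""
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:k]

total = sum(weight for _, weight in attention)
top3 = [token for token, _ in top_tokens(attention)]

print(round(total, 4))  # 1.0
print(top3)             # ['lek', 'ahhh', 'usaha']
```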

Visualize Attention#

Before using attention visualization, we need to load D3 into our Jupyter notebook. This visualization is borrowed from https://github.com/jessevig/bertviz.

def visualize_attention(self, string: str):

    """
    Visualize attention.

    Parameters
    ----------
    string : str
    """
[11]:
%%javascript
require.config({
  paths: {
      d3: '//cdnjs.cloudflare.com/ajax/libs/d3/3.4.8/d3.min',
      jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
  }
});
[12]:
xlnet.visualize_attention('nak makan ayam dgn husein')
/home/husein/dev/malaya/malaya/function/html.py:391: ResourceWarning: unclosed file <_io.TextIOWrapper name='/home/husein/dev/malaya/malaya/function/web/static/head_view.js' mode='r' encoding='UTF-8'>
  vis_js = open(
ResourceWarning: Enable tracemalloc to get the object allocation traceback

All models that support attention can use these interfaces.

Load ELECTRA-Bahasa#

Feel free to use other models.

[13]:
electra = malaya.transformer.load(model = 'electra')
WARNING:tensorflow:From /home/husein/.local/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:206: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.random.categorical` instead.
2022-11-11 20:24:50.920389: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 98304000 exceeds 10% of free system memory.
INFO:tensorflow:Restoring parameters from /home/husein/Malaya/electra-model/base/electra-base/model.ckpt
2022-11-11 20:24:51.094119: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 98304000 exceeds 10% of free system memory.
[14]:
electra.attention([strings[1]], method = 'last')
[14]:
[[('Alaa', 0.05981715),
  ('Tun', 0.075028375),
  ('lek', 0.057848394),
  ('ahhh', 0.046973255),
  ('npe', 0.051608335),
  ('muka', 0.06221235),
  ('masam', 0.058585603),
  ('cmni', 0.054711338),
  ('kn', 0.06741889),
  ('agong', 0.056326743),
  ('kata', 0.05418279),
  ('usaha', 0.07986903),
  ('kerajaan', 0.055595957),
  ('terdahulu', 0.052879248),
  ('sejak', 0.04999219),
  ('selepas', 0.053916227),
  ('merdeka', 0.063034184)]]
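
To see where the two models disagree, the `method = 'last'` weights printed above for XLNET and ELECTRA can be compared token by token. This is only a rough sketch on the numbers already shown in this page; the variable names are hypothetical, and a single sentence is not enough to draw conclusions about the models.

```python
tokens = ['Alaa', 'Tun', 'lek', 'ahhh', 'npe', 'muka', 'masam', 'cmni', 'kn',
          'agong', 'kata', 'usaha', 'kerajaan', 'terdahulu', 'sejak',
          'selepas', 'merdeka']

# Weights copied from the XLNET and ELECTRA `method = 'last'` outputs above.
xlnet_last = [0.062061906, 0.051056832, 0.1311541, 0.08195937, 0.062106956,
              0.047061794, 0.058289334, 0.026094288, 0.05614683, 0.033949904,
              0.05264412, 0.07063399, 0.046773825, 0.057166424, 0.0457128,
              0.0470482, 0.070139356]
electra_last = [0.05981715, 0.075028375, 0.057848394, 0.046973255, 0.051608335,
                0.06221235, 0.058585603, 0.054711338, 0.06741889, 0.056326743,
                0.05418279, 0.07986903, 0.055595957, 0.052879248, 0.04999219,
                0.053916227, 0.063034184]

# Positive difference: XLNET puts more weight on the token than ELECTRA.
diffs = {t: x - e for t, x, e in zip(tokens, xlnet_last, electra_last)}
most_different = max(diffs, key=lambda t: abs(diffs[t]))

xlnet_top = tokens[xlnet_last.index(max(xlnet_last))]
electra_top = tokens[electra_last.index(max(electra_last))]

print(most_different, round(diffs[most_different], 4))  # lek 0.0733
print(xlnet_top, electra_top)                           # lek usaha
```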