Different Precision#

This tutorial is available as an IPython notebook at Malaya/example/different-precision.

Read more at https://huggingface.co/docs/diffusers/optimization/fp16#half-precision-weights

[1]:
%%time

import malaya
import logging
logging.basicConfig(level = logging.INFO)
CPU times: user 2.88 s, sys: 3.46 s, total: 6.34 s
Wall time: 2.21 s
/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
[2]:
import torch
[3]:
# https://discuss.pytorch.org/t/finding-model-size/130275

def get_model_size_mb(model):
    # the underlying transformers model is exposed as `model.model`;
    # sum the bytes used by parameters and buffers, then convert to MB
    param_size = 0
    for param in model.model.parameters():
        param_size += param.nelement() * param.element_size()
    buffer_size = 0
    for buffer in model.model.buffers():
        buffer_size += buffer.nelement() * buffer.element_size()
    return (param_size + buffer_size) / 1024**2

Load default precision, FP32#

[5]:
model = malaya.translation.huggingface(model = 'mesolitica/translation-t5-small-standard-bahasa-cased')
Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`,  it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again. You will see the new `added_tokens_decoder` attribute that will store the relevant information.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[6]:
get_model_size_mb(model)
[6]:
230.765625
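As a rough sanity check (a sketch based on the number above, not part of the original notebook): in FP32 each parameter occupies 4 bytes, so the reported size implies roughly 60 million parameters, ignoring buffers.

# back-of-the-envelope check: FP32 stores 4 bytes per parameter,
# so ~230.77 MB corresponds to roughly 60 million parameters (buffers ignored)
approx_params = 230.765625 * 1024**2 / 4
print(f'{approx_params / 1e6:.1f}M parameters')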
[7]:
model.generate(['i like chicken'])
/home/husein/.local/lib/python3.8/site-packages/transformers/generation/utils.py:1260: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
[7]:
['Saya suka ayam']

Load FP16#

This only works on GPU.

[9]:
model = malaya.translation.huggingface(model = 'mesolitica/translation-t5-small-standard-bahasa-cased',
                                            torch_dtype=torch.float16)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[10]:
get_model_size_mb(model)
[10]:
139.3828125
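For reference, a minimal sketch of the plain transformers call this is assumed to be equivalent to; the torch_dtype keyword is presumably forwarded to from_pretrained, which casts the weights to half precision at load time (an assumption, not verified against Malaya internals).

import torch
from transformers import T5ForConditionalGeneration

# sketch: load the same checkpoint directly in half precision
hf_model = T5ForConditionalGeneration.from_pretrained(
    'mesolitica/translation-t5-small-standard-bahasa-cased',
    torch_dtype=torch.float16,
)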

Load INT8#

Requires the latest versions of accelerate and bitsandbytes:

pip3 install accelerate bitsandbytes

This only works on GPU.

[12]:
model = malaya.translation.huggingface(model = 'mesolitica/translation-t5-small-standard-bahasa-cased',
                                            load_in_8bit=True, device_map='auto')
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[13]:
get_model_size_mb(model)
[13]:
109.3828125
[14]:
model.generate(['i like chicken'])
[14]:
['Saya suka ayam']
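For reference, a minimal sketch of the underlying 8-bit loading path in recent transformers releases, where quantization options are grouped into a BitsAndBytesConfig; the load_in_8bit keyword above is assumed to be forwarded in the same way.

from transformers import BitsAndBytesConfig, T5ForConditionalGeneration

# sketch: 8-bit weights via bitsandbytes, layers placed on GPU automatically
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
hf_model = T5ForConditionalGeneration.from_pretrained(
    'mesolitica/translation-t5-small-standard-bahasa-cased',
    quantization_config=quantization_config,
    device_map='auto',
)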

Load INT4#

Requires the latest versions of accelerate and bitsandbytes:

pip3 install accelerate bitsandbytes

This only works on GPU.

[15]:
model = malaya.translation.huggingface(model = 'mesolitica/translation-t5-small-standard-bahasa-cased',
                                            load_in_4bit=True, device_map='auto')
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[16]:
get_model_size_mb(model)
[16]:
94.3828125
[17]:
model.generate(['i like chicken'])
[17]:
['Saya suka ayam']
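Similarly, a minimal sketch of 4-bit loading with an explicit BitsAndBytesConfig; the compute dtype shown is an illustrative choice, not something the notebook sets.

import torch
from transformers import BitsAndBytesConfig, T5ForConditionalGeneration

# sketch: 4-bit weights via bitsandbytes; matmuls run in the chosen compute dtype
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
hf_model = T5ForConditionalGeneration.from_pretrained(
    'mesolitica/translation-t5-small-standard-bahasa-cased',
    quantization_config=quantization_config,
    device_map='auto',
)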