Different Precision#
This tutorial is available as an IPython notebook at Malaya/example/different-precision.
Read more at https://huggingface.co/docs/diffusers/optimization/fp16#half-precision-weights
[1]:
%%time
import malaya
import logging
logging.basicConfig(level = logging.INFO)
CPU times: user 2.88 s, sys: 3.46 s, total: 6.34 s
Wall time: 2.21 s
/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397
self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927
self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
[2]:
import torch
[3]:
# https://discuss.pytorch.org/t/finding-model-size/130275
def get_model_size_mb(model):
    param_size = 0
    for param in model.model.parameters():
        param_size += param.nelement() * param.element_size()
    buffer_size = 0
    for buffer in model.model.buffers():
        buffer_size += buffer.nelement() * buffer.element_size()
    return (param_size + buffer_size) / 1024**2
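As a sanity check on the sizes reported below, parameter memory can be estimated by hand from the parameter count and the bytes per element. The sketch below uses roughly 60.5M parameters as an approximate t5-small count (an assumption for illustration); real models add buffers, and casting may leave some layers in higher precision, so the measured numbers will not match exactly.

```python
def estimated_size_mb(n_params, bytes_per_param):
    # Parameter memory only; get_model_size_mb above also counts buffers.
    return n_params * bytes_per_param / 1024**2

n_params = 60_500_000  # rough t5-small parameter count (assumption)
print(round(estimated_size_mb(n_params, 4), 2))  # FP32: 230.79
print(round(estimated_size_mb(n_params, 2), 2))  # FP16: 115.39
print(round(estimated_size_mb(n_params, 1), 2))  # INT8: 57.7
```

The FP32 estimate lines up closely with the ~230 MB measured below; the FP16 and INT8 measurements come out larger than these naive estimates, likely because not every tensor is cast or quantized.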
Load default precision, FP32#
[5]:
model = malaya.translation.huggingface(model = 'mesolitica/translation-t5-small-standard-bahasa-cased')
Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`, it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again. You will see the new `added_tokens_decoder` attribute that will store the relevant information.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[6]:
get_model_size_mb(model)
[6]:
230.765625
[7]:
model.generate(['i like chicken'])
/home/husein/.local/lib/python3.8/site-packages/transformers/generation/utils.py:1260: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
warnings.warn(
[7]:
['Saya suka ayam']
Load FP16#
Only works on GPU.
[9]:
model = malaya.translation.huggingface(model = 'mesolitica/translation-t5-small-standard-bahasa-cased',
                                       torch_dtype = torch.float16)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[10]:
get_model_size_mb(model)
[10]:
139.3828125
Load INT8#
Requires the latest versions of accelerate and bitsandbytes,

pip3 install accelerate bitsandbytes

Only works on GPU.
[12]:
model = malaya.translation.huggingface(model = 'mesolitica/translation-t5-small-standard-bahasa-cased',
                                       load_in_8bit = True, device_map = 'auto')
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[13]:
get_model_size_mb(model)
[13]:
109.3828125
[14]:
model.generate(['i like chicken'])
[14]:
['Saya suka ayam']
Load INT4#
Requires the latest versions of accelerate and bitsandbytes,

pip3 install accelerate bitsandbytes

Only works on GPU.
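The storage saving of 4-bit loading comes from packing two quantized values into each byte. The sketch below illustrates that packing with plain integer nibbles; it is purely an illustration, not bitsandbytes' actual NF4/FP4 quantization scheme.

```python
def pack_nibbles(values):
    # Pack pairs of 4-bit integers (0-15) into single bytes:
    # the first value takes the high nibble, the second the low.
    assert all(0 <= v <= 15 for v in values)
    if len(values) % 2:
        values = values + [0]  # pad odd-length input
    return bytes((hi << 4) | lo for hi, lo in zip(values[::2], values[1::2]))

def unpack_nibbles(packed):
    out = []
    for b in packed:
        out.extend([b >> 4, b & 0x0F])
    return out

weights = [3, 15, 0, 7]
packed = pack_nibbles(weights)
print(len(packed))             # 2 bytes for 4 values
print(unpack_nibbles(packed))  # [3, 15, 0, 7]
```

In practice the measured INT4 size below is well above a quarter of FP32, since embeddings and other sensitive layers are typically kept in higher precision.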
[15]:
model = malaya.translation.huggingface(model = 'mesolitica/translation-t5-small-standard-bahasa-cased',
                                       load_in_4bit = True, device_map = 'auto')
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[16]:
get_model_size_mb(model)
[16]:
94.3828125
[17]:
model.generate(['i like chicken'])
[17]:
['Saya suka ayam']
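Putting the sizes reported above together, the compression ratio of each precision relative to FP32 can be computed directly (the numbers are copied from the outputs of `get_model_size_mb`):

```python
sizes_mb = {
    'fp32': 230.765625,
    'fp16': 139.3828125,
    'int8': 109.3828125,
    'int4': 94.3828125,
}
for name, mb in sizes_mb.items():
    # Ratio of FP32 size to this precision's size.
    print(name, round(sizes_mb['fp32'] / mb, 1))
```

The ratios (roughly 1.7x for FP16, 2.1x for INT8, 2.4x for INT4) fall short of the naive 2x/4x/8x, likely because embedding and normalization layers stay in higher precision.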