Quantization#
This tutorial is available as an IPython notebook at Malaya/example/quantization.
We provide a quantized model for all Malaya models. For example, the sentiment transformer models:
[1]:
import malaya
malaya.sentiment.available_transformer()
INFO:root:tested on 20% test set.
[1]:
| Model | Size (MB) | Quantized Size (MB) | macro precision | macro recall | macro f1-score |
|---|---|---|---|---|---|
| bert | 425.6 | 111.00 | 0.99330 | 0.99330 | 0.99329 |
| tiny-bert | 57.4 | 15.40 | 0.98774 | 0.98774 | 0.98774 |
| albert | 48.6 | 12.80 | 0.99227 | 0.99226 | 0.99226 |
| tiny-albert | 22.4 | 5.98 | 0.98554 | 0.98550 | 0.98551 |
| xlnet | 446.6 | 118.00 | 0.99353 | 0.99353 | 0.99353 |
| alxlnet | 46.8 | 13.30 | 0.99188 | 0.99188 | 0.99188 |
A quantized model is usually about 4x smaller than the original. Quantization converts all eligible floating-point constants to quantized constants, storing only summary statistics (such as the mean and standard deviation) of the original floating-point values alongside the quantized constants.
Note that a quantized model is not necessarily faster, because TensorFlow casts back to FP32 during feed-forward for certain operations.
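To make the size/precision trade-off concrete, here is a minimal NumPy sketch of one common post-training quantization scheme, affine 8-bit quantization, which stores the value range instead of full-precision weights. This is an illustration only, not Malaya's or TensorFlow's actual implementation.

```python
import numpy as np

def quantize(weights):
    """Map float32 weights to uint8, keeping only the range (lo, scale) for recovery."""
    lo, hi = float(weights.min()), float(weights.max())
    scale = (hi - lo) / 255.0
    q = np.round((weights - lo) / scale).astype(np.uint8)
    return q, lo, scale

def dequantize(q, lo, scale):
    """Cast back to float32, as happens before certain FP32-only operations."""
    return q.astype(np.float32) * scale + lo

w = np.random.randn(64, 64).astype(np.float32)
q, lo, scale = quantize(w)
w_restored = dequantize(q, lo, scale)

# uint8 storage is 4x smaller than float32, at a small precision cost
assert q.nbytes * 4 == w.nbytes
assert np.allclose(w, w_restored, atol=scale)
```

The round trip through uint8 explains both observations above: the 4x compression ratio in the table, and why inference is not automatically faster, since the dequantize step reintroduces FP32 arithmetic.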
Use quantized model#
Simply set the `quantized` parameter to `True`; the default is `False`.
[2]:
albert_quantized = malaya.sentiment.transformer(model = 'albert', quantized = True)
albert = malaya.sentiment.transformer(model = 'albert')
WARNING:root:Load quantized model will cause accuracy drop.
INFO:root:running sentiment/albert-quantized using device /device:CPU:0
INFO:root:running sentiment/albert using device /device:CPU:0
[3]:
string = 'saya masam awak pun masam'
[5]:
%%time
albert.predict([string])
CPU times: user 171 ms, sys: 15.9 ms, total: 187 ms
Wall time: 47.2 ms
[5]:
['negative']
[9]:
%%time
albert_quantized.predict([string])
CPU times: user 181 ms, sys: 41.1 ms, total: 223 ms
Wall time: 53.8 ms
[9]:
['negative']