Quantization#

This tutorial is available as an IPython notebook at Malaya/example/quantization.

We provide quantized models for all Malaya models, for example, the sentiment transformer models:

[1]:
import malaya

malaya.sentiment.available_transformer()
INFO:root:tested on 20% test set.
[1]:
              Size (MB)  Quantized Size (MB)  macro precision  macro recall  macro f1-score
bert              425.6               111.00          0.99330       0.99330         0.99329
tiny-bert          57.4                15.40          0.98774       0.98774         0.98774
albert             48.6                12.80          0.99227       0.99226         0.99226
tiny-albert        22.4                 5.98          0.98554       0.98550         0.98551
xlnet             446.6               118.00          0.99353       0.99353         0.99353
alxlnet            46.8                13.30          0.99188       0.99188         0.99188

A quantized model is usually about 4x smaller than the original. Quantization converts all possible floating-point constants to quantized constants, storing only the mean and standard deviation of the floating-point constants together with the quantized constants.
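
To make the idea concrete, here is a minimal NumPy sketch of range-based weight quantization. This is an illustration only, not Malaya's actual implementation: it maps a float32 array to uint8 and keeps just the statistics needed to reconstruct approximate float values.

import numpy as np

# Illustration only, not Malaya's actual implementation.
# Quantize a float32 weight matrix to uint8 (4x smaller),
# keeping only the statistics needed to dequantize later.
weights = np.random.randn(4, 4).astype(np.float32)

w_min, w_max = weights.min(), weights.max()
scale = (w_max - w_min) / 255.0
quantized = np.round((weights - w_min) / scale).astype(np.uint8)

# Operations that need FP32 cast back during feed-forward,
# which is why quantization shrinks storage but is not necessarily faster.
dequantized = quantized.astype(np.float32) * scale + w_min
print(np.abs(weights - dequantized).max())  # small reconstruction error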

Note that a quantized model is not necessarily faster, because TensorFlow will cast back to FP32 during feed-forward for certain operations.

Use a quantized model#

Simply set the quantized parameter to True; the default is False.

[2]:
albert_quantized = malaya.sentiment.transformer(model = 'albert', quantized = True)
albert = malaya.sentiment.transformer(model = 'albert')
WARNING:root:Load quantized model will cause accuracy drop.
INFO:root:running sentiment/albert-quantized using device /device:CPU:0
INFO:root:running sentiment/albert using device /device:CPU:0
[3]:
string = 'saya masam awak pun masam'
[5]:
%%time

albert.predict([string])
CPU times: user 171 ms, sys: 15.9 ms, total: 187 ms
Wall time: 47.2 ms
[5]:
['negative']
[9]:
%%time

albert_quantized.predict([string])
CPU times: user 181 ms, sys: 41.1 ms, total: 223 ms
Wall time: 53.8 ms
[9]:
['negative']
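
Single %%time measurements can be noisy. If you want a more stable comparison between the FP32 and quantized models, average over repeated calls; a generic sketch using only the predict method shown above:

import timeit

# Average latency over repeated single-string predictions;
# one-off %%time measurements vary run to run.
n = 10
fp32_avg = timeit.timeit(lambda: albert.predict([string]), number = n) / n
quantized_avg = timeit.timeit(lambda: albert_quantized.predict([string]), number = n) / n
print(f'albert: {fp32_avg:.3f}s, albert-quantized: {quantized_avg:.3f}s')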