Deployment#
This tutorial is available as an IPython notebook at Malaya/example/deployment.
[1]:
import malaya
Disable file validation#
If you deploy Malaya models on persistent, short-lived (auto-restart to reduce memory consumption), or async / multiprocess workers, you might get errors related to file checking. You can skip this check as long as you are able to persist the Malaya models.
download model#
First, you need to download the model into your local machine / environment. Run this in a separate script,
[5]:
model = malaya.zero_shot.classification.transformer(model = 'tiny-albert')
INFO:tensorflow:loading sentence piece model
load model#
Load the model without the file check; run this on top of FastAPI / Flask / Gunicorn.
[6]:
model = malaya.zero_shot.classification.transformer(model = 'tiny-albert', validate = False)
INFO:tensorflow:loading sentence piece model
This loaded model can be shared among multiple workers / threads.
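A minimal sketch of how this might look on top of FastAPI; the endpoint path, example labels, and the predict_proba call are illustrative assumptions, not the exact Malaya deployment code.

import malaya
from fastapi import FastAPI

app = FastAPI()

# loaded once at import time; every thread inside a worker reuses this object,
# and each gunicorn / uvicorn worker loads its own copy at startup
model = malaya.zero_shot.classification.transformer(
    model = 'tiny-albert', validate = False
)

@app.get('/')
def classify(string: str):
    # assumption: zero-shot classification exposes predict_proba(strings, labels)
    return model.predict_proba([string], labels = ['negatif', 'positif'])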
disable type checking#
Make sure you have installed the latest version of herpetologist,
pip install herpetologist -U
If you check the Malaya source code, you can see that we check parameters on function / method definitions, https://github.com/huseinzol05/Malaya/blob/master/malaya/model/bert.py#L232
We use herpetologist to check passed variables, https://github.com/huseinzol05/herpetologist
@check_type
def predict(self, strings: List[str], add_neutral: bool = True):
    """
    classify a string.

    Parameters
    ----------
    strings: List[str]
    add_neutral: bool, optional (default=True)
        if True, it will add neutral probability.

    Returns
    -------
    result: List[str]
    """
@check_type will check whether strings is a List[str] or not; if not, it will throw an error. But @check_type becomes expensive if you have a massive list of strings. In some of our environments we want it enabled, in others we want it disabled, and we do not want herpetologist to keep checking the variables. To disable it, simply set a bash environment variable,
export ENABLE_HERPETOLOGIST=false
Or, using Python,
import os
os.environ['ENABLE_HERPETOLOGIST'] = 'false'
You can see the impact on execution time in this example.
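A rough way to check that impact yourself: run the sketch below twice, once with the flag exported and once without, and compare the timings. The predict_proba call and labels are assumptions, and the actual difference depends on your model and list size.

# run this script twice, once after `export ENABLE_HERPETOLOGIST=false` and once
# without it; the difference comes from @check_type validating every list element
import time
import malaya

model = malaya.zero_shot.classification.transformer(
    model = 'tiny-albert', validate = False
)
strings = ['husein sangat comel dan handsome tambahan lagi ketiak wangi'] * 5000

start = time.time()
model.predict_proba(strings, labels = ['negatif', 'positif'])
print('time taken:', time.time() - start)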
Use smaller model#
Stacking multiple smaller models is much faster than a single big model, but this cannot guarantee the accuracy will be the same as the big model.
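A minimal sketch of that idea, assuming malaya.stack.predict_stack and these smaller sentiment model names are available in your Malaya version.

import malaya

# assumption: these model names and malaya.stack.predict_stack exist in your
# Malaya version; stacking combines the probabilities of both smaller models
albert = malaya.sentiment.transformer(model = 'tiny-albert', validate = False)
bert = malaya.sentiment.transformer(model = 'tiny-bert', validate = False)

result = malaya.stack.predict_stack(
    [albert, bert],
    ['husein sangat comel dan handsome tambahan lagi ketiak wangi'],
)
print(result)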
docker example#
You can check some Docker examples and benchmarks here, https://github.com/huseinzol05/Malaya/tree/master/misc/deployment.
The purpose of these benchmarks is to measure how fast and how many requests a model can serve with perfect mini-batch realtime processing, say, live streaming data from social media for sentiment detection, whether a text is negative or positive. Tested on the ALBERT-BASE sentiment model.
These are my machine specifications,
Intel(R) Core(TM) i7-8557U CPU @ 1.70GHz
16 GB 2133 MHz LPDDR3
And I use the same wrk command,
wrk -t15 -c600 -d1m --timeout=15s http://localhost:8080/?string=husein%20sangat%20comel%20dan%20handsome%20tambahan%20lagi%20ketiak%20wangi
Some constraints,
ALBERT BASE is around 43MB.
Memory limit is 2GB, set by Docker itself.
Batch size of 50 strings, the string husein sangat comel dan handsome tambahan lagi ketiak wangi duplicated 50 times; you can check every deployment in app.py or main.py (a sketch of such an endpoint follows this list).
No limit on CPU usage.
No caching.
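Not the exact app.py, just a sketch of the kind of endpoint being benchmarked; the Flask route and the sentiment model call (following the predict signature shown earlier) are assumptions.

import malaya
from flask import Flask, request

app = Flask(__name__)
model = malaya.sentiment.transformer(model = 'albert', validate = False)

@app.route('/')
def index():
    # duplicate the incoming string 50 times to form the batch of 50 described above
    string = request.args.get('string', '')
    batch = [string] * 50
    return str(model.predict(batch))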
fast-api#
workers automatically calculated by fast-api, https://github.com/huseinzol05/Malaya/tree/master/misc/deployment/fast-api
Running 1m test @ http://localhost:8080/?string=husein%20sangat%20comel%20dan%20handsome%20tambahan%20lagi%20ketiak%20wangi
15 threads and 600 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 0.00us 0.00us 0.00us nan%
Req/Sec 0.24 1.16 9.00 95.52%
68 requests in 1.00m, 8.96KB read
Socket errors: connect 364, read 293, write 0, timeout 68
Requests/sec: 1.13
Transfer/sec: 152.75B
Gunicorn Flask#
5 sync workers, https://github.com/huseinzol05/Malaya/tree/master/misc/deployment/gunicorn-flask
Running 1m test @ http://localhost:8080/?string=husein%20sangat%20comel%20dan%20handsome%20tambahan%20lagi%20ketiak%20wangi
15 threads and 600 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 7.98s 3.25s 12.71s 41.67%
Req/Sec 0.49 1.51 9.00 90.91%
59 requests in 1.00m, 9.10KB read
Socket errors: connect 364, read 39, write 0, timeout 47
Requests/sec: 0.98
Transfer/sec: 155.12B
UWSGI Flask + Auto scaling#
Min 2 workers, max 10 workers, spare2 algorithm, https://github.com/huseinzol05/Malaya/tree/master/misc/deployment/uwsgi-flask-cheaper
Running 1m test @ http://localhost:8080/?string=husein%20sangat%20comel%20dan%20handsome%20tambahan%20lagi%20ketiak%20wangi
15 threads and 600 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 8.80s 4.16s 14.73s 62.50%
Req/Sec 0.75 2.60 9.00 91.67%
12 requests in 1.00m, 0.90KB read
Socket errors: connect 364, read 105, write 0, timeout 4
Requests/sec: 0.20
Transfer/sec: 15.37B
UWSGI Flask#
4 Workers, https://github.com/huseinzol05/Malaya/tree/master/misc/deployment/uwsgi-flask-fork
Running 1m test @ http://localhost:8080/?string=husein%20sangat%20comel%20dan%20handsome%20tambahan%20lagi%20ketiak%20wangi
15 threads and 600 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 8.79s 4.13s 14.87s 53.33%
Req/Sec 1.06 3.16 20.00 92.59%
56 requests in 1.00m, 4.21KB read
Socket errors: connect 364, read 345, write 0, timeout 41
Requests/sec: 0.93
Transfer/sec: 71.74B
Learn different deployment techniques#
E.g., change concurrent requests into mini-batch realtime processing to speed up text classification, repository
This can reduce time taken by up to 95%!
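The core idea of that technique, in a very reduced form, is to queue incoming requests and run the model on small batches instead of once per request; this is a sketch, not the repository's actual code, and the model call is an assumption.

import threading
import queue
import malaya

model = malaya.sentiment.transformer(model = 'tiny-albert', validate = False)
requests_queue = queue.Queue()

def worker(max_batch = 50, timeout = 0.1):
    # collect up to `max_batch` pending requests, run one predict over all of
    # them, then hand each caller its own result back
    while True:
        batch = [requests_queue.get()]
        try:
            while len(batch) < max_batch:
                batch.append(requests_queue.get(timeout = timeout))
        except queue.Empty:
            pass
        results = model.predict([string for string, _ in batch])
        for (_, holder), result in zip(batch, results):
            holder['result'] = result
            holder['event'].set()

threading.Thread(target = worker, daemon = True).start()

def classify(string):
    # called by each web request; blocks until the batching worker responds
    holder = {'event': threading.Event(), 'result': None}
    requests_queue.put((string, holder))
    holder['event'].wait()
    return holder['result']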