How Malaya gathered corpus?

Note

This tutorial is available as an IPython notebook here.

We use a translator to translate from a validated English dataset to a Bahasa dataset.

Everyone agree that Google Translate is the best online translator in this world, but the problem here, to subscribe the API from Google Cloud is really insane expensive.

Good thing about https://translate.google.com/, it open for public internet! So we just code a headless browser using Selenium with PhantomJS as the backbone, that’s all!

You can check the source code here, translator/

from translate_selenium import Translate, Translate_Concurrent

Translate a sentence

with open('sample-joy') as fopen:
    dataset = list(filter(None, fopen.read().split('\n')))
len(dataset)
18
translator = Translate(from_lang = 'en', to_lang = 'ms')

You can get list of supported language in here, https://cloud.google.com/translate/docs/languages

%%time
translator.translate(dataset[0])
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 1.23 s
'seorang lelaki yang saya mengagumi begitu banyak meminta saya untuk pergi bersamanya'

1.23 seconds, it took a very long time to translate a single sentence. What if you have 100k of sentences? It will cost you around 123000 seconds! insane to wait!

So, we provide multihreading translator, concurrently translate multi sentences.

Translate batch of strings

translators = Translate_Concurrent(batch_size = 3, from_lang = 'en', to_lang = 'ms')
%%time
translators.translate_batch(dataset[:3])
100%|███████████████████████████████████| 1/1 [00:01<00:00,  1.44s/it]
CPU times: user 8 ms, sys: 12 ms, total: 20 ms
Wall time: 1.44 s
['kawan yang sudah berkahwin rapat hanya mempunyai anak pertamanya',
 'pengenalan rapat menangis untuk saya saya merasa gembira kerana ada yang peduli',
 'seorang lelaki yang saya mengagumi begitu banyak meminta saya untuk pergi bersamanya']

See, we predicted 3 sentences at almost wall time. You can increase the batch_size to any size you want, limit is your spec now, this method will never make Google blocked your IP. Malaya already tested it more than 300k of sentences.

Remember, 1 translator took a quite toll, here I spawned 10 translators, look from my top,

PID   USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
14628 husein    20   0 3175700 398980  43036 S  33.6  2.4   5:38.05 phantomjs
14652 husein    20   0 3188824 408880  43084 S  29.9  2.5   5:34.62 phantomjs
14489 husein    20   0 3204708 411520  43064 S  28.6  2.5   5:35.29 phantomjs
14466 husein    20   0 3171668 400304  43008 S  24.6  2.5   5:26.74 phantomjs
14443 husein    20   0 3181056 403228  42916 S  21.9  2.5   5:26.24 phantomjs
14512 husein    20   0 3187592 416036  42956 S  20.3  2.6   5:30.03 phantomjs
14558 husein    20   0 3206104 419800  43640 S  19.9  2.6   5:30.76 phantomjs
14535 husein    20   0 3179416 405508  43196 S  18.3  2.5   5:27.54 phantomjs
14420 husein    20   0 3202472 422448  43064 S  17.6  2.6   5:26.78 phantomjs
14581 husein    20   0 3181132 401892  43056 S  16.3  2.5   5:33.48 phantomjs

1 translator cost me around,

PID   USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
14628 husein    20   0 3175700 398980  43036 S  33.6  2.4   5:38.05 phantomjs

My machine specifications,

H/W path       Device       Class          Description
======================================================
                            system         G1.Sniper H6 (To be filled by O.E.M.)
/0                          bus            G1.Sniper H6
/0/3d                       processor      Intel(R) Core(TM) i5-4690 CPU @ 3.50GHz
/0/42                       memory         16GiB System Memory
/0/42/0                     memory         DIMM [empty]
/0/42/1                     memory         8GiB DIMM DDR3 Synchronous 1600 MHz (0.6 ns)
/0/42/2                     memory         DIMM [empty]
/0/42/3                     memory         8GiB DIMM DDR3 Synchronous 1600 MHz (0.6 ns)
/0/100                      bridge         4th Gen Core Processor DRAM Controller
/0/100/1                    bridge         Xeon E3-1200 v3/4th Gen Core Processor PCI Express x16 Controller
/0/100/1/0                  display        GM206 [GeForce GTX 960]
/0/100/1/0.1                multimedia     NVIDIA Corporation

So, beware of your machine!