How did Malaya gather its corpus?#

Note

This tutorial is available as an IPython notebook here.

We use a machine translator to convert a validated English dataset into a Bahasa Malaysia (Malay) dataset.

Google Translate is widely regarded as one of the best online translators, but subscribing to the translation API from Google Cloud becomes very expensive at corpus scale.

The good thing about https://translate.google.com/ is that it is open to the public internet. So all we need is a headless browser, driven by Selenium with PhantomJS as the backend.

You can check the source code here: translator/

from translate_selenium import Translate, Translate_Concurrent

Translate a sentence#

with open('sample-joy') as fopen:
    dataset = list(filter(None, fopen.read().split('\n')))
len(dataset)
18
translator = Translate(from_lang = 'en', to_lang = 'ms')

You can find the list of supported languages here: https://cloud.google.com/translate/docs/languages

%%time
translator.translate(dataset[0])
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 1.23 s
'seorang lelaki yang saya mengagumi begitu banyak meminta saya untuk pergi bersamanya'

At 1.23 seconds, translating a single sentence is slow. What if you have 100k sentences? Translating them one by one would take around 123,000 seconds (about 34 hours)!
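The estimate above is plain multiplication. A minimal sketch, assuming every sentence takes about the same 1.23 s measured for the first one (real latency will vary with sentence length and network conditions); `sequential_estimate` is a hypothetical helper, not part of the translator module:

```python
from datetime import timedelta

SECONDS_PER_SENTENCE = 1.23  # wall time measured above for one sentence


def sequential_estimate(n_sentences, seconds_each=SECONDS_PER_SENTENCE):
    """Estimated wall time when sentences are translated one after another."""
    return timedelta(seconds=n_sentences * seconds_each)


estimate = sequential_estimate(100_000)
print(estimate.total_seconds())  # total duration in seconds
print(estimate)                  # same duration as days and hh:mm:ss
```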

So, we provide multithreading support so that you can translate multiple sentences concurrently.
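The idea behind translating concurrently can be sketched with Python's standard `ThreadPoolExecutor`. This is only an illustration, not the actual `Translate_Concurrent` implementation: `fake_translate` and `translate_concurrently` are hypothetical names, and `time.sleep` stands in for waiting on the browser and the network.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_translate(sentence):
    """Hypothetical stand-in for one slow, I/O-bound translation request."""
    time.sleep(0.1)  # pretend we are waiting on the browser / network
    return sentence.upper()


def translate_concurrently(sentences, workers=3):
    """Run fake_translate in parallel threads, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_translate, sentences))


sentences = ['first sentence', 'second sentence', 'third sentence']
start = time.perf_counter()
results = translate_concurrently(sentences)
elapsed = time.perf_counter() - start
print(results)
print(f'{elapsed:.2f} s for {len(sentences)} sentences')
```

Because each worker spends most of its time waiting rather than computing, threads overlap the waits, so the three calls should finish in roughly the time of one.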

Translate batch of strings#

translators = Translate_Concurrent(batch_size = 3, from_lang = 'en', to_lang = 'ms')
%%time
translators.translate_batch(dataset[:3])
100%|███████████████████████████████████| 1/1 [00:01<00:00,  1.44s/it]
CPU times: user 8 ms, sys: 12 ms, total: 20 ms
Wall time: 1.44 s
['kawan yang sudah berkahwin rapat hanya mempunyai anak pertamanya',
 'pengenalan rapat menangis untuk saya saya merasa gembira kerana ada yang peduli',
 'seorang lelaki yang saya mengagumi begitu banyak meminta saya untuk pergi bersamanya']

See, we translated 3 sentences in almost the same wall time as a single one. You can increase batch_size as far as your machine specs allow. In Malaya's own testing with more than 300k sentences, this method did not trigger Google to block the IP.
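To see what a larger batch_size could buy, here is a rough sketch that splits a dataset into batches and estimates the total wall time. It assumes every batch takes the 1.44 s measured above regardless of its size, which only holds while the machine has spare CPU and memory. `chunks` and `batched_estimate` are hypothetical helpers, not part of the translator module.

```python
import math


def chunks(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def batched_estimate(n_sentences, batch_size, seconds_per_batch=1.44):
    """Rough wall time in seconds if every batch takes seconds_per_batch."""
    return math.ceil(n_sentences / batch_size) * seconds_per_batch


demo = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
print(list(chunks(demo, 3)))          # the last batch may be shorter
print(batched_estimate(100_000, 3))   # seconds with batches of 3
print(batched_estimate(100_000, 10))  # seconds with batches of 10
```

Under that assumption, 100k sentences need about 48,000 seconds (roughly 13 hours) with batches of 3, and about 14,400 seconds (4 hours) with batches of 10.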

Remember that each translator takes quite a toll on the machine. Here I spawned 10 translators, and below is the output of the top command:

PID   USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
14628 husein    20   0 3175700 398980  43036 S  33.6  2.4   5:38.05 phantomjs
14652 husein    20   0 3188824 408880  43084 S  29.9  2.5   5:34.62 phantomjs
14489 husein    20   0 3204708 411520  43064 S  28.6  2.5   5:35.29 phantomjs
14466 husein    20   0 3171668 400304  43008 S  24.6  2.5   5:26.74 phantomjs
14443 husein    20   0 3181056 403228  42916 S  21.9  2.5   5:26.24 phantomjs
14512 husein    20   0 3187592 416036  42956 S  20.3  2.6   5:30.03 phantomjs
14558 husein    20   0 3206104 419800  43640 S  19.9  2.6   5:30.76 phantomjs
14535 husein    20   0 3179416 405508  43196 S  18.3  2.5   5:27.54 phantomjs
14420 husein    20   0 3202472 422448  43064 S  17.6  2.6   5:26.78 phantomjs
14581 husein    20   0 3181132 401892  43056 S  16.3  2.5   5:33.48 phantomjs

A single translator cost me around:

PID   USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
14628 husein    20   0 3175700 398980  43036 S  33.6  2.4   5:38.05 phantomjs
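To put numbers on the top output for the ten translators, the sketch below sums the RES and %CPU columns. It assumes top's default units (RES in KiB, %CPU relative to a single core); `summarize` is a hypothetical helper written for this illustration.

```python
# Rows copied from the top output for the ten phantomjs processes.
TOP_ROWS = """\
14628 husein    20   0 3175700 398980  43036 S  33.6  2.4   5:38.05 phantomjs
14652 husein    20   0 3188824 408880  43084 S  29.9  2.5   5:34.62 phantomjs
14489 husein    20   0 3204708 411520  43064 S  28.6  2.5   5:35.29 phantomjs
14466 husein    20   0 3171668 400304  43008 S  24.6  2.5   5:26.74 phantomjs
14443 husein    20   0 3181056 403228  42916 S  21.9  2.5   5:26.24 phantomjs
14512 husein    20   0 3187592 416036  42956 S  20.3  2.6   5:30.03 phantomjs
14558 husein    20   0 3206104 419800  43640 S  19.9  2.6   5:30.76 phantomjs
14535 husein    20   0 3179416 405508  43196 S  18.3  2.5   5:27.54 phantomjs
14420 husein    20   0 3202472 422448  43064 S  17.6  2.6   5:26.78 phantomjs
14581 husein    20   0 3181132 401892  43056 S  16.3  2.5   5:33.48 phantomjs
"""


def summarize(top_rows):
    """Sum resident memory (RES, KiB) and %CPU over rows of top output."""
    res_kib = 0
    cpu_percent = 0.0
    for row in top_rows.strip().splitlines():
        fields = row.split()
        res_kib += int(fields[5])        # RES column
        cpu_percent += float(fields[8])  # %CPU column
    return res_kib, cpu_percent


res_kib, cpu_percent = summarize(TOP_ROWS)
print(f'{res_kib / 1024 ** 2:.2f} GiB resident, {cpu_percent:.1f}% CPU')
```

In that snapshot the ten translators together hold about 3.9 GiB of resident memory and use about 231% CPU, that is, a bit more than two cores' worth of processing.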

My machine specifications:

H/W path       Device       Class          Description
======================================================
                            system         G1.Sniper H6 (To be filled by O.E.M.)
/0                          bus            G1.Sniper H6
/0/3d                       processor      Intel(R) Core(TM) i5-4690 CPU @ 3.50GHz
/0/42                       memory         16GiB System Memory
/0/42/0                     memory         DIMM [empty]
/0/42/1                     memory         8GiB DIMM DDR3 Synchronous 1600 MHz (0.6 ns)
/0/42/2                     memory         DIMM [empty]
/0/42/3                     memory         8GiB DIMM DDR3 Synchronous 1600 MHz (0.6 ns)
/0/100                      bridge         4th Gen Core Processor DRAM Controller
/0/100/1                    bridge         Xeon E3-1200 v3/4th Gen Core Processor PCI Express x16 Controller
/0/100/1/0                  display        GM206 [GeForce GTX 960]
/0/100/1/0.1                multimedia     NVIDIA Corporation

So, keep an eye on your machine's CPU usage!