We want to open-source not just the code but the datasets as well, so that everyone can help validate them.

You can find the datasets we used in our Malay-Dataset repository.


  1. Please cite the repository if you use any of our corpora.

  2. We kindly ask that you email us before redistributing any of our datasets. Remember that all of this is our hard work, and we are giving it away for free.

  3. What you see is only the published data; what nobody sees is how much we spent to make it public.


Contact us if you want to contribute to the Bahasa dataset.