We want to make sure that not just the code is open-sourced but also the datasets, so everyone can validate our work.

You can find our open datasets in Malay-Dataset.


  1. Please cite the repository if you use these corpora.

  2. Please at least email us before distributing this data. Remember that we are giving away all of this hard work for free.

  3. What you see is just the data, but nobody sees how much it cost us to make it public.


Contact us at or if you want to contribute to the Bahasa datasets.