Welcome to Malaya’s documentation!
Malaya is a Natural-Language-Toolkit library for bahasa Malaysia, powered by PyTorch.
Documentation#
Proper documentation is available at https://malaya.readthedocs.io/
Installing from PyPI#
$ pip install malaya
This installs all dependencies automatically except PyTorch, so you can choose your own PyTorch CPU or GPU build.
Only Python >= 3.6.0 and PyTorch >= 1.10 are supported.
If you are a Windows user, make sure to read https://malaya.readthedocs.io/en/latest/running-on-windows.html
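Because PyTorch is not installed automatically, it can help to verify the environment before importing Malaya. The following is a minimal sketch that checks the version requirements stated above. The helper names (`parse_version`, `meets_minimum`, `check_environment`) are hypothetical and are not part of Malaya; only the minimum versions come from this page.

```python
import importlib.util
import re
import sys

# Minimum versions quoted in the installation notes above.
MIN_PYTHON = (3, 6, 0)
MIN_TORCH = (1, 10)


def parse_version(text):
    """Return the leading numeric components of a version string as ints."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError("cannot parse version: {!r}".format(text))
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(version, minimum):
    """Compare two version tuples, padding the shorter one with zeros."""
    width = max(len(version), len(minimum))
    padded_version = tuple(version) + (0,) * (width - len(version))
    padded_minimum = tuple(minimum) + (0,) * (width - len(minimum))
    return padded_version >= padded_minimum


def check_environment():
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if not meets_minimum(tuple(sys.version_info[:3]), MIN_PYTHON):
        problems.append("Python >= 3.6.0 is required")
    # Hypothetical pre-flight check: look for PyTorch without importing it.
    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch is not installed; install a CPU or GPU build")
    else:
        import torch

        if not meets_minimum(parse_version(torch.__version__), MIN_TORCH):
            problems.append("PyTorch >= 1.10 is required")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

Versions are compared as integer tuples rather than strings, so that 1.9 correctly sorts below 1.10, and local build suffixes such as `+cu118` are ignored.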
Development Release#
Install from the master branch:
$ pip install git+https://github.com/huseinzol05/malaya.git
We recommend using virtualenv for development.
Documentation is available at https://malaya.readthedocs.io/en/latest/
Pretrained Models#
Malaya also releases pretrained Bahasa models; see Malaya/pretrained-model for details.
ALBERT, a Lite BERT for Self-supervised Learning of Language Representations, https://arxiv.org/abs/1909.11942
ALXLNET, a Lite XLNET, no paper produced.
BERT, Pre-training of Deep Bidirectional Transformers for Language Understanding, https://arxiv.org/abs/1810.04805
BigBird, Transformers for Longer Sequences, https://arxiv.org/abs/2007.14062
ELECTRA, Pre-training Text Encoders as Discriminators Rather Than Generators, https://arxiv.org/abs/2003.10555
GPT2, Language Models are Unsupervised Multitask Learners, https://github.com/openai/gpt-2
LM-Transformer, exactly like T5 but built with Tensor2Tensor instead of Mesh TensorFlow, with small tweaks, no paper produced.
PEGASUS, Pre-training with Extracted Gap-sentences for Abstractive Summarization, https://arxiv.org/abs/1912.08777
T5, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, https://arxiv.org/abs/1910.10683
TinyBERT, Distilling BERT for Natural Language Understanding, https://arxiv.org/abs/1909.10351
Word2Vec, Efficient Estimation of Word Representations in Vector Space, https://arxiv.org/abs/1301.3781
XLNET, Generalized Autoregressive Pretraining for Language Understanding, https://arxiv.org/abs/1906.08237
FNet, FNet: Mixing Tokens with Fourier Transforms, https://arxiv.org/abs/2105.03824
Fastformer, Fastformer: Additive Attention Can Be All You Need, https://arxiv.org/abs/2108.09084
MLM Scoring, Masked Language Model Scoring, https://arxiv.org/abs/1910.14659
Llama2, Llama 2: Open Foundation and Fine-Tuned Chat Models, https://arxiv.org/abs/2307.09288
Mistral, Mistral 7B, https://arxiv.org/abs/2310.06825
References#
If you use our software for research, please cite:
@misc{Malaya,
note = {Natural-Language-Toolkit library for bahasa Malaysia, powered by PyTorch},
author = {Husein, Zolkepli},
title = {Malaya},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/mesolitica/malaya}}
}
Acknowledgement#
Thanks to:
KeyReply for a private V100 cloud.
Mesolitica for a private RTX cloud.
Nvidia for Azure credit.
TensorFlow Research Cloud for free TPU access.
Contributing#
Thank you for contributing to this library; it really helps a lot. Feel free to contact me with suggestions or to contribute in other ways. We accept everything, not just code!
Contents:#
Getting Started
- Speech Toolkit
- Installation
- Dataset
- Running on Windows
- Contributing
- API
- malaya
- malaya.augmentation.abstractive
- malaya.augmentation.rules
- malaya.dictionary
- malaya.generator.isi_penting
- malaya.keyword.abstractive
- malaya.keyword.extractive
- malaya.normalizer.rules
- malaya.qa.extractive
- malaya.similarity.doc2vec
- malaya.similarity.semantic
- malaya.spelling_correction.jamspell
- malaya.spelling_correction.probability
- malaya.spelling_correction.spylls
- malaya.spelling_correction.symspell
- malaya.summarization.abstractive
- malaya.summarization.extractive
- malaya.topic_model.decomposition
- malaya.topic_model.transformer
- malaya.zero_shot.classification
- malaya.cluster
- malaya.constituency
- malaya.dependency
- malaya.embedding
- malaya.emotion
- malaya.entity
- malaya.jawi
- malaya.knowledge_graph
- malaya.language_detection
- malaya.language_model
- malaya.llm
- malaya.nsfw
- malaya.num2word
- malaya.paraphrase
- malaya.pos
- malaya.preprocessing
- malaya.segmentation
- malaya.sentiment
- malaya.stack
- malaya.stem
- malaya.syllable
- malaya.tatabahasa
- malaya.tokenizer
- malaya.transformer
- malaya.translation
- malaya.true_case
- malaya.word2num
- malaya.wordvector
- malaya.model.extractive_summarization
- malaya.model.ml
- malaya.model.rules
- malaya.torch_model.gpt2_lm
- malaya.torch_model.huggingface
- malaya.torch_model.llm
- malaya.torch_model.mask_lm
GPU Environment
Pre-trained model
Augmentation Module
Dictionary Module
Tokenization Module
Language Model Module
Spelling Correction Module
Normalization Module
- Preprocessing
- Demoji
- Stemmer and Lemmatization
- True Case
- Segmentation
- Num2Word
- Word2Num
- Rules based Normalizer
- Load normalizer
- Use translator
- Use segmenter
- Use stemmer
- Validate uppercase
- Validate non human word
- Skip spelling correction
- Pass kwargs preprocessing
- Normalize text
- Normalize url
- Normalize email
- Normalize year
- Normalize telephone
- Normalize date
- Normalize time
- Normalize emoji
- Normalize elongated
- Normalize hingga
- Normalize pada hari bulan
- Normalize fraction
- Normalize money
- Normalize units
- Normalize percents
- Normalize IC
- Normalize Numbers
- Normalize x kali
- Normalize Cardinals
- Normalize Ordinals
- Normalize entity
Jawi Module
Kesalahan Tatabahasa Module
Generative Module
Classification Module
Similarity Module
Parsing Module
Summarization Module
Translation Module
Question Answer Module
Zeroshot Module
Topic Modeling Module
Keyword Module
Knowledge Graph