Welcome to Malaya’s documentation!#
Malaya is a Natural-Language-Toolkit library for bahasa Malaysia, powered by PyTorch.
Documentation#
Proper documentation is available at https://malaya.readthedocs.io/
Installing from PyPI#
$ pip install malaya
It automatically installs all dependencies except PyTorch, so you can choose your own PyTorch CPU / GPU build.
Only Python >= 3.6.0 and PyTorch >= 1.10 are supported.
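For example, to install a CPU-only PyTorch build you can use the official PyTorch wheel index; check https://pytorch.org for the exact command matching your platform and CUDA version:
$ pip install torch --index-url https://download.pytorch.org/whl/cpu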
If you are a Windows user, make sure to read https://malaya.readthedocs.io/en/latest/running-on-windows.html
Development Release#
Install from master branch,
$ pip install git+https://github.com/huseinzol05/malaya.git
We recommend using virtualenv for development.
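A minimal virtualenv workflow, assuming Python 3 with the built-in venv module, looks like this:
$ python -m venv .venv
$ source .venv/bin/activate
$ pip install git+https://github.com/huseinzol05/malaya.git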
Documentation at https://malaya.readthedocs.io/en/latest/
Pretrained Models#
Malaya also releases pretrained Bahasa models; simply check Malaya/pretrained-model. A usage sketch follows the list below.
ALBERT, a Lite BERT for Self-supervised Learning of Language Representations, https://arxiv.org/abs/1909.11942
ALXLNET, a Lite XLNET, no paper produced.
BERT, Pre-training of Deep Bidirectional Transformers for Language Understanding, https://arxiv.org/abs/1810.04805
BigBird, Transformers for Longer Sequences, https://arxiv.org/abs/2007.14062
ELECTRA, Pre-training Text Encoders as Discriminators Rather Than Generators, https://arxiv.org/abs/2003.10555
GPT2, Language Models are Unsupervised Multitask Learners, https://github.com/openai/gpt-2
LM-Transformer, exactly like T5 but implemented in Tensor2Tensor instead of Mesh TensorFlow, with small tweaks; no paper produced.
PEGASUS, Pre-training with Extracted Gap-sentences for Abstractive Summarization, https://arxiv.org/abs/1912.08777
T5, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, https://arxiv.org/abs/1910.10683
TinyBERT, Distilling BERT for Natural Language Understanding, https://arxiv.org/abs/1909.10351
Word2Vec, Efficient Estimation of Word Representations in Vector Space, https://arxiv.org/abs/1301.3781
XLNET, Generalized Autoregressive Pretraining for Language Understanding, https://arxiv.org/abs/1906.08237
FNet, FNet: Mixing Tokens with Fourier Transforms, https://arxiv.org/abs/2105.03824
Fastformer, Fastformer: Additive Attention Can Be All You Need, https://arxiv.org/abs/2108.09084
MLM Scoring, Masked Language Model Scoring, https://arxiv.org/abs/1910.14659
Llama2, Llama 2: Open Foundation and Fine-Tuned Chat Models, https://arxiv.org/abs/2307.09288
Mistral, Mistral 7B, https://arxiv.org/abs/2310.06825
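As a quick illustration, loading one of these pretrained models might look like the sketch below. The function and model names here are assumptions for illustration only; consult the malaya.transformer API reference for the actual entry points in your installed version.

import malaya

# Hypothetical sketch: load a pretrained Bahasa encoder and embed a sentence.
# `huggingface`, `vectorize`, and the model name are assumed for illustration;
# check the malaya.transformer docs for the real signatures.
model = malaya.transformer.huggingface(model='mesolitica/roberta-base-bahasa-cased')
vectors = model.vectorize(['Saya suka makan nasi lemak.'])
print(vectors.shape)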
References#
If you use our software for research, please cite:
@misc{Malaya,
  author = {Husein, Zolkepli},
  title = {Malaya: Natural-Language-Toolkit library for bahasa Malaysia, powered by PyTorch},
  year = {2018},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/mesolitica/malaya}}
}
Acknowledgement#
Thanks to,
KeyReply for private V100s cloud.
Mesolitica for private RTXs cloud.
Nvidia for Azure credit.
TensorFlow Research Cloud for free TPU access.
Contributing#
Thank you for contributing to this library; it really helps a lot. Feel free to contact me with suggestions, or to contribute in other forms. We accept everything, not just code!
Contents:#
- Speech Toolkit
- Installation
- Dataset
- Running on Windows
- Contributing
- API
- malaya
- malaya.augmentation.abstractive
- malaya.augmentation.rules
- malaya.dictionary
- malaya.generator.isi_penting
- malaya.keyword.abstractive
- malaya.keyword.extractive
- malaya.normalizer.rules
- malaya.qa.extractive
- malaya.similarity.doc2vec
- malaya.similarity.semantic
- malaya.spelling_correction.jamspell
- malaya.spelling_correction.probability
- malaya.spelling_correction.spylls
- malaya.spelling_correction.symspell
- malaya.summarization.abstractive
- malaya.summarization.extractive
- malaya.topic_model.decomposition
- malaya.topic_model.transformer
- malaya.zero_shot.classification
- malaya.cluster
- malaya.constituency
- malaya.dependency
- malaya.embedding
- malaya.emotion
- malaya.entity
- malaya.jawi
- malaya.knowledge_graph
- malaya.language_detection
- malaya.language_model
- malaya.llm
- malaya.nsfw
- malaya.num2word
- malaya.paraphrase
- malaya.pos
- malaya.preprocessing
- malaya.segmentation
- malaya.sentiment
- malaya.stack
- malaya.stem
- malaya.syllable
- malaya.tatabahasa
- malaya.tokenizer
- malaya.transformer
- malaya.translation
- malaya.true_case
- malaya.word2num
- malaya.wordvector
- malaya.model.extractive_summarization
- malaya.model.ml
- malaya.model.rules
- malaya.torch_model.gpt2_lm
- malaya.torch_model.huggingface
- malaya.torch_model.llm
- malaya.torch_model.mask_lm
- Preprocessing
- Demoji
- Stemmer and Lemmatization
- True Case
- Segmentation
- Num2Word
- Word2Num
- Rules based Normalizer
- Load normalizer
- Use translator
- Use segmenter
- Use stemmer
- Validate uppercase
- Validate non human word
- Skip spelling correction
- Pass kwargs preprocessing
- Normalize text
- Normalize url
- Normalize email
- Normalize year
- Normalize telephone
- Normalize date
- Normalize time
- Normalize emoji
- Normalize elongated
- Normalize hingga
- Normalize pada hari bulan
- Normalize fraction
- Normalize money
- Normalize units
- Normalize percents
- Normalize IC
- Normalize Numbers
- Normalize x kali
- Normalize Cardinals
- Normalize Ordinals
- Normalize entity