Welcome to Malaya’s documentation!
Malaya is a Natural-Language-Toolkit library for Bahasa Malaysia, powered by TensorFlow and PyTorch.
Documentation
Proper documentation is available at https://malaya.readthedocs.io/
Installing from PyPI
$ pip install malaya
It will automatically install all dependencies except TensorFlow and PyTorch, so you can choose your own TensorFlow CPU / GPU version and PyTorch CPU / GPU version.
Only Python >= 3.6.0, TensorFlow >= 1.15.0, and PyTorch >= 1.10 are supported.
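A quick sanity check of those requirements (a minimal sketch; it assumes you have already installed your preferred TensorFlow and PyTorch builds):

import sys

import tensorflow as tf
import torch

# Malaya itself requires Python >= 3.6.0.
assert sys.version_info >= (3, 6), 'Malaya requires Python >= 3.6.0'
print('TensorFlow:', tf.__version__)  # expect >= 1.15.0
print('PyTorch:', torch.__version__)  # expect >= 1.10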
If you are a Windows user, make sure to read https://malaya.readthedocs.io/en/latest/running-on-windows.html
Development Release
Install from the master branch,
$ pip install git+https://github.com/huseinzol05/malaya.git
We recommend using virtualenv for development.
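For example (a generic virtual-environment workflow using the standard venv module, not specific to Malaya):

$ python -m venv malaya-dev
$ source malaya-dev/bin/activate
$ pip install git+https://github.com/huseinzol05/malaya.git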
Documentation at https://malaya.readthedocs.io/en/latest/
Features
Alignment, translation word alignment using Eflomal and pretrained Transformer models.
Abstractive text augmentation, augment any text into a social-media text style using T5-Bahasa.
Encoder text augmentation, augment any text using Wordvector or Transformer-Bahasa word-replacement techniques.
Rules based text augmentation, augment any text using a synonym dictionary and rules.
Isi Penting Generator, generate text from a list of isi penting (key points) using T5-Bahasa.
Prefix Generator, generate text from a prefix using GPT2-Bahasa.
Abstractive Keyword, provide abstractive keywords using T5-Bahasa.
Extractive Keyword, provide RAKE, TextRank and an Attention Mechanism hybrid with Transformer-Bahasa.
Abstractive Normalizer, normalize any Malay texts using T5-Bahasa.
Rules based Normalizer, normalize any Malay texts using local Malaysian NLP research combined with Transformer-Bahasa.
Extractive QA, reading comprehension using T5-Bahasa and Flan-T5.
Doc2Vec Similarity, provide Word2Vec and Encoder interface for text similarity.
Semantic Similarity, provide semantic similarity using T5-Bahasa.
Spelling Correction, auto-correct any Malay word using local Malaysian NLP research combined with Transformer-Bahasa, plus NeuSpell using T5-Bahasa.
Abstractive Summarization, provide abstractive summarization using T5-Bahasa.
Extractive Summarization, provide an extractive interface using Transformer-Bahasa and Doc2Vec.
Topic Modeling, provide Transformer-Bahasa, LDA2Vec, LDA, NMF, LSA interface and easy BERTopic integration.
EN-MS Translation, translate English to standard Malay using T5-Bahasa (see the usage sketch after this list).
MS-EN Translation, translate standard Malay to English using T5-Bahasa.
Zero-shot classification, provide a zero-shot classification interface using Transformer-Bahasa to recognize texts without any labeled training data.
Zero-shot Entity Recognition, provide a zero-shot entity tagging interface using Transformer-Bahasa to extract entities.
Constituency Parsing, breaking a text into sub-phrases using finetuned Transformer-Bahasa.
Coreference Resolution, finding all expressions that refer to the same entity in a text using Dependency Parsing models.
Dependency Parsing, extracting a dependency parse of a sentence using finetuned Transformer-Bahasa and T5-Bahasa.
Emotion Analysis, detect and recognize 6 different emotions in texts using finetuned Transformer-Bahasa.
Entity Recognition, locate and classify named entities mentioned in text using finetuned Transformer-Bahasa.
Jawi-to-Rumi, convert from Jawi to Rumi using Transformer.
Language Detection, using fastText and a sparse deep learning model to classify Malay (formal and social media), Indonesian (formal and social media), Rojak language and Manglish.
Language Model, text scoring using KenLM, masked language models (BERT, ALBERT and RoBERTa), and GPT2.
NSFW Detection, detect NSFW text using rules and a subword Naive Bayes model.
Num2Word, convert from numbers to cardinal or ordinal representation.
Paraphrase, provide Abstractive Paraphrase using T5-Bahasa and Transformer-Bahasa.
Grapheme-to-Phoneme, convert from grapheme to phoneme (DBP or IPA) using a state-of-the-art LSTM Seq2Seq with attention.
Part-of-Speech Recognition, grammatical tagging of words in a text using finetuned Transformer-Bahasa.
Relevancy Analysis, detect and recognize relevancy of texts using finetuned Transformer-Bahasa.
Rumi-to-Jawi, convert from Rumi to Jawi using Transformer.
Text Segmentation, dividing written text into meaningful words using T5-Bahasa.
Sentiment Analysis, detect and recognize polarity of texts using finetuned Transformer-Bahasa (see the usage sketch after this list).
Text Similarity, provide an interface for lexical similarity and deep semantic similarity using finetuned Transformer-Bahasa.
Stemmer, Bahasa stemming (including local language structures) using a state-of-the-art BPE LSTM Seq2Seq with attention.
Subjectivity Analysis, detect and recognize self-opinion polarity of texts using finetuned Transformer-Bahasa.
Kesalahan Tatabahasa, fix kesalahan tatabahasa (grammatical errors) using TransformerTag-Bahasa.
Tokenizer, provide word, sentence and syllable tokenizers.
Toxicity Analysis, detect and recognize 27 different toxicity patterns of texts using finetuned Transformer-Bahasa.
Transformer, provide an easy interface to load pretrained Malaya language models.
True Case, provide true casing utility using T5-Bahasa.
Word2Num, convert from cardinal or ordinal representation to numbers.
Word2Vec, provide pretrained Malay Wikipedia and Malay news Word2Vec, with an easy interface and visualization.
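As a taste of the interface, here is a minimal sketch covering the EN-MS Translation and Sentiment Analysis features above. The module paths (malaya.sentiment, malaya.translation.en_ms) match the API reference below, but the loader and method names (transformer, predict_proba, greedy_decoder) follow older Malaya releases and may differ in your version; check the documentation for the exact signatures.

import malaya

# Sentiment Analysis: load a finetuned Transformer-Bahasa model
# (loader name assumed; see malaya.sentiment in the API reference).
sentiment = malaya.sentiment.transformer(model='bert')
print(sentiment.predict_proba(['kerajaan sebenarnya sangat prihatin dengan rakyat']))

# EN-MS Translation (loader and decoding method assumed).
translator = malaya.translation.en_ms.transformer()
print(translator.greedy_decoder(['I love Malaysia very much.']))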
Pretrained Models
Malaya has also released pretrained Bahasa models; simply check Malaya/pretrained-model. A loading sketch follows the list below.
ALBERT, a Lite BERT for Self-supervised Learning of Language Representations, https://arxiv.org/abs/1909.11942
ALXLNET, a Lite XLNET, no paper produced.
BERT, Pre-training of Deep Bidirectional Transformers for Language Understanding, https://arxiv.org/abs/1810.04805
BigBird, Transformers for Longer Sequences, https://arxiv.org/abs/2007.14062
ELECTRA, Pre-training Text Encoders as Discriminators Rather Than Generators, https://arxiv.org/abs/2003.10555
GPT2, Language Models are Unsupervised Multitask Learners, https://github.com/openai/gpt-2
LM-Transformer, exactly like T5, but uses Tensor2Tensor instead of Mesh TensorFlow with small tweaks, no paper produced.
PEGASUS, Pre-training with Extracted Gap-sentences for Abstractive Summarization, https://arxiv.org/abs/1912.08777
T5, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, https://arxiv.org/abs/1910.10683
TinyBERT, Distilling BERT for Natural Language Understanding, https://arxiv.org/abs/1909.10351
Word2Vec, Efficient Estimation of Word Representations in Vector Space, https://arxiv.org/abs/1301.3781
XLNET, Generalized Autoregressive Pretraining for Language Understanding, https://arxiv.org/abs/1906.08237
FNet, FNet: Mixing Tokens with Fourier Transforms, https://arxiv.org/abs/2105.03824
Fastformer, Fastformer: Additive Attention Can Be All You Need, https://arxiv.org/abs/2108.09084
MLM Scoring, Masked Language Model Scoring, https://arxiv.org/abs/1910.14659
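These encoders can be loaded through the malaya.transformer module listed in the API reference. A minimal sketch, assuming the load() signature and vectorize() method from older Malaya releases (names may differ in your version):

import malaya

# Load a pretrained Bahasa encoder, e.g. ELECTRA (model name assumed).
model = malaya.transformer.load(model='electra')

# Extract sentence-level vectors for downstream use.
vectors = model.vectorize(['tolong hantar makanan ke rumah saya'])
print(vectors.shape)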
References
If you use our software for research, please cite:
@misc{Malaya,
  author = {Husein, Zolkepli},
  title = {Malaya},
  year = {2018},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huseinzol05/malaya}}
}
Acknowledgement
Thanks to KeyReply for the private V100s cloud and Mesolitica for the private RTXs cloud used to train Malaya models.
Also, thanks to TensorFlow Research Cloud for free TPU access.
Contributing
Thank you for contributing to this library; it really helps a lot. Feel free to contact me with suggestions, or to contribute in any other form; we accept everything, not just code!
Contents:
- Transformer
- Transformer HuggingFace
- Word Vector
- Pretrained word2vec
- List available pretrained word2vec
- Load pretrained word2vec
- Load word vector interface
- Check top-k similar semantics based on a word
- Check batch top-k similar semantics based on a word
- Word2vec calculator
- Visualize scatter-plot
- Visualize tree-plot
- Visualize social-network
- Get embedding from a word
- Spelling Correction using JamSpell
- Spelling Correction using probability
- Spelling Correction using probability LM
- Compare LM on Spelling Correction
- Spelling Correction using Spylls
- Spelling Correction using Symspeller
- Spelling Correction using encoder Transformer
- Spelling Correction using Transformer
- Preprocessing
- Demoji
- Stemmer and Lemmatization
- True Case
- True Case HuggingFace
- Segmentation
- Segmentation HuggingFace
- Num2Word
- Word2Num
- Coreference Resolution
- Abstractive Normalizer HuggingFace
- Rules based Normalizer
- Load normalizer
- Use translator
- Use segmenter
- Use stemmer
- Validate uppercase
- Validate non human word
- Skip spelling correction
- Pass kwargs preprocessing
- Normalize text
- Normalize url
- Normalize email
- Normalize year
- Normalize telephone
- Normalize date
- Normalize time
- Normalize emoji
- Normalize elongated
- Normalize hingga
- Normalize pada hari bulan
- Normalize fraction
- Normalize money
- Normalize units
- Normalize percents
- Normalize IC
- Normalize Numbers
- Normalize x kali
- Normalize Cardinals
- Normalize Ordinals
- Normalize entity
- Prefix Generator
- Isi Penting Generator
- Isi Penting Generator HuggingFace article style
- Isi Penting Generator HuggingFace headline news style
- Isi Penting Generator HuggingFace karangan style
- Isi Penting Generator HuggingFace news style
- Isi Penting Generator HuggingFace product description style
- Paraphrase
- Paraphrase HuggingFace
- Lexicon Generator
- Clustering
- Cluster same word structure based on POS and Entities
- Cluster Part-Of-Speech
- Cluster Entities
- Load example data
- Generate scatter plot for unsupervised clustering
- Generate dendrogram plot for unsupervised clustering
- Generate undirected graph for unsupervised clustering
- Generate undirected graph for Entities and topics relationship
- Stacking
- API
- malaya
- malaya.alignment.en_ms
- malaya.alignment.ms_en
- malaya.augmentation.abstractive
- malaya.augmentation.encoder
- malaya.augmentation.rules
- malaya.dictionary
- malaya.generator.isi_penting
- malaya.generator.prefix
- malaya.keyword.abstractive
- malaya.keyword.extractive
- malaya.normalizer.abstractive
- malaya.normalizer.rules
- malaya.qa.extractive
- malaya.similarity.doc2vec
- malaya.similarity.semantic
- malaya.spelling_correction.jamspell
- malaya.spelling_correction.probability
- malaya.spelling_correction.spylls
- malaya.spelling_correction.symspell
- malaya.spelling_correction.transformer
- malaya.summarization.abstractive
- malaya.summarization.extractive
- malaya.topic_model.decomposition
- malaya.topic_model.lda2vec
- malaya.topic_model.transformer
- malaya.translation.en_ms
- malaya.translation.ms_en
- malaya.zero_shot.classification
- malaya.zero_shot.entity
- malaya.cluster
- malaya.constituency
- malaya.coref
- malaya.dependency
- malaya.emotion
- malaya.entity
- malaya.jawi_rumi
- malaya.language_detection
- malaya.language_model
- malaya.lexicon
- malaya.nsfw
- malaya.num2word
- malaya.paraphrase
- malaya.phoneme
- malaya.pos
- malaya.preprocessing
- malaya.relevancy
- malaya.rumi_jawi
- malaya.segmentation
- malaya.sentiment
- malaya.stack
- malaya.stem
- malaya.subjectivity
- malaya.syllable
- malaya.tatabahasa
- malaya.tokenizer
- malaya.toxicity
- malaya.transformer
- malaya.true_case
- malaya.word2num
- malaya.wordvector
- malaya.model.alignment
- malaya.model.bert
- malaya.model.bigbird
- malaya.model.extractive_summarization
- malaya.model.huggingface
- malaya.model.ml
- malaya.model.pegasus
- malaya.model.rules
- malaya.model.t5
- malaya.model.tf
- malaya.model.xlnet
- malaya.torch_model.gpt2_lm
- malaya.torch_model.huggingface
- malaya.torch_model.mask_lm
- malaya.transformers.albert
- malaya.transformers.alxlnet
- malaya.transformers.bert
- malaya.transformers.electra
- malaya.transformers.xlnet
- Donation