API#

malaya#

malaya.alignment.en_ms#

malaya.alignment.en_ms.available_huggingface()[source]#

List available HuggingFace models.

malaya.alignment.en_ms.eflomal(preprocessing_func=None, **kwargs)[source]#

Load eflomal word alignment for EN-MS. Model size is around 300MB.

Parameters

preprocessing_func (Callable, optional (default=None)) – preprocessing function to call while loading the prior file. Using malaya.text.function.replace_punct can reduce memory usage by ~30%.

Returns

result

Return type

malaya.model.alignment.Eflomal

malaya.alignment.ms_en#

malaya.alignment.ms_en.eflomal(preprocessing_func=None, **kwargs)[source]#

Load eflomal word alignment for MS-EN. Model size is around 300MB.

Parameters

preprocessing_func (Callable, optional (default=None)) – preprocessing function to call while loading the prior file. Using malaya.text.function.replace_punct can reduce memory usage by ~30%.

Returns

result

Return type

malaya.model.alignment.Eflomal

malaya.alignment.ms_en.huggingface(model='mesolitica/finetuned-bert-base-multilingual-cased-noisy-en-ms', force_check=True, **kwargs)[source]#

Load a HuggingFace BERT word alignment model for MS-EN. Requires TensorFlow >= 2.0.

Parameters
  • model (str, optional (default='mesolitica/finetuned-bert-base-multilingual-cased-noisy-en-ms')) – Check available models at malaya.alignment.ms_en.available_huggingface().

  • force_check (bool, optional (default=True)) – if True, check that the model is one of the Malaya models. Set to False to use your own HuggingFace model.

Returns

result

Return type

malaya.model.alignment.HuggingFace

malaya.augmentation.abstractive#

malaya.augmentation.abstractive.available_huggingface()[source]#

List available huggingface models.

malaya.augmentation.abstractive.huggingface(model='mesolitica/finetune-noisy-translation-t5-small-bahasa-cased-v4', lang='ms', force_check=True, **kwargs)[source]#

Load a HuggingFace model for abstractive text augmentation.

Parameters
  • model (str, optional (default='mesolitica/finetune-noisy-translation-t5-small-bahasa-cased-v4')) – Check available models at malaya.augmentation.abstractive.available_huggingface().

  • lang (str, optional (default='ms')) – input language; only ms or en is accepted.

  • force_check (bool, optional (default=True)) – if True, check that the model is one of the Malaya models. Set to False to use your own HuggingFace model.

Returns

result

Return type

malaya.torch_model.huggingface.Generator

malaya.augmentation.encoder#

malaya.augmentation.encoder.wordvector(string, wordvector, threshold=0.5, top_n=5, soft=False)[source]#

Augment a string using word vectors.

Parameters
  • string (str) – the input string is assumed to be properly tokenized and cleaned.

  • wordvector (object) – wordvector interface object.

  • threshold (float, optional (default=0.5)) – threshold for randomly selecting a word to augment.

  • soft (bool, optional (default=False)) – if True, a word not in the dictionary is replaced with its nearest match by Jaro-Winkler ratio. If False, an exception is thrown when a word is not in the dictionary.

  • top_n (int, optional (default=5)) – number of nearest neighbors returned. The returned result should have length top_n.

Returns

result

Return type

List[str]

malaya.augmentation.encoder.transformer(string, model, threshold=0.5, top_p=0.9, top_k=100, temperature=1.0, top_n=5)[source]#

Augment a string using a transformer with nucleus sampling or top-k sampling.

Parameters
  • string (str) – the input string is assumed to be properly tokenized and cleaned.

  • model (object) – transformer interface object. Currently only BERT, ALBERT and ELECTRA are supported.

  • threshold (float, optional (default=0.5)) – threshold for randomly selecting a word to augment.

  • top_p (float, optional (default=0.9)) – cumulative sum of probabilities to sample a word. If top_p is bigger than 0, the model will use nucleus sampling, else top-k sampling.

  • top_k (int, optional (default=100)) – k for top-k sampling.

  • temperature (float, optional (default=1.0)) – logits * temperature.

  • top_n (int, optional (default=5)) – number of nearest neighbors returned. The returned result should have length top_n.

Returns

result

Return type

List[str]
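The interplay of temperature, top_k and top_p can be illustrated with a minimal, standard-library sketch. This is not Malaya's implementation: the helper names are hypothetical, temperature is applied as `logits * temperature` because that is how the parameter is documented here, and keying the choice between nucleus and top-k sampling on top_p is an assumption.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_candidates(probs, top_k):
    """Indices of the top_k most probable tokens, best first."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:top_k]


def top_p_candidates(probs, top_p):
    """Smallest best-first prefix whose cumulative probability reaches top_p."""
    kept, cumulative = [], 0.0
    for i in top_k_candidates(probs, len(probs)):
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


def sample_token(logits, temperature=1.0, top_p=0.9, top_k=100, rng=random):
    """Sample one token index (hypothetical helper, not Malaya's code)."""
    # Temperature is applied as documented: logits * temperature.
    probs = softmax([x * temperature for x in logits])
    # Assumption: a positive top_p selects nucleus sampling, else top-k.
    if top_p > 0:
        kept = top_p_candidates(probs, top_p)
    else:
        kept = top_k_candidates(probs, top_k)
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

Restricting the candidate set before sampling keeps replacements plausible while still leaving room for variety across the top_n outputs.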

malaya.augmentation.rules#

malaya.augmentation.rules.synonym(string, threshold=0.5, top_n=5, **kwargs)[source]#

Augment a string using synonyms, https://github.com/huseinzol05/Malaya-Dataset#90k-synonym

Parameters
  • string (str) – the input string is assumed to be properly tokenized and cleaned.

  • threshold (float, optional (default=0.5)) – threshold for randomly selecting a word to augment.

  • top_n (int, optional (default=5)) – number of nearest neighbors returned. The returned result should have length top_n.

Returns

result

Return type

List[str]
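As a rough illustration of threshold-gated synonym replacement, the sketch below uses a tiny hypothetical synonym table instead of the 90k synonym dataset. The function name, the table, and the direction of the threshold comparison are all assumptions made for the example, not Malaya's actual behaviour.

```python
import random

# Hypothetical toy synonym table; the real function uses the 90k synonym dataset.
SYNONYMS = {"besar": ["agung"], "cantik": ["indah"]}


def augment_with_synonyms(string, threshold=0.5, top_n=5,
                          synonyms=SYNONYMS, rng=random):
    """Return top_n augmented copies of an already tokenized string."""
    results = []
    for _ in range(top_n):
        words = []
        for word in string.split():
            choices = synonyms.get(word)
            # Assumption: a word is swapped when the draw reaches the threshold.
            if choices and rng.random() >= threshold:
                words.append(rng.choice(choices))
            else:
                words.append(word)
        results.append(" ".join(words))
    return results
```

With threshold=1.0 nothing is replaced in this sketch, and with threshold=0.0 every word that has a synonym is replaced.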

malaya.augmentation.rules.replace_similar_consonants(word, threshold=0.5, replace_consonants={'b': ['n'], 'd': ['s', 'f'], 'f': ['p'], 'g': ['f', 'h'], 'j': ['k'], 'k': ['l'], 'n': ['m'], 'r': ['t', 'q']})[source]#

Naively replace a consonant with another consonant to simulate a typo or slang, when the consonant is followed by a vowel.

Parameters
  • word (str) –

  • threshold (float, optional (default=0.5)) –

Returns

result

Return type

List[str]

malaya.augmentation.rules.replace_similar_vowels(word, threshold=0.5, replace_vowels={'a': ['o'], 'i': ['o'], 'o': ['u'], 'u': ['o']})[source]#

Naively replace a vowel with another vowel to simulate a typo or slang, when the vowel is followed by a consonant.

Parameters
  • word (str) –

  • threshold (float, optional (default=0.5)) –

Returns

result

Return type

str
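The rule described for replace_similar_vowels can be sketched in a few lines. The mapping below is copied from the default replace_vowels in the signature; the function name and the way the threshold gates each replacement are assumptions, so treat this as an illustration and not the library's code.

```python
import random

VOWELS = set("aeiou")
# Default mapping copied from the replace_vowels argument in the signature.
REPLACE_VOWELS = {"a": ["o"], "i": ["o"], "o": ["u"], "u": ["o"]}


def swap_vowels(word, threshold=0.5, replace_vowels=REPLACE_VOWELS, rng=random):
    """Replace a vowel with a similar one when a consonant follows it."""
    chars = list(word)
    for i in range(len(word) - 1):
        current, following = word[i], word[i + 1]
        followed_by_consonant = following.isalpha() and following not in VOWELS
        # Assumption: the swap happens when the random draw reaches the threshold.
        if (current in replace_vowels and followed_by_consonant
                and rng.random() >= threshold):
            chars[i] = rng.choice(replace_vowels[current])
    return "".join(chars)
```

The last character is never changed in this sketch because nothing follows it.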

malaya.augmentation.rules.socialmedia_form(word)[source]#

Augment a word into social media form.

Parameters

word (str) –

Returns

result

Return type

List[str]

malaya.augmentation.rules.vowel_alternate(word, threshold=0.5)[source]#

Augment a word into its vowel-alternate form.

vowel_alternate('singapore') -> sngpore

vowel_alternate('kampung') -> kmpng

vowel_alternate('ayam') -> aym

Parameters
  • word (str) –

  • threshold (float, optional (default=0.5)) –

Returns

result

Return type

str
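The examples suggest that vowels sitting between two consonants are dropped at random. The sketch below follows that reading; the function name and the exact gating by threshold are assumptions inferred from the examples, not taken from Malaya's source.

```python
import random

VOWELS = set("aeiou")


def is_consonant(char):
    """Treat any alphabetic character that is not a, e, i, o, u as a consonant."""
    return char.isalpha() and char not in VOWELS


def drop_inner_vowels(word, threshold=0.5, rng=random):
    """Drop vowels that sit between two consonants (lower-case input assumed)."""
    chars = list(word)
    i = 0
    while i < len(chars) - 2:
        left, middle, right = chars[i], chars[i + 1], chars[i + 2]
        sandwiched = is_consonant(left) and middle in VOWELS and is_consonant(right)
        # Assumption: the vowel is dropped when the draw reaches the threshold.
        if sandwiched and rng.random() >= threshold:
            chars.pop(i + 1)
        i += 1
    return "".join(chars)
```

With threshold=0.0 every sandwiched vowel is dropped, which reproduces the kampung and ayam examples; the singapore example keeps the "o", which is consistent with some draws falling below the threshold.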

malaya.augmentation.rules.kelantanese_form(word)[source]#

Augment a word into Kelantanese form.

ayam -> ayom

otak -> otok

kakak -> kakok

barang -> bare

kembang -> kembe

nyarang -> nyare

Parameters

word (str) –

Returns

result

Return type

List[str]
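The six examples are consistent with two simple suffix rules: a final "ang" becomes "e", and a final "a" plus consonant becomes "o" plus that consonant. The sketch below encodes only those two inferred rules; it is a guess at the pattern, returns a single string where the real function returns List[str], and the function name is hypothetical.

```python
VOWELS = set("aeiou")


def kelantanese_sketch(word):
    """Apply two suffix rules inferred from the documented examples."""
    # Rule 1 (inferred): final "ang" -> "e", e.g. barang -> bare.
    if word.endswith("ang"):
        return word[:-3] + "e"
    # Rule 2 (inferred): final "a" + consonant -> "o" + consonant, e.g. ayam -> ayom.
    if len(word) >= 2 and word[-2] == "a" and word[-1] not in VOWELS:
        return word[:-2] + "o" + word[-1]
    return word
```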

malaya.dictionary#

malaya.dictionary.keyword_wiktionary(word, acceptable_lang=['brunei malay', 'malay'])[source]#

Crawl https://en.wiktionary.org/wiki/ to check whether a word is a Malay word.

Parameters
  • word (str) –

  • acceptable_lang (List[str], optional (default=['brunei malay', 'malay'])) – acceptable languages in wiktionary section.

Returns

result

Return type

Dict

malaya.dictionary.keyword_dbp(word, parse=False)[source]#

Crawl https://prpm.dbp.gov.my/cari1?keyword= to check whether a word is a Malay word.

Parameters
  • word (str) –

  • parse (bool, optional (default=False)) – if True, will parse using BeautifulSoup.

Returns

result

Return type

Dict

malaya.dictionary.corpus_dbp(word)[source]#

Crawl http://sbmb.dbp.gov.my/korpusdbp/Search2.aspx to search the corpus for a word.

Parameters

word (str) –

Returns

result

Return type

pandas.core.frame.DataFrame

malaya.dictionary.is_english(word)[source]#

Check whether a word is an English word.

Parameters

word (str) –

Returns

result

Return type

bool

malaya.dictionary.is_malay(word, stemmer=None)[source]#

Check whether a word is a Malay word.

Parameters
  • word (str) –

  • stemmer (Callable, optional (default=None)) – a Callable object that must have a stem_word method.

Returns

result

Return type

bool

malaya.dictionary.convert_pinyin(string)[source]#

Convert Mandarin characters to pinyin form. Original vocab from https://github.com/lxyu/pinyin

你好 -> ni hao

Parameters

string (str) –

Returns

result

Return type

str

malaya.generator.isi_penting#

malaya.generator.isi_penting.available_transformer()[source]#

List available transformer models.

malaya.generator.isi_penting.available_huggingface()[source]#

List available huggingface models.

malaya.generator.isi_penting.transformer(model='t5', quantized=False, **kwargs)[source]#

Load a Transformer model to generate a string given isi penting.

Parameters
  • model (str, optional (default='t5')) – Check available models at malaya.generator.isi_penting.available_transformer().

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result

Return type

malaya.model.t5.Generator class

malaya.generator.isi_penting.huggingface(model='mesolitica/finetune-isi-penting-generator-t5-base-standard-bahasa-cased', force_check=True, **kwargs)[source]#

Load HuggingFace model to generate text based on isi penting.

Parameters
  • model (str, optional (default='mesolitica/finetune-isi-penting-generator-t5-base-standard-bahasa-cased')) – Check available models at malaya.generator.isi_penting.available_huggingface().

  • force_check (bool, optional (default=True)) – if True, check that the model is one of the Malaya models. Set to False to use your own HuggingFace model.

Returns

result

Return type

malaya.torch_model.huggingface.IsiPentingGenerator

malaya.generator.prefix#

malaya.generator.prefix.babble_tf(string, model, generate_length=30, leed_out_len=1, temperature=1.0, top_k=100, burnin=15, batch_size=5)[source]#

Use pretrained malaya transformer models to generate a string given a prefix string. https://github.com/nyu-dl/bert-gen, https://arxiv.org/abs/1902.04094

Parameters
  • string (str) –

  • model (object) – transformer interface object. Currently only BERT, ALBERT and ELECTRA are supported.

  • generate_length (int, optional (default=30)) – length of sentence to generate.

  • leed_out_len (int, optional (default=1)) – length of extra masks for each iteration.

  • temperature (float, optional (default=1.0)) – logits * temperature.

  • top_k (int, optional (default=100)) – k for top-k sampling.

  • burnin (int, optional (default=15)) – for the first burnin steps, sample from the entire next word distribution, instead of top_k.

  • batch_size (int, optional (default=5)) – generate sentences size of batch_size.

Returns

result

Return type

List[str]

malaya.generator.prefix.available_transformer()[source]#

List available gpt2 generator models.

malaya.generator.prefix.available_huggingface()[source]#

List available gpt2 generator models.

malaya.generator.prefix.transformer(model='345M', quantized=False, **kwargs)[source]#

Load GPT2 model to generate a string given a prefix string.

Parameters
  • model (str, optional (default='345M')) – Check available models at malaya.generator.prefix.available_transformer().

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result

Return type

malaya.model.tf.GPT2 class

malaya.generator.prefix.huggingface(model='mesolitica/gpt2-117m-bahasa-cased-v2', force_check=True, **kwargs)[source]#

Load Prefix language model.

Parameters
  • model (str, optional (default='mesolitica/gpt2-117m-bahasa-cased-v2')) – Check available models at malaya.generator.prefix.available_huggingface().

  • force_check (bool, optional (default=True)) – if True, check that the model is one of the Malaya models. Set to False to use your own HuggingFace model.

Returns

result

Return type

malaya.torch_model.huggingface.Prefix class

malaya.keyword.abstractive#

malaya.keyword.abstractive.available_huggingface()[source]#

List available huggingface models.

malaya.keyword.abstractive.huggingface(model='mesolitica/finetune-keyword-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]#

Load a HuggingFace model for abstractive keyword generation.

Parameters
  • model (str, optional (default='mesolitica/finetune-keyword-t5-small-standard-bahasa-cased')) – Check available models at malaya.keyword.abstractive.available_huggingface().

  • force_check (bool, optional (default=True)) – if True, check that the model is one of the Malaya models. Set to False to use your own HuggingFace model.

Returns

result

Return type

malaya.torch_model.huggingface.Keyword

malaya.keyword.extractive#

malaya.keyword.extractive.rake(string, vocab=None, model=None, vectorizer=None, top_k=5, atleast=1, stopwords=<function get_stopwords>, **kwargs)[source]#

Extract keywords using Rake algorithm.

Parameters
  • string (str) –

  • vocab (List[str], optional (default=None)) – List of important substrings. This will override vectorizer parameter.

  • model (Object, optional (default=None)) – model must have an attention method.

  • vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or malaya.text.vectorizer.SkipGramCountVectorizer. If None, n-grams are generated automatically based on stopwords.

  • top_k (int, optional (default=5)) – return top-k results.

  • atleast (int, optional (default=1)) – minimum number of times a candidate must appear in the string to be accepted.

  • stopwords (List[str], (default=malaya.text.function.get_stopwords)) – a callable that returns a List[str], or a List[str], or a Tuple[str]. Used by the automatic n-gram generator.

Returns

result

Return type

Tuple[float, str]
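For orientation, the textbook RAKE idea can be sketched with the standard library alone: split the text into candidate phrases at stopwords, score each word by degree / frequency, and score a phrase as the sum of its word scores. This is a simplified illustration with a hypothetical function name and a hand-picked stopword set; it does not use vocab, model, vectorizer or atleast, and it is not Malaya's implementation.

```python
from collections import defaultdict


def rake_sketch(text, stopwords, top_k=5):
    """Return up to top_k (score, phrase) pairs, highest score first."""
    # 1. Candidate phrases are maximal runs of non-stopwords.
    phrases, current = [], []
    for word in text.lower().split():
        if word in stopwords:
            if current:
                phrases.append(current)
                current = []
        else:
            current.append(word)
    if current:
        phrases.append(current)

    # 2. Word score = degree / frequency, where degree sums the lengths
    #    of all phrases the word occurs in.
    frequency, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)

    # 3. Phrase score = sum of its word scores; duplicates collapse by key.
    scored = {
        " ".join(phrase): sum(degree[w] / frequency[w] for w in phrase)
        for phrase in phrases
    }
    ranked = sorted(((score, phrase) for phrase, score in scored.items()),
                    reverse=True)
    return ranked[:top_k]
```

The (score, phrase) ordering of each pair mirrors the documented Tuple[float, str] return type.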

malaya.keyword.extractive.textrank(string, vocab=None, model=None, vectorizer=None, top_k=5, atleast=1, stopwords=<function get_stopwords>, **kwargs)[source]#

Extract keywords using Textrank algorithm.

Parameters
  • string (str) –

  • vocab (List[str], optional (default=None)) – List of important substrings. This will override vectorizer parameter.

  • model (Object, optional (default=None)) – model must have a fit_transform or vectorize method.

  • vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or malaya.text.vectorizer.SkipGramCountVectorizer. If None, n-grams are generated automatically based on stopwords.

  • top_k (int, optional (default=5)) – return top-k results.

  • atleast (int, optional (default=1)) – minimum number of times a candidate must appear in the string to be accepted.

  • stopwords (List[str], (default=malaya.text.function.get_stopwords)) – a callable that returns a List[str], or a List[str], or a Tuple[str].

Returns

result

Return type

Tuple[float, str]

malaya.keyword.extractive.attention(string, model, vocab=None, vectorizer=None, top_k=5, atleast=1, stopwords=<function get_stopwords>, **kwargs)[source]#

Extract keywords using Attention mechanism.

Parameters
  • string (str) –

  • model (Object) – model must have an attention method.

  • vocab (List[str], optional (default=None)) – List of important substrings. This will override vectorizer parameter.

  • vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or malaya.text.vectorizer.SkipGramCountVectorizer. If None, n-grams are generated automatically based on stopwords.

  • top_k (int, optional (default=5)) – return top-k results.

  • atleast (int, optional (default=1)) – minimum number of times a candidate must appear in the string to be accepted.

  • stopwords (List[str], (default=malaya.text.function.get_stopwords)) – a callable that returns a List[str], or a List[str], or a Tuple[str].

Returns

result

Return type

Tuple[float, str]

malaya.keyword.extractive.similarity(string, model, vocab=None, vectorizer=None, top_k=5, atleast=1, stopwords=<function get_stopwords>, **kwargs)[source]#

Extract keywords using the similarity between the sentence embedding and keyword embeddings.

Parameters
  • string (str) –

  • model (Object) – Transformer model or any model that has a vectorize method.

  • vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or malaya.text.vectorizer.SkipGramCountVectorizer. If None, n-grams are generated automatically based on stopwords.

  • top_k (int, optional (default=5)) – return top-k results.

  • atleast (int, optional (default=1)) – minimum number of times a candidate must appear in the string to be accepted.

  • stopwords (List[str], (default=malaya.text.function.get_stopwords)) – a callable that returns a List[str], or a List[str], or a Tuple[str].

Returns

result

Return type

Tuple[float, str]

malaya.keyword.extractive.available_transformer()[source]#

List available transformer keyword similarity models.

malaya.keyword.extractive.transformer(model='bert', quantized=False, **kwargs)[source]#

Load Transformer keyword similarity model.

Parameters
  • model (str, optional (default='bert')) – Check available models at malaya.keyword.extractive.available_transformer().

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if bert in model, will return malaya.model.bert.KeyphraseBERT.

  • if xlnet in model, will return malaya.model.xlnet.KeyphraseXLNET.

Return type

model

malaya.normalizer.abstractive#

malaya.normalizer.abstractive.available_huggingface()[source]#

List available huggingface models.

malaya.normalizer.abstractive.huggingface(model='mesolitica/finetune-noisy-translation-t5-small-bahasa-cased-v4', force_check=True, use_rules_normalizer=True, kenlm_model='bahasa-wiki-news', stem_model='noisy', segmenter=None, text_scorer=None, replace_augmentation=True, minlen_speller=2, maxlen_speller=12, **kwargs)[source]#

Load a HuggingFace model for abstractive text normalization. The pipeline is text -> rules based text normalizer -> abstractive. To skip the rules based text normalizer, set use_rules_normalizer=False.

Parameters
  • model (str, optional (default='mesolitica/finetune-noisy-translation-t5-small-bahasa-cased-v4')) – Check available models at malaya.normalizer.abstractive.available_huggingface().

  • force_check (bool, optional (default=True)) – if True, check that the model is one of the Malaya models. Set to False to use your own HuggingFace model.

  • use_rules_normalizer (bool, optional (default=True)) –

  • kenlm_model (str, optional (default='bahasa-wiki-news')) – the model trained on malaya.language_model.kenlm(model = 'bahasa-wiki-news'), but you can use any kenlm model from malaya.language_model.available_kenlm. You can also pass None to skip spelling correction but still apply the rules normalizer. This parameter will be ignored if use_rules_normalizer=False.

  • stem_model (str, optional (default='noisy')) – the model trained on malaya.stem.deep_model(model = 'noisy'), but you can use any stemmer model from malaya.stem.available_model. You can also pass None to skip stemming but still apply the rules normalizer. This parameter will be ignored if use_rules_normalizer=False.

  • segmenter (Callable, optional (default=None)) – segmenter function to segment text, read more at https://malaya.readthedocs.io/en/stable/load-normalizer.html#Use-segmenter. During the training session, we use malaya.segmentation.huggingface(). It is safe to set as None. This parameter will be ignored if use_rules_normalizer=False.

  • text_scorer (Callable, optional (default=None)) – text scorer to validate upper case. During the training session, we use malaya.language_model.kenlm(model = 'bahasa-wiki-news'). This parameter will be ignored if use_rules_normalizer=False.

  • replace_augmentation (bool, optional (default=True)) – use the replace Norvig augmentation method. Enabling this might generate a bigger set of candidates, hence slower. This parameter will be ignored if use_rules_normalizer=False.

  • minlen_speller (int, optional (default=2)) – minimum length of word to check spelling correction. This parameter will be ignored if use_rules_normalizer=False.

  • maxlen_speller (int, optional (default=12)) – max length of word to check spelling correction. This parameter will be ignored if use_rules_normalizer=False.

Returns

result

Return type

malaya.torch_model.huggingface.Normalizer

malaya.normalizer.rules#

malaya.normalizer.rules.load(speller=None, stemmer=None, **kwargs)[source]#

Load a Normalizer using any spelling correction model.

Parameters
  • speller (Callable, optional (default=None)) – function to correct spelling, must have a correct or normalize_elongated method.

  • stemmer (Callable, optional (default=None)) – function to stem, must have a stem_word method. If a stemmer is provided, kata imbuhan akhir will be stemmed more accurately.

Returns

result

Return type

malaya.normalizer.rules.Normalizer class

class malaya.normalizer.rules.Normalizer[source]#
normalize(string, normalize_text=True, normalize_url=False, normalize_email=False, normalize_year=True, normalize_telephone=True, normalize_date=True, normalize_time=True, normalize_emoji=True, normalize_elongated=True, normalize_hingga=True, normalize_pada_hari_bulan=True, normalize_fraction=True, normalize_money=True, normalize_units=True, normalize_percent=True, normalize_ic=True, normalize_number=True, normalize_x_kali=True, normalize_cardinal=True, normalize_ordinal=True, normalize_entity=True, expand_contractions=True, check_english_func=<function is_english>, check_malay_func=<function is_malay>, translator=None, language_detection_word=None, acceptable_language_detection=['EN', 'CAPITAL', 'NOT_LANG'], segmenter=None, text_scorer=None, text_scorer_window=2, not_a_word_threshold=0.0001, dateparser_settings={'TIMEZONE': 'GMT+8'}, **kwargs)[source]#

Normalize a string.

Parameters
  • string (str) –

  • normalize_text (bool, optional (default=True)) – if True, will try to replace shortforms with internal corpus.

  • normalize_url (bool, optional (default=False)) – if True, replace :// with empty and . with dot. https://huseinhouse.com -> https huseinhouse dot com.

  • normalize_email (bool, optional (default=False)) – if True, replace @ with di, . with dot. husein.zol05@gmail.com -> husein dot zol kosong lima di gmail dot com.

  • normalize_year (bool, optional (default=True)) – if True, tahun 1987 -> tahun sembilan belas lapan puluh tujuh. if True, 1970-an -> sembilan belas tujuh puluh an. if False, tahun 1987 -> tahun seribu sembilan ratus lapan puluh tujuh.

  • normalize_telephone (bool, optional (default=True)) – if True, no 012-1234567 -> no kosong satu dua, satu dua tiga empat lima enam tujuh

  • normalize_date (bool, optional (default=True)) – if True, 01/12/2001 -> satu disember dua ribu satu. if True, Jun 2017 -> satu Jun dua ribu tujuh belas. if True, 2017 Jun -> satu Jun dua ribu tujuh belas. if False, 2017 Jun -> 01/06/2017. if False, Jun 2017 -> 01/06/2017.

  • normalize_time (bool, optional (default=True)) – if True, pukul 2.30 -> pukul dua tiga puluh minit. if False, pukul 2.30 -> '02:00:00'

  • normalize_emoji (bool, optional (default=True)) – if True, 🔥 -> emoji api. Loaded from malaya.preprocessing.demoji.

  • normalize_elongated (bool, optional (default=True)) – if True, betuii -> betui.

  • normalize_hingga (bool, optional (default=True)) – if True, 2011 - 2019 -> dua ribu sebelas hingga dua ribu sembilan belas

  • normalize_pada_hari_bulan (bool, optional (default=True)) – if True, pada 10/4 -> pada sepuluh hari bulan empat

  • normalize_fraction (bool, optional (default=True)) – if True, 10 /4 -> sepuluh per empat

  • normalize_money (bool, optional (default=True)) – if True, rm10.4m -> sepuluh juta empat ratus ribu ringgit

  • normalize_units (bool, optional (default=True)) – if True, 61.2 kg -> enam puluh satu perpuluhan dua kilogram

  • normalize_percent (bool, optional (default=True)) – if True, 0.8% -> kosong perpuluhan lapan peratus

  • normalize_ic (bool, optional (default=True)) – if True, 911111-01-1111 -> sembilan satu satu satu satu satu sempang kosong satu sempang satu satu satu satu

  • normalize_number (bool, optional (default=True)) – if True, 0123 -> kosong satu dua tiga

  • normalize_x_kali (bool, optional (default=True)) – if True, 10x -> sepuluh kali

  • normalize_cardinal (bool, optional (default=True)) – if True, 123 -> seratus dua puluh tiga

  • normalize_ordinal (bool, optional (default=True)) – if True, ke-123 -> keseratus dua puluh tiga

  • normalize_entity (bool, optional (default=True)) – normalize entities; only affects date, datetime, time and money patterns.

  • expand_contractions (bool, optional (default=True)) – expand English contractions.

  • check_english_func (Callable, optional (default=malaya.text.function.is_english)) – function to check whether a word is in the English dictionary, default is malaya.text.function.is_english. This parameter is also used for Malay text normalization.

  • check_malay_func (Callable, optional (default=malaya.text.function.is_malay)) – function to check a word in malay dictionary, default is malaya.text.function.is_malay.

  • translator (Callable, optional (default=None)) – function to translate EN word to MS word.

  • language_detection_word (Callable, optional (default=None)) – function to detect the language of each word to get better translation results.

  • acceptable_language_detection (List[str], optional (default=['EN', 'CAPITAL', 'NOT_LANG'])) – only translate substrings if the result from language_detection_word is in acceptable_language_detection.

  • segmenter (Callable, optional (default=None)) – function to segment a word. If provided, it will expand a word, apaitu -> apa itu

  • text_scorer (Callable, optional (default=None)) – function to validate upper case words. If the lower case score is higher than or equal to the upper case score, the lower case form is chosen.

  • text_scorer_window (int, optional (default=2)) – size of lookback and lookforward to validate upper word.

  • not_a_word_threshold (float, optional (default=1e-4)) – assume a word is not a human word if score lower than not_a_word_threshold. only usable if passed text_scorer parameter.

  • dateparser_settings (Dict, optional (default={'TIMEZONE': 'GMT+8'})) – default dateparser settings, check supported settings at https://dateparser.readthedocs.io/en/latest/

Returns

result

Return type

{'normalize', 'date', 'money'}
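Several of the rules above (normalize_number, normalize_ic, normalize_telephone) spell digits out one at a time. A minimal sketch of that digit-by-digit spelling, checked against the 0123 and 911111-01-1111 examples, might look like the following. The function name is hypothetical and this covers only the digit and hyphen handling, not the Normalizer as a whole.

```python
# Malay words for the digits 0-9, all of which appear in the examples above.
DIGIT_WORDS = ["kosong", "satu", "dua", "tiga", "empat",
               "lima", "enam", "tujuh", "lapan", "sembilan"]


def spell_digits(text):
    """Spell each ASCII digit as a word; '-' becomes 'sempang'."""
    words = []
    for char in text:
        if char in "0123456789":
            words.append(DIGIT_WORDS[int(char)])
        elif char == "-":
            # Mirrors the normalize_ic example, where hyphens are read aloud.
            words.append("sempang")
        # Any other character is ignored in this sketch.
    return " ".join(words)
```

Cardinal readings such as 123 -> seratus dua puluh tiga need place-value logic and are outside the scope of this sketch.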

malaya.qa.extractive#

malaya.qa.extractive.available_transformer()[source]#

List available Transformer Span models.

malaya.qa.extractive.available_huggingface()[source]#

List available huggingface models.

malaya.qa.extractive.transformer(model='xlnet', quantized=False, **kwargs)[source]#

Load Transformer Span model trained on SQUAD V2 dataset.

Parameters
  • model (str, optional (default='xlnet')) – Check available models at malaya.qa.extractive.available_transformer().

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result

Return type

malaya.model.tf.SQUAD class

malaya.qa.extractive.huggingface(model='mesolitica/finetune-qa-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]#

Load a HuggingFace model for extractive question answering.

Parameters
  • model (str, optional (default='mesolitica/finetune-qa-t5-small-standard-bahasa-cased')) – Check available models at malaya.qa.extractive.available_huggingface().

  • force_check (bool, optional (default=True)) – if True, check that the model is one of the Malaya models. Set to False to use your own HuggingFace model.

Returns

result

Return type

malaya.torch_model.huggingface.ExtractiveQA

malaya.similarity.doc2vec#

malaya.similarity.doc2vec.wordvector(wv)[source]#

Doc2vec interface for text similarity using Word Vector.

Parameters

wv (object) – malaya.wordvector.WordVector object. It should have a get_vector_by_name method.

Returns

result

Return type

malaya.similarity.doc2vec.Doc2VecSimilarity

malaya.similarity.doc2vec.vectorizer(v)[source]#

Doc2vec interface for text similarity using Encoder model.

Parameters

v (object) – encoder interface object, such as BERT or XLNET. It should have a vectorize method.

Returns

result

Return type

malaya.similarity.doc2vec.VectorizerSimilarity

class malaya.similarity.doc2vec.VectorizerSimilarity[source]#
predict_proba(left_strings, right_strings, similarity='cosine')[source]#

Calculate similarity for two different batches of texts.

Parameters
  • left_strings (list of str) –

  • right_strings (list of str) –

  • similarity (str, optional (default='cosine')) –

    similarity supported. Allowed values:

    • 'cosine' - cosine similarity.

    • 'euclidean' - euclidean similarity.

    • 'manhattan' - manhattan similarity.

Returns

result

Return type

List[float]
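The three similarity options can be illustrated on plain lists of floats. In the sketch below, cosine similarity is the standard formula, while turning the euclidean and manhattan distances into similarities via 1 / (1 + d) is an assumption chosen for illustration; the function names are hypothetical and the encoder step that produces the vectors is left out.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def euclidean_similarity(a, b):
    # Assumption: map the distance d to a similarity with 1 / (1 + d).
    return 1.0 / (1.0 + math.dist(a, b))


def manhattan_similarity(a, b):
    # Assumption: same 1 / (1 + d) mapping, with d the sum of absolute differences.
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)))


SIMILARITY = {
    "cosine": cosine_similarity,
    "euclidean": euclidean_similarity,
    "manhattan": manhattan_similarity,
}


def pairwise_similarity(left_vectors, right_vectors, similarity="cosine"):
    """Score left_vectors[i] against right_vectors[i] for every i."""
    if similarity not in SIMILARITY:
        raise ValueError(f"similarity must be one of {sorted(SIMILARITY)}")
    score = SIMILARITY[similarity]
    return [score(left, right) for left, right in zip(left_vectors, right_vectors)]
```

The result has one float per input pair, matching the documented List[float] return type.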

heatmap(strings, similarity='cosine', visualize=True, annotate=True, figsize=(7, 7))[source]#

Plot a heatmap based on the output of the similarity model.

Parameters
  • strings (list of str) – list of strings.

  • similarity (str, optional (default='cosine')) –

    similarity supported. Allowed values:

    • 'cosine' - cosine similarity.

    • 'euclidean' - euclidean similarity.

    • 'manhattan' - manhattan similarity.

  • visualize (bool) – if True, render the plot with plt.show, else return the data.

  • figsize (tuple, (default=(7, 7))) – figure size for plot.

Returns

result – list of results

Return type

list

class malaya.similarity.doc2vec.Doc2VecSimilarity[source]#
predict_proba(left_strings, right_strings, aggregation=<function mean>, similarity='cosine', soft=False)[source]#

Calculate similarity for two different batches of texts.

Parameters
  • left_strings (list of str) –

  • right_strings (list of str) –

  • aggregation (Callable, optional (default=numpy.mean)) –

  • similarity (str, optional (default='cosine')) –

    similarity supported. Allowed values:

    • 'cosine' - cosine similarity.

    • 'euclidean' - euclidean similarity.

    • 'manhattan' - manhattan similarity.

  • soft (bool, optional (default=False)) – if True, a word not inside the word vector is replaced with the nearest word, else it is skipped.

Returns

result

Return type

List[float]
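The aggregation step can be pictured as collapsing per-word vectors into one document vector. The sketch below uses a tiny hypothetical vector table and a pure-Python element-wise mean standing in for numpy.mean; skipping unknown words mirrors the documented soft=False behaviour. The names are illustrative and not part of Malaya.

```python
# Hypothetical toy word vectors; a real word vector table is far larger.
TOY_VECTORS = {
    "saya": [1.0, 0.0],
    "makan": [0.0, 1.0],
    "nasi": [1.0, 1.0],
}


def mean(vectors):
    """Element-wise mean, standing in for numpy.mean over the word axis."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def document_vector(string, vectors=TOY_VECTORS, aggregation=mean):
    """Aggregate the vectors of the known words in a tokenized string."""
    # Words missing from the table are skipped, as described for soft=False.
    found = [vectors[word] for word in string.split() if word in vectors]
    if not found:
        raise ValueError("no known words in input")
    return aggregation(found)
```

Two document vectors built this way can then be compared with any of the similarity options listed above.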

heatmap(strings, aggregation=<function mean>, similarity='cosine', soft=False, visualize=True, annotate=True, figsize=(7, 7))[source]#

Plot a heatmap based on the output of the similarity model.

Parameters
  • strings (list of str) – list of strings.

  • aggregation (Callable, optional (default=numpy.mean)) –

  • similarity (str, optional (default='cosine')) –

    similarity supported. Allowed values:

    • 'cosine' - cosine similarity.

    • 'euclidean' - euclidean similarity.

    • 'manhattan' - manhattan similarity.

  • soft (bool, optional (default=False)) – if True, a word not inside the word vector is replaced with the nearest word, else it is skipped.

  • visualize (bool) – if True, render the plot with plt.show, else return the data.

  • figsize (tuple, (default=(7, 7))) – figure size for plot.

Returns

result – list of results.

Return type

list

malaya.similarity.semantic#

malaya.spelling_correction.jamspell#

class malaya.spelling_correction.jamspell.JamSpell[source]#
correct(word, string, index=-1)[source]#

Correct a word within a text, returning the corrected word.

Parameters
  • word (str) –

  • string (List[str]) – Tokenized string, word must be a word inside string.

  • index (int, optional(default=-1)) – index of word in the string, if -1, will try to use string.index(word).

Returns

result

Return type

str

correct_word(word, string, index=-1)[source]#

Spell-correct word, and preserve proper upper, lower and title case.

correct_match(match, string, index=-1)[source]#

Spell-correct word in re.match, and preserve proper upper, lower, title case.

correct_text(text)[source]#

Correct all the words within a text, returning the corrected text.

Parameters

text (str) –

Returns

result

Return type

str

edit_candidates(word, string, index=-1)[source]#

Generate candidates given a word.

Parameters
  • word (str) –

  • string (str) – Entire string, word must be a word inside string.

  • index (int, optional(default=-1)) – index of word in the string, if -1, will try to use string.index(word).

Returns

result

Return type

List[str]

malaya.spelling_correction.probability#

class malaya.spelling_correction.probability.Spell[source]#
edit_step(word)[source]#

Generate possible combination of an input.

edits2(word)[source]#

All edits that are two edits away from word.

known(words)[source]#

The subset of words that appear in the dictionary of WORDS.
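
A minimal sketch of one-edit and two-edit candidate generation in the style of Peter Norvig's spell corrector is shown here. It is an illustration only: the vocabulary is a toy set, the alphabet is plain ASCII lowercase, and Malaya's own edit_step may generate a different candidate set.

```python
import string


def edits1(word, letters=string.ascii_lowercase):
    """All strings one delete, transpose, replace or insert away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """All strings reachable by applying edits1 twice."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


def known(words, vocabulary):
    """The subset of `words` that appears in `vocabulary`."""
    return {w for w in words if w in vocabulary}


VOCAB = {'makan', 'makna', 'ayam'}  # hypothetical dictionary
one_edit = known(edits1('mkan'), VOCAB)
two_edits = known(edits2('mkan'), VOCAB)
```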

edit_candidates(word)[source]#

Generate candidates given a word.

Parameters

word (str) –

Returns

result

Return type

List[str]

correct_text(text)[source]#

Correct all the words within a text, returning the corrected text.

Parameters

text (str) –

Returns

result

Return type

str

correct_match(match)[source]#

Spell-correct word in re.match, and preserve proper upper, lower, title case.

correct_word(word)[source]#

Spell-correct word, and preserve proper upper, lower and title case.

Parameters

word (str) –

Returns

result

Return type

str
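
Case preservation can be sketched as below, assuming the approach used in Norvig-style correctors: detect the casing of the input, correct the lowercased word, then reapply the casing. The helper names and the toy corrector are hypothetical.

```python
def case_of(text):
    """Return the str method that reproduces the casing of `text`."""
    if text.isupper():
        return str.upper
    if text.islower():
        return str.lower
    if text.istitle():
        return str.title
    return str  # mixed case: leave the corrected word unchanged


def correct_preserving_case(word, corrector):
    """Correct the lowercased word, then restore the original casing."""
    return case_of(word)(corrector(word.lower()))


def toy_corrector(word):
    # Hypothetical lookup standing in for a real spelling corrector.
    return {'mkn': 'makan'}.get(word, word)
```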

class malaya.spelling_correction.probability.Probability[source]#

Extends Peter Norvig’s spell corrector (http://norvig.com/spell-correct.html) with algorithms from Normalization of noisy texts in Malaysian online reviews, https://www.researchgate.net/publication/287050449_Normalization_of_noisy_texts_in_Malaysian_online_reviews, and adds custom vowel augmentation.

P(word)[source]#

Probability of word.

correct(word, score_func=None, **kwargs)[source]#

Most probable spelling correction for word.

Parameters

word (str) –

Returns

result

Return type

str
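
Picking the most probable correction can be sketched with a word-frequency table. The counts below are invented for illustration, and the helper most_probable is hypothetical; it only shows the idea of ranking candidates by P(word).

```python
from collections import Counter

# Hypothetical corpus counts; a real corrector loads these from a large corpus.
WORDS = Counter({'makan': 50, 'makna': 5, 'ayam': 20})


def P(word, counts=WORDS):
    """Relative frequency of `word`; unseen words get probability 0."""
    return counts[word] / sum(counts.values())


def most_probable(candidates, counts=WORDS):
    """Return the candidate with the highest probability."""
    return max(candidates, key=lambda w: P(w, counts))
```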

class malaya.spelling_correction.probability.ProbabilityLM[source]#

Extends Peter Norvig’s spell corrector (http://norvig.com/spell-correct.html) with a language model and with algorithms from Normalization of noisy texts in Malaysian online reviews, https://www.researchgate.net/publication/287050449_Normalization_of_noisy_texts_in_Malaysian_online_reviews, and adds custom vowel augmentation.

correct(word, string, index=- 1, lookback=3, lookforward=3, **kwargs)[source]#

Correct a word within a text, returning the corrected word.

Parameters
  • word (str) –

  • string (List[str]) – Entire string, word must be a word inside string.

  • index (int, optional (default=-1)) – index of word in the string, if -1, will try to use string.index(word).

  • lookback (int, optional (default=3)) – N words on the left hand side. if put -1, will take all words on the left hand side. longer left hand side will take longer to compute.

  • lookforward (int, optional (default=3)) – N words on the right hand side. if put -1, will take all words on the right hand side. longer right hand side will take longer to compute.

Returns

result

Return type

str
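
The lookback and lookforward parameters can be pictured as a window around the target word. The sketch below uses a hypothetical helper, context_window, and only illustrates the documented meaning of -1 (take everything on that side); how the language model then scores the window is not shown.

```python
def context_window(tokens, index, lookback=3, lookforward=3):
    """Return (left, right) context tokens around tokens[index]."""
    if lookback == -1:
        left = tokens[:index]
    else:
        left = tokens[max(0, index - lookback):index]
    if lookforward == -1:
        right = tokens[index + 1:]
    else:
        right = tokens[index + 1:index + 1 + lookforward]
    return left, right


tokens = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
left, right = context_window(tokens, 4)
```

A wider window gives the model more context at the cost of longer computation, as the parameter descriptions note.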

correct_text(text, lookback=3, lookforward=3)[source]#

Correct all the words within a text, returning the corrected text.

Parameters
  • text (str) –

  • lookback (int, optional (default=3)) – N words on the left hand side. if put -1, will take all words on the left hand side. longer left hand side will take longer to compute.

  • lookforward (int, optional (default=3)) – N words on the right hand side. if put -1, will take all words on the right hand side. longer right hand side will take longer to compute.

Returns

result

Return type

str

correct_word(word, string, index=- 1, lookback=3, lookforward=3)[source]#

Spell-correct word, and preserve proper upper, lower and title case.

Parameters
  • word (str) –

  • string (List[str]) – Tokenized string, word must be a word inside string.

  • index (int, optional(default=-1)) – index of word in the string, if -1, will try to use string.index(word).

  • lookback (int, optional (default=3)) – N words on the left hand side. if put -1, will take all words on the left hand side. longer left hand side will take longer to compute.

  • lookforward (int, optional (default=3)) – N words on the right hand side. if put -1, will take all words on the right hand side. longer right hand side will take longer to compute.

Returns

result

Return type

str

correct_match(match, string, index=- 1, lookback=3, lookforward=3)[source]#

Spell-correct word in re.match, and preserve proper upper, lower, title case.

malaya.spelling_correction.spylls#

class malaya.spelling_correction.spylls.Spylls[source]#
correct(word)[source]#

Correct a word within a text, returning the corrected word.

Parameters

word (str) –

Returns

result

Return type

str

edit_candidates(word)[source]#

Generate candidates given a word.

Parameters

word (str) –

Returns

result

Return type

List[str]

malaya.spelling_correction.symspell#

class malaya.spelling_correction.symspell.Symspell[source]#

Extends symspellpy (https://github.com/mammothb/symspellpy) with algorithms from Normalization of noisy texts in Malaysian online reviews, https://www.researchgate.net/publication/287050449_Normalization_of_noisy_texts_in_Malaysian_online_reviews, and adds custom vowel augmentation.

edit_step(word)[source]#

Generate candidates given a word.

Parameters

word (str) –

Returns

result

Return type

{candidate1, candidate2}

edit_candidates(word, get_score=False)[source]#

Generate candidates given a word.

Parameters

word (str) –

Returns

result

Return type

List[str]

correct(word, **kwargs)[source]#

Most probable spelling correction for word.

Parameters

word (str) –

Returns

result

Return type

str

malaya.spelling_correction.transformer#

class malaya.spelling_correction.transformer.Transformer[source]#
correct(word, string, index=- 1, lookback=5, lookforward=5, batch_size=20, **kwargs)[source]#

Correct a word within a text, returning the corrected word.

Parameters
  • word (str) –

  • string (List[str]) – Tokenized string, word must be a word inside string.

  • index (int, optional (default=-1)) – index of word in the string, if -1, will try to use string.index(word).

  • lookback (int, optional (default=5)) – N words on the left hand side. if put -1, will take all words on the left hand side. longer left hand side will take longer to compute.

  • lookforward (int, optional (default=5)) – N words on the right hand side. if put -1, will take all words on the right hand side. longer right hand side will take longer to compute.

  • batch_size (int, optional (default=20)) – batch size to insert into model.

Returns

result

Return type

str

correct_text(text, lookback=5, lookforward=5, batch_size=20)[source]#

Correct all the words within a text, returning the corrected text.

Parameters
  • text (str) –

  • lookback (int, optional (default=5)) – N words on the left hand side. if put -1, will take all words on the left hand side. longer left hand side will take longer to compute.

  • lookforward (int, optional (default=5)) – N words on the right hand side. if put -1, will take all words on the right hand side. longer right hand side will take longer to compute.

  • batch_size (int, optional(default=20)) – batch size to insert into model.

Returns

result

Return type

str

correct_word(word, string, index=- 1, lookback=5, lookforward=5, batch_size=20)[source]#

Spell-correct word, and preserve proper upper, lower and title case.

Parameters
  • word (str) –

  • string (List[str]) – Tokenized string, word must be a word inside string.

  • index (int, optional(default=-1)) – index of word in the string, if -1, will try to use string.index(word).

  • lookback (int, optional (default=5)) – N words on the left hand side. if put -1, will take all words on the left hand side. longer left hand side will take longer to compute.

  • lookforward (int, optional (default=5)) – N words on the right hand side. if put -1, will take all words on the right hand side. longer right hand side will take longer to compute.

  • batch_size (int, optional(default=20)) – batch size to insert into model.

Returns

result

Return type

str

correct_match(match, string, index=- 1, lookback=5, lookforward=5, batch_size=20)[source]#

Spell-correct word in re.match, and preserve proper upper, lower, title case.

malaya.summarization.abstractive#

malaya.summarization.abstractive.available_transformer()[source]#

List available transformer models.

malaya.summarization.abstractive.available_huggingface()[source]#

List available huggingface models.

malaya.summarization.abstractive.transformer(model='small-t5', quantized=False, **kwargs)[source]#

Load Malaya transformer encoder-decoder model to generate a summary given a string.

Parameters
  • model (str, optional (default='small-t5')) – Check available models at malaya.summarization.abstractive.available_transformer().

  • quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if t5 in model, will return malaya.model.t5.Summarization.

  • if bigbird in model, will return malaya.model.bigbird.Summarization.

  • if pegasus in model, will return malaya.model.pegasus.Summarization.

Return type

model

malaya.summarization.abstractive.huggingface(model='mesolitica/finetune-summarization-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]#

Load HuggingFace model to abstractive summarization.

Parameters
  • model (str, optional (default='mesolitica/finetune-summarization-t5-small-standard-bahasa-cased')) – Check available models at malaya.summarization.abstractive.available_huggingface().

  • force_check (bool, optional (default=True)) – check that the model is one of the Malaya models. Set to False if you have your own huggingface model.

Returns

result

Return type

malaya.torch_model.huggingface.Summarization

malaya.summarization.extractive#

malaya.summarization.extractive.encoder(vectorizer)[source]#

Encoder interface for summarization.

Parameters

vectorizer (object) – encoder interface object, eg, BERT, XLNET, ALBERT, ALXLNET. should have vectorize method.

Returns

result

Return type

malaya.model.extractive_summarization.Encoder

malaya.summarization.extractive.doc2vec(wordvector)[source]#

Doc2Vec interface for summarization.

Parameters

wordvector (object) – malaya.wordvector.WordVector object. should have get_vector_by_name method.

Returns

result

Return type

malaya.model.extractive_summarization.Doc2Vec

malaya.summarization.extractive.sklearn(model, vectorizer)[source]#

sklearn interface for summarization.

Parameters
  • model (object) –

    Should have fit_transform method. Commonly:

    • sklearn.decomposition.TruncatedSVD - LSA algorithm.

    • sklearn.decomposition.LatentDirichletAllocation - LDA algorithm.

  • vectorizer (object) –

    Should have fit_transform method. Commonly:

    • sklearn.feature_extraction.text.TfidfVectorizer - TFIDF algorithm.

    • sklearn.feature_extraction.text.CountVectorizer - Bag-of-Word algorithm.

    • malaya.text.vectorizer.SkipGramCountVectorizer - Skip Gram Bag-of-Word algorithm.

    • malaya.text.vectorizer.SkipGramTfidfVectorizer - Skip Gram TFIDF algorithm.

Returns

result

Return type

malaya.model.extractive_summarization.SKLearn

malaya.topic_model.decomposition#

malaya.topic_model.decomposition.fit(corpus, model, vectorizer, n_topics, cleaning=<function simple_textcleaning>, stopwords=<function get_stopwords>, **kwargs)[source]#

Train a SKlearn model to do topic modelling based on corpus given.

Parameters
  • corpus (list) –

  • model (object) –

    Should have fit_transform method. Commonly:

    • sklearn.decomposition.TruncatedSVD - LSA algorithm.

    • sklearn.decomposition.LatentDirichletAllocation - LDA algorithm.

    • sklearn.decomposition.NMF - NMF algorithm.

  • vectorizer (object) –

    Should have fit_transform method. Commonly:

    • sklearn.feature_extraction.text.TfidfVectorizer - TFIDF algorithm.

    • sklearn.feature_extraction.text.CountVectorizer - Bag-of-Word algorithm.

    • malaya.text.vectorizer.SkipGramCountVectorizer - Skip Gram Bag-of-Word algorithm.

    • malaya.text.vectorizer.SkipGramTfidfVectorizer - Skip Gram TFIDF algorithm.

  • n_topics (int) – size of decomposition column.

  • cleaning (function, (default=malaya.text.function.simple_textcleaning)) – function to clean the corpus.

  • stopwords (List[str], (default=malaya.text.function.get_stopwords)) – a callable that returns a List[str], or a List[str], or a Tuple[str].

Returns

result

Return type

malaya.topic_model.decomposition.Topic class
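
Because stopwords may be either a callable or a ready-made list or tuple, a caller-side normalisation step might look like the sketch below. The helper names are hypothetical and this is not Malaya's internal code; it only shows one way to accept all three documented forms.

```python
def resolve_stopwords(stopwords):
    """Accept a callable returning words, or a list/tuple of words; return a set."""
    if callable(stopwords):
        stopwords = stopwords()
    if not isinstance(stopwords, (list, tuple)):
        raise ValueError('stopwords must be a callable, a list or a tuple')
    return set(stopwords)


def remove_stopwords(text, stopwords):
    """Drop whitespace-separated tokens that appear in the stopword set."""
    words = resolve_stopwords(stopwords)
    return ' '.join(t for t in text.split() if t.lower() not in words)
```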

class malaya.topic_model.decomposition.Topic[source]#
visualize_topics(notebook_mode=False, mds='pcoa')[source]#

Print important topics based on decomposition.

Parameters

mds (str, optional (default='pcoa')) –

2D Decomposition. Allowed values:

  • 'pcoa' - Dimension reduction via Jensen-Shannon Divergence & Principal Coordinate Analysis (aka Classical Multidimensional Scaling)

  • 'mmds' - Dimension reduction via Multidimensional scaling

  • 'tsne' - Dimension reduction via t-distributed stochastic neighbor embedding

top_topics(len_topic, top_n=10, return_df=True)[source]#

Print important topics based on decomposition.

Parameters
  • len_topic (int) – size of topics.

  • top_n (int, optional (default=10)) – top n of each topic.

  • return_df (bool, optional (default=True)) – return as pandas.DataFrame, else JSON.

get_topics(len_topic)[source]#

Return important topics based on decomposition.

Parameters

len_topic (int) –

Returns

result

Return type

List[str]

get_sentences(len_sentence, k=0)[source]#

Return important sentences related to selected column based on decomposition.

Parameters
  • len_sentence (int) –

  • k (int, (default=0)) – index of decomposition matrix.

Returns

result

Return type

List[str]

malaya.topic_model.lda2vec#

malaya.topic_model.lda2vec.fit(corpus, vectorizer, n_topics=10, cleaning=<function simple_textcleaning>, stopwords=<function get_stopwords>, window_size=2, embedding_size=128, epoch=10, switch_loss=1000, random_state=10, **kwargs)[source]#

Train a LDA2Vec model to do topic modelling based on corpus / list of strings given.

Parameters
  • corpus (list) –

  • vectorizer (object) –

    Should have fit_transform method. Commonly:

    • sklearn.feature_extraction.text.TfidfVectorizer - TFIDF algorithm.

    • sklearn.feature_extraction.text.CountVectorizer - Bag-of-Word algorithm.

    • malaya.text.vectorizer.SkipGramCountVectorizer - Skip Gram Bag-of-Word algorithm.

    • malaya.text.vectorizer.SkipGramTfidfVectorizer - Skip Gram TFIDF algorithm.

  • n_topics (int, (default=10)) – size of decomposition column.

  • cleaning (function, (default=malaya.text.function.simple_textcleaning)) – function to clean the corpus.

  • stopwords (List[str], (default=malaya.text.function.get_stopwords)) – a callable that returns a List[str], or a List[str], or a Tuple[str].

  • embedding_size (int, (default=128)) – embedding size of lda2vec tensors.

  • epoch (int, (default=10)) – training iteration, how many loop need to train.

  • switch_loss (int, (default=1000)) – baseline to switch from document based loss to document + word based loss.

  • random_state (int, (default=10)) – random_state for sklearn.utils.shuffle parameter

Returns

result

Return type

malaya.topic_model.lda2vec.DeepTopic class

class malaya.topic_model.lda2vec.DeepTopic[source]#
visualize_topics(notebook_mode=False, mds='pcoa')[source]#

Print important topics based on decomposition.

Parameters

mds (str, optional (default='pcoa')) –

2D Decomposition. Allowed values:

  • 'pcoa' - Dimension reduction via Jensen-Shannon Divergence & Principal Coordinate Analysis (aka Classical Multidimensional Scaling)

  • 'mmds' - Dimension reduction via Multidimensional scaling

  • 'tsne' - Dimension reduction via t-distributed stochastic neighbor embedding

top_topics(len_topic, top_n=10, return_df=True)[source]#

Print important topics based on decomposition.

Parameters
  • len_topic (int) – size of topics.

  • top_n (int, optional (default=10)) – top n of each topic.

  • return_df (bool, optional (default=True)) – return as pandas.DataFrame, else JSON.

get_topics(len_topic)[source]#

Return important topics based on decomposition.

Parameters

len_topic (int) – size of topics.

Returns

result

Return type

List[str]

get_sentences(len_sentence, k=0)[source]#

Return important sentences related to selected column based on decomposition.

Parameters
  • len_sentence (int) –

  • k (int, (default=0)) – index of decomposition matrix.

Returns

result

Return type

List[str]

malaya.topic_model.transformer#

class malaya.topic_model.transformer.AttentionTopic[source]#
top_topics(len_topic, top_n=10, return_df=True)[source]#

Print important topics based on decomposition.

Parameters
  • len_topic (int) – size of topics.

  • top_n (int, optional (default=10)) – top n of each topic.

  • return_df (bool, optional (default=True)) – return as pandas.DataFrame, else JSON.

get_topics(len_topic)[source]#

Return important topics based on decomposition.

Parameters

len_topic (int) – size of topics.

Returns

result

Return type

List[str]

malaya.translation.en_ms#

malaya.translation.en_ms.available_transformer()[source]#

List available transformer models.

malaya.translation.en_ms.available_huggingface()[source]#

List available HuggingFace models.

malaya.translation.en_ms.dictionary(**kwargs)[source]#

Load dictionary {EN: MS}.

Returns

result

Return type

Dict[str, str]
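
Since the result is a plain Dict[str, str], it can be used for simple word-by-word lookup. The sketch below uses a tiny hand-written mapping in place of the loaded dictionary, and it assumes lowercase keys; both are assumptions for illustration.

```python
def translate_tokens(tokens, dictionary):
    """Look each token up in `dictionary`, keeping unknown tokens unchanged."""
    return [dictionary.get(t.lower(), t) for t in tokens]


# Hypothetical entries standing in for the loaded {EN: MS} dictionary.
en_ms = {'i': 'saya', 'eat': 'makan', 'chicken': 'ayam'}
translated = translate_tokens(['I', 'eat', 'chicken', 'daily'], en_ms)
```

Word-by-word lookup ignores word order and context, so it is a fallback rather than a replacement for the translation models.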

malaya.translation.en_ms.transformer(model='base', quantized=False, **kwargs)[source]#

Load transformer encoder-decoder model to translate EN-to-MS.

Parameters
  • model (str, optional (default='base')) – Check available models at malaya.translation.en_ms.available_transformer().

  • quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if bigbird in model, return malaya.model.bigbird.Translation.

  • else, return malaya.model.tf.Translation.

Return type

model

malaya.translation.en_ms.huggingface(model='mesolitica/finetune-translation-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]#

Load HuggingFace model to translate EN-to-MS.

Parameters
  • model (str, optional (default='mesolitica/finetune-translation-t5-small-standard-bahasa-cased')) – Check available models at malaya.translation.en_ms.available_huggingface().

  • force_check (bool, optional (default=True)) – check that the model is one of the Malaya models. Set to False if you have your own huggingface model.

Returns

result

Return type

malaya.torch_model.huggingface.Generator

malaya.translation.ms_en#

malaya.translation.ms_en.available_transformer()[source]#

List available transformer models.

malaya.translation.ms_en.available_huggingface()[source]#

List available HuggingFace models.

malaya.translation.ms_en.transformer(model='base', quantized=False, **kwargs)[source]#

Load Transformer encoder-decoder model to translate MS-to-EN.

Parameters
  • model (str, optional (default='base')) – Check available models at malaya.translation.ms_en.available_transformer().

  • quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if bigbird in model, return malaya.model.bigbird.Translation.

  • else, return malaya.model.tf.Translation.

Return type

model

malaya.translation.ms_en.huggingface(model='mesolitica/finetune-translation-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]#

Load HuggingFace model to translate MS-to-EN.

Parameters
  • model (str, optional (default='mesolitica/finetune-translation-t5-small-standard-bahasa-cased')) – Check available models at malaya.translation.ms_en.available_huggingface().

  • force_check (bool, optional (default=True)) – check that the model is one of the Malaya models. Set to False if you have your own huggingface model.

Returns

result

Return type

malaya.torch_model.huggingface.Generator

malaya.translation.ms_en.dictionary(**kwargs)[source]#

Load dictionary {MS: EN}.

Returns

result

Return type

Dict[str, str]

malaya.zero_shot.classification#

malaya.zero_shot.classification.available_transformer()[source]#

List available transformer zero-shot models.

malaya.zero_shot.classification.available_huggingface()[source]#

List available huggingface zero-shot models.

malaya.zero_shot.classification.transformer(model='bert', quantized=False, **kwargs)[source]#

Load Transformer zero-shot model.

Parameters
  • model (str, optional (default='bert')) – Check available models at malaya.zero_shot.classification.available_transformer().

  • quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if bert in model, will return malaya.model.bert.ZeroshotBERT.

  • if xlnet in model, will return malaya.model.xlnet.ZeroshotXLNET.

Return type

model

malaya.zero_shot.classification.huggingface(model='mesolitica/finetune-mnli-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]#

Load HuggingFace model to zeroshot text classification.

Parameters
  • model (str, optional (default='mesolitica/finetune-mnli-t5-small-standard-bahasa-cased')) – Check available models at malaya.zero_shot.classification.available_huggingface().

  • force_check (bool, optional (default=True)) – check that the model is one of the Malaya models. Set to False if you have your own huggingface model.

Returns

result

Return type

malaya.torch_model.huggingface.ZeroShotClassification

malaya.zero_shot.entity#

malaya.zero_shot.entity.available_huggingface()[source]#

List available huggingface models.

malaya.zero_shot.entity.huggingface(model='mesolitica/finetune-zeroshot-ner-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]#

Load HuggingFace model to zeroshot NER.

Parameters
  • model (str, optional (default='mesolitica/finetune-zeroshot-ner-t5-small-standard-bahasa-cased')) – Check available models at malaya.zero_shot.entity.available_huggingface().

  • force_check (bool, optional (default=True)) – check that the model is one of the Malaya models. Set to False if you have your own huggingface model.

Returns

result

Return type

malaya.torch_model.huggingface.ZeroShotNER

malaya.cluster#

malaya.cluster.cluster_words(list_words, lowercase=False)[source]#

Cluster similar words based on structure, e.g. [‘mahathir mohamad’, ‘mahathir’] becomes [‘mahathir mohamad’]. Time complexity is O(n^2).

Parameters
  • list_words (List[str]) –

  • lowercase (bool, optional (default=False)) – if True, will group using lowercase but maintain the original form.

Returns

string

Return type

List[str]
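
One reading of "similar based on structure" is that a word is absorbed when it is contained in another, longer entry. The sketch below implements that reading with a pairwise O(n^2) scan; the containment rule and the function name are assumptions, not Malaya's actual logic.

```python
def cluster_words_sketch(list_words, lowercase=False):
    """Drop every word that is a substring of another, different word."""
    def key(w):
        # Compare in lowercase if requested, but return the original form.
        return w.lower() if lowercase else w

    kept = []
    for i, word in enumerate(list_words):
        absorbed = any(
            i != j and key(word) != key(other) and key(word) in key(other)
            for j, other in enumerate(list_words)
        )
        if not absorbed and word not in kept:
            kept.append(word)
    return kept
```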

malaya.cluster.cluster_pos(result)[source]#

cluster similar POS.

Parameters

result (List[Tuple[str, str]]) –

Returns

result

Return type

Dict[str, List[str]]

malaya.cluster.cluster_entities(result)[source]#

cluster similar Entities.

Parameters

result (List[Tuple[str, str]]) –

Returns

result

Return type

Dict[str, List[str]]

malaya.cluster.cluster_tagging(result)[source]#

cluster any tagging results, as long the data passed [(string, label), (string, label)].

Parameters

result (List[Tuple[str, str]]) –

Returns

result

Return type

Dict[str, List[str]]
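
The mapping from [(string, label), ...] to Dict[str, List[str]] can be sketched as follows, assuming that consecutive tokens sharing a label are joined into one phrase. The function name, the joining rule and the example labels are illustrative assumptions.

```python
def cluster_tagging_sketch(result):
    """Join consecutive tokens sharing a label, then group phrases by label."""
    output = {}
    phrase, current = [], None
    for token, label in result:
        if label != current and phrase:
            # Label changed: flush the phrase collected so far.
            output.setdefault(current, []).append(' '.join(phrase))
            phrase = []
        phrase.append(token)
        current = label
    if phrase:
        output.setdefault(current, []).append(' '.join(phrase))
    return output


tagged = [
    ('Husein', 'person'), ('Zolkepli', 'person'),
    ('suka', 'OTHER'), ('makan', 'OTHER'), ('ayam', 'OTHER'), ('di', 'OTHER'),
    ('Kuala', 'location'), ('Lumpur', 'location'),
]
clustered = cluster_tagging_sketch(tagged)
```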

malaya.cluster.cluster_scatter(corpus, vectorizer, num_clusters=5, titles=None, colors=None, stopwords=<function get_stopwords>, cleaning=<function simple_textcleaning>, clustering=<class 'sklearn.cluster._kmeans.KMeans'>, decomposition=<class 'sklearn.manifold._mds.MDS'>, ngram=(1, 3), figsize=(17, 9), batch_size=20)[source]#

plot scatter plot on similar text clusters.

Parameters
  • corpus (List[str]) –

  • vectorizer (class) – vectorizer class.

  • num_clusters (int, (default=5)) – size of unsupervised clusters.

  • titles (List[str], (default=None)) – list of titles, length must match the length of corpus.

  • colors (List[str], (default=None)) – list of colors, length must match num_clusters.

  • stopwords (List[str], (default=malaya.text.function.get_stopwords)) – a callable that returns a List[str], or a List[str], or a Tuple[str].

  • ngram (Tuple[int, int], (default=(1,3))) – n-grams size to train a corpus.

  • cleaning (function, (default=malaya.text.function.simple_textcleaning)) – function to clean the corpus.

  • batch_size (int, (default=20)) – size of strings for each vectorization and attention. Only useful with a transformer vectorizer.

Returns

dictionary

Return type

{‘X’: X, ‘Y’: Y, ‘labels’: clusters, ‘vector’: transformed_text_clean, ‘titles’: titles}

malaya.cluster.cluster_dendogram(corpus, vectorizer, titles=None, stopwords=<function get_stopwords>, cleaning=<function simple_textcleaning>, random_samples=0.3, ngram=(1, 3), figsize=(17, 9), batch_size=20)[source]#

plot hierarchical dendrogram with similar texts.

Parameters
  • corpus (List[str]) –

  • vectorizer (class) – vectorizer class.

  • num_clusters (int, (default=5)) – size of unsupervised clusters.

  • titles (List[str], (default=None)) – list of titles, length must match the length of corpus.

  • stopwords (List[str], (default=malaya.text.function.get_stopwords)) – a callable that returns a List[str], or a List[str], or a Tuple[str].

  • cleaning (function, (default=malaya.text.function.simple_textcleaning)) – function to clean the corpus.

  • random_samples (float, (default=0.3)) – random samples from the corpus, 0.3 means 30%.

  • ngram (Tuple[int, int], (default=(1,3))) – n-grams size to train a corpus.

  • batch_size (int, (default=20)) – size of strings for each vectorization and attention. Only useful if use transformer vectorizer.

Returns

dictionary

Return type

{‘linkage_matrix’: linkage_matrix, ‘titles’: titles}

malaya.cluster.cluster_graph(corpus, vectorizer, threshold=0.9, num_clusters=5, titles=None, colors=None, stopwords=<function get_stopwords>, ngram=(1, 3), cleaning=<function simple_textcleaning>, clustering=<class 'sklearn.cluster._kmeans.KMeans'>, figsize=(17, 9), with_labels=True, batch_size=20)[source]#

plot undirected graph with similar texts.

Parameters
  • corpus (List[str]) –

  • vectorizer (class) – vectorizer class.

  • threshold (float, (default=0.9)) – minimum absolute Pearson correlation; 0.9 keeps pairs whose absolute correlation is above 0.9.

  • num_clusters (int, (default=5)) – size of unsupervised clusters.

  • titles (List[str], (default=None)) – list of titles, length must match the length of corpus.

  • stopwords (List[str], (default=malaya.text.function.get_stopwords)) – a callable that returns a List[str], or a List[str], or a Tuple[str].

  • cleaning (function, (default=malaya.text.function.simple_textcleaning)) – function to clean the corpus.

  • ngram (Tuple[int, int], (default=(1,3))) – n-grams size to train a corpus.

  • batch_size (int, (default=20)) – size of strings for each vectorization and attention. Only useful if use transformer vectorizer.

Returns

dictionary

Return type

{‘G’: G, ‘pos’: pos, ‘node_colors’: node_colors, ‘node_labels’: node_labels}
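
The threshold on absolute Pearson correlation can be illustrated with a small pure-Python sketch that decides which pairs of document vectors get an edge. The helper names are hypothetical, the vectors are toy data, and treating the threshold as inclusive (>=) is an assumption.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def edges_above(vectors, threshold=0.9):
    """Pairs (i, j), i < j, whose absolute correlation reaches the threshold."""
    return [
        (i, j)
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
        if abs(pearson(vectors[i], vectors[j])) >= threshold
    ]


vectors = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [1, -1, 1, -1]]
edges = edges_above(vectors, threshold=0.9)
```

Using the absolute value means strongly anti-correlated vectors are linked as well, which is why the third vector connects to the first two.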

malaya.cluster.cluster_entity_linking(corpus, vectorizer, entity_model, topic_modeling_model, threshold=0.3, topic_decomposition=2, topic_length=10, fuzzy_ratio=70, accepted_entities=['law', 'location', 'organization', 'person', 'event'], cleaning=<function simple_textcleaning>, colors=None, stopwords=<function get_stopwords>, max_df=1.0, min_df=1, ngram=(2, 3), figsize=(17, 9), batch_size=20)[source]#

plot undirected graph for Entities and topics relationship.

Parameters
  • corpus (list or str) –

  • vectorizer (class) –

  • titles (list) – list of titles, length must match the length of corpus.

  • colors (list) – list of colors, length must match num_clusters.

  • threshold (float, (default=0.3)) – minimum absolute Pearson correlation; 0.3 keeps pairs whose absolute correlation is above 0.3.

  • topic_decomposition (int, (default=2)) – size of decomposition.

  • topic_length (int, (default=10)) – size of topic models.

  • fuzzy_ratio (int, (default=70)) – size of ratio for fuzzywuzzy.

  • max_df (float, (default=1.0)) – maximum document frequency for a word to be selected.

  • min_df (int, (default=1)) – minimum document frequency for a word to be selected.

  • ngram (tuple, (default=(2,3))) – n-grams size to train a corpus.

  • cleaning (function, (default=simple_textcleaning)) – function to clean the corpus.

  • stopwords (List[str], (default=malaya.text.function.get_stopwords)) – a callable that returns a List[str], or a List[str], or a Tuple[str].

Returns

dictionary

Return type

{‘G’: G, ‘pos’: pos, ‘node_colors’: node_colors, ‘node_labels’: node_labels}

malaya.constituency#

malaya.constituency.available_transformer()[source]#

List available transformer models.

malaya.constituency.transformer(model='xlnet', quantized=False, **kwargs)[source]#

Load Transformer Constituency Parsing model, transfer learning Transformer + self attentive parsing.

Parameters
  • model (str, optional (default='xlnet')) – Check available models at malaya.constituency.available_transformer().

  • quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result

Return type

malaya.model.tf.Constituency class

malaya.coref#

malaya.coref.parse_from_dependency(models, string, references=['dia', 'itu', 'ini', 'saya', 'awak', 'kamu', 'kita', 'kami', 'mereka'], rejected_references=['saya', 'awak', 'kamu', 'kita', 'kami', 'mereka', 'nya'], acceptable_subjects=['flat', 'subj', 'nsubj', 'csubj', 'obj'], acceptable_nested_subjects=['compound', 'flat'], split_nya=True, aggregate=<function mean>, top_k=20)[source]#

Apply Coreference Resolution using stacks of dependency models.

Parameters
  • models (list) – list of dependency models, each must have a vectorize method.

  • string (str) –

  • references (List[str], optional (default=['dia', 'itu', 'ini', 'saya', 'awak', 'kamu', 'kita', 'kami', 'mereka'])) – list of references.

  • rejected_references (List[str], optional (default=['saya', 'awak', 'kamu', 'kita', 'kami', 'mereka', 'nya'])) – list of references rejected while populating subjects.

  • acceptable_subjects (List[str], optional) – List of dependency labels for subjects.

  • acceptable_nested_subjects (List[str], optional) – List of dependency labels for nested subjects, eg, syarikat (obl) facebook (compound).

  • split_nya (bool, optional (default=True)) – split nya, eg, disifatkannya -> disifatkan, nya.

  • aggregate (Callable, optional (default=numpy.mean)) – Aggregate function to aggregate list of vectors from model.vectorize.

  • top_k (int, optional (default=20)) – only accept near top_k to assume a coherence.

Returns

result – {‘text’: [‘Husein’,’Zolkepli’,’suka’,’makan’,’ayam’,’.’,’Dia’,’pun’,’suka’,’makan’,’daging’,’.’], ‘coref’: {6: {‘index’: [0, 1], ‘text’: [‘Husein’, ‘Zolkepli’]}}}

Return type

Dict[text, coref]
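
The returned structure can be consumed as in the sketch below, which substitutes each referring token with the text it points to. The helper apply_coref is hypothetical; the result dictionary is the example shown in the Returns description above.

```python
def apply_coref(result):
    """Replace each referring token with the text of the tokens it refers to."""
    tokens = list(result['text'])
    for position, reference in result['coref'].items():
        tokens[position] = ' '.join(reference['text'])
    return tokens


result = {
    'text': ['Husein', 'Zolkepli', 'suka', 'makan', 'ayam', '.',
             'Dia', 'pun', 'suka', 'makan', 'daging', '.'],
    'coref': {6: {'index': [0, 1], 'text': ['Husein', 'Zolkepli']}},
}
resolved = apply_coref(result)
```

Here the key 6 is the position of 'Dia' in text, and index lists the positions of the tokens it refers to.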

malaya.dependency#

malaya.dependency.describe()[source]#

Describe Dependency supported.

malaya.dependency.dependency_graph(tagging, indexing)[source]#

Return helper object for dependency parser results. Only accept tagging and indexing outputs from dependency models.

malaya.dependency.available_transformer(version='v2')[source]#

List available transformer dependency parsing models.

Parameters

version (str, optional (default='v2')) –

Version supported. Allowed values:

  • 'v1' - version 1, maintained for knowledge graph.

  • 'v2' - Trained on bigger dataset, better version.

malaya.dependency.available_huggingface()[source]#

List available huggingface models.

malaya.dependency.transformer(version='v2', model='xlnet', quantized=False, **kwargs)[source]#

Load Transformer Dependency Parsing model, transfer learning Transformer + biaffine attention.

Parameters
  • version (str, optional (default='v2')) –

    Version supported. Allowed values:

    • 'v1' - version 1, maintained for knowledge graph, malaya_graph.text_to_kg.parser.from_dependency

    • 'v2' - Trained on bigger dataset, better version.

  • model (str, optional (default='xlnet')) – Check available models at malaya.dependency.available_transformer(version=’{version}’).

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if bert in model, will return malaya.model.bert.DependencyBERT.

  • if xlnet in model, will return malaya.model.xlnet.DependencyXLNET.

Return type

model

malaya.dependency.huggingface(model='mesolitica/finetune-dependency-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]#

Load HuggingFace model to dependency parsing.

Parameters
  • model (str, optional (default='mesolitica/finetune-dependency-t5-small-standard-bahasa-cased')) – Check available models at malaya.dependency.available_huggingface().

  • force_check (bool, optional (default=True)) – Force check model one of malaya model. Set to False if you have your own huggingface model.

Returns

result

Return type

malaya.torch_model.huggingface.Dependency

malaya.emotion#

malaya.emotion.available_transformer()[source]#

List available transformer emotion analysis models.

malaya.emotion.multinomial(**kwargs)[source]#

Load multinomial emotion model.

Returns

result

Return type

malaya.model.ml.MulticlassBayes class

malaya.emotion.transformer(model='xlnet', quantized=False, **kwargs)[source]#

Load Transformer emotion model.

Parameters
  • model (str, optional (default='xlnet')) – Check available models at malaya.emotion.available_transformer().

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if bert in model, will return malaya.model.bert.MulticlassBERT.

  • if xlnet in model, will return malaya.model.xlnet.MulticlassXLNET.

  • if fastformer in model, will return malaya.model.fastformer.MulticlassFastFormer.

Return type

model

malaya.entity#

malaya.entity.describe()[source]#

Describe Entities supported.

malaya.entity.describe_ontonotes5()[source]#

Describe OntoNotes5 Entities supported. https://spacy.io/api/annotation#named-entities

malaya.entity.available_transformer()[source]#

List available transformer Entity Tagging models.

malaya.entity.available_transformer_ontonotes5()[source]#

List available transformer Entity Tagging models trained on Ontonotes 5 Bahasa.

malaya.entity.transformer(model='xlnet', quantized=False, **kwargs)[source]#

Load Transformer Entity Tagging model trained on Malaya Entity, transfer learning Transformer + CRF.

Parameters
  • model (str, optional (default='xlnet')) – Check available models at malaya.entity.available_transformer().

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if bert in model, will return malaya.model.bert.TaggingBERT.

  • if xlnet in model, will return malaya.model.xlnet.TaggingXLNET.

  • if fastformer in model, will return malaya.model.fastformer.TaggingFastFormer.

Return type

model

malaya.entity.transformer_ontonotes5(model='xlnet', quantized=False, **kwargs)[source]#

Load Transformer Entity Tagging model trained on Ontonotes 5 Bahasa, transfer learning Transformer + CRF.

Parameters
  • model (str, optional (default='xlnet')) – Check available models at malaya.entity.available_transformer_ontonotes5().

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if bert in model, will return malaya.model.bert.TaggingBERT.

  • if xlnet in model, will return malaya.model.xlnet.TaggingXLNET.

  • if fastformer in model, will return malaya.model.fastformer.TaggingFastFormer.

Return type

model

malaya.entity.general_entity(model=None)[source]#

Load Regex based general entities tagging along with another supervised entity tagging model.

Parameters

model (object) – model must have a predict method. Make sure the predict method returns [(string, label), (string, label)].

Returns

result

Return type

malaya.text.entity.EntityRegex class
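The only stated requirement on `model` is a `predict` method returning a list of `(string, label)` pairs. A minimal sketch of an object that satisfies that contract might look like the following; the class `KeywordTagger` and the labels `'person'` and `'OTHER'` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


class KeywordTagger:
    """Hypothetical stand-in model exposing the documented predict contract."""

    def __init__(self, people: List[str]):
        self.people = set(people)

    def predict(self, string: str) -> List[Tuple[str, str]]:
        # Return one (token, label) pair per whitespace-separated token.
        return [
            (token, 'person' if token in self.people else 'OTHER')
            for token in string.split()
        ]


tagger = KeywordTagger(people=['Husein'])
print(tagger.predict('Husein suka makan ayam'))
```

Any object shaped like this, including a trained tagging model, should be usable wherever the parameter description above asks for a `predict` method.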

malaya.jawi_rumi#

malaya.jawi_rumi.available_transformer()[source]#

List available transformer models.

malaya.jawi_rumi.transformer(model='base', quantized=False, **kwargs)[source]#

Load transformer encoder-decoder model to convert jawi to rumi.

Parameters
  • model (str, optional (default='base')) – Check available models at malaya.jawi_rumi.available_transformer().

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result

Return type

malaya.model.tf.JawiRumi class

malaya.language_detection#

malaya.language_detection.fasttext(quantized=True, **kwargs)[source]#

Load Fasttext language detection model. Original size is 353MB, Quantized size 31.1MB.

Parameters

quantized (bool, optional (default=True)) – if True, load quantized fasttext model. Else, load original fasttext model.

Returns

result

Return type

malaya.model.ml.LanguageDetection class

malaya.language_detection.deep_model(quantized=False, **kwargs)[source]#

Load deep learning language detection model. Original size is 51.2MB, Quantized size 12.8MB.

Parameters

quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result

Return type

malaya.model.tf.DeepLang class

malaya.language_detection.substring_rules(model, **kwargs)[source]#

Detect EN, MS, MANDARIN and OTHER languages in a string.

EN word detection uses pyenchant from https://pyenchant.github.io/pyenchant/ and the user's language detection model.

MS word detection uses malaya.text.function.is_malay and the user's language detection model.

OTHER word detection uses any language detection classification model, such as malaya.language_detection.fasttext or malaya.language_detection.deep_model.

Parameters

model (Callable) – Callable model, must have predict method.

Returns

result

Return type

malaya.model.rules.LanguageDict class

malaya.language_model#

malaya.language_model.available_kenlm()[source]#

List available KenLM Language Model.

malaya.language_model.available_gpt2()[source]#

List available GPT2 Language Model.

malaya.language_model.available_mlm()[source]#

List available MLM Language Model.

malaya.language_model.kenlm(model='dump-combined', **kwargs)[source]#

Load KenLM language model.

Parameters

model (str, optional (default='dump-combined')) – Check available models at malaya.language_model.available_kenlm().

Returns

result

Return type

kenlm.Model class

malaya.language_model.gpt2(model='mesolitica/gpt2-117m-bahasa-cased', force_check=True, **kwargs)[source]#

Load GPT2 language model.

Parameters
  • model (str, optional (default='mesolitica/gpt2-117m-bahasa-cased')) – Check available models at malaya.language_model.available_gpt2().

  • force_check (bool, optional (default=True)) – Force check model one of malaya model. Set to False if you have your own huggingface model.

Returns

result

Return type

malaya.torch_model.gpt2_lm.LM class

malaya.language_model.mlm(model='malay-huggingface/bert-tiny-bahasa-cased', force_check=True, **kwargs)[source]#

Load Masked language model.

Parameters
  • model (str, optional (default='malay-huggingface/bert-tiny-bahasa-cased')) – Check available models at malaya.language_model.available_mlm().

  • force_check (bool, optional (default=True)) – Force check model one of malaya model. Set to False if you have your own huggingface model.

Returns

result

Return type

malaya.torch_model.mask_lm.MLMScorer class

malaya.lexicon#

malaya.lexicon.random_walk(lexicon, wordvector, pool_size=10, top_n=20, similarity_power=10.0, beta=0.9, arccos=True, normalization=True, soft=False, silent=False)[source]#

Induce lexicon using the random walk technique from the paper https://arxiv.org/pdf/1606.02820.pdf

Parameters
  • lexicon (Dict[str, List[str]]) – curated lexicon from expert domain, {‘label1’: [str], ‘label2’: [str]}.

  • wordvector (object) – wordvector interface object.

  • pool_size (int, optional (default=10)) – pick the top pool_size words from each lexicon.

  • top_n (int, optional (default=20)) – the top_n nearest vectors for each word will be multiplied by similarity_power.

  • similarity_power (float, optional (default=10.0)) – extra score for top_n; a lower value induces less bias but has a higher chance of an unbalanced outcome.

  • beta (float, optional (default=0.9)) – penalty score, towards to 1.0 means less penalty. 0 < beta < 1.

  • arccos (bool, optional (default=True)) – covariance distribution for embedded.dot(embedded.T). If false, covariance + 1.

  • normalization (bool, optional (default=True)) – normalize word vectors using L2 norm. L2 is good to penalize skewed vectors.

  • soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the nearest word by Jaro-Winkler ratio. If False, it will throw an exception if a word is not in the dictionary.

  • silent (bool, optional (default=False)) – if True, will not print any logs.

Returns

result

Return type

tuple(labels[argmax(scores, axis = 1)], scores, labels)
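The first element of the returned tuple can be read as "for each word, the label whose score is highest". The sketch below shows that indexing step in plain Python, assuming `scores` holds one row per word and one column per label; the words, labels, and score values are made up for illustration and do not come from malaya.

```python
from typing import List

labels = ['negative', 'positive']     # illustrative label names
words = ['teruk', 'bagus', 'hebat']   # illustrative vocabulary
scores = [                            # one row per word, one column per label
    [0.9, 0.1],
    [0.2, 0.8],
    [0.3, 0.7],
]


def argmax_labels(scores: List[List[float]], labels: List[str]) -> List[str]:
    """Pick, for each row, the label whose column holds the highest score."""
    return [labels[max(range(len(row)), key=row.__getitem__)] for row in scores]


predicted = argmax_labels(scores, labels)
print(dict(zip(words, predicted)))
```

With array inputs the same step would typically be a single argmax along axis 1 followed by indexing into `labels`.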

malaya.lexicon.propagate_probabilistic(lexicon, wordvector, pool_size=10, top_n=20, similarity_power=10.0, arccos=True, normalization=True, soft=False, silent=False)[source]#

Learns polarity scores via standard label propagation from lexicon sets.

Parameters
  • lexicon (Dict[str, List[str]]) – curated lexicon from expert domain, {‘label1’: [str], ‘label2’: [str]}.

  • wordvector (object) – wordvector interface object.

  • pool_size (int, optional (default=10)) – pick the top pool_size words from each lexicon.

  • top_n (int, optional (default=20)) – the top_n nearest vectors for each word will be multiplied by similarity_power.

  • similarity_power (float, optional (default=10.0)) – extra score for top_n; a lower value induces less bias but has a higher chance of an unbalanced outcome.

  • arccos (bool, optional (default=True)) – covariance distribution for embedded.dot(embedded.T). If false, covariance + 1.

  • normalization (bool, optional (default=True)) – normalize word vectors using L2 norm. L2 is good to penalize skewed vectors.

  • soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the nearest word by Jaro-Winkler ratio. If False, it will throw an exception if a word is not in the dictionary.

  • silent (bool, optional (default=False)) – if True, will not print any logs.

Returns

result

Return type

tuple(labels[argmax(scores, axis = 1)], scores, labels)

malaya.lexicon.propagate_graph(lexicon, wordvector, pool_size=10, top_n=20, similarity_power=10.0, normalization=True, soft=False, silent=False)[source]#

Graph propagation method adapted from Velikovich, Leonid, et al. “The viability of web-derived polarity lexicons.” http://www.aclweb.org/anthology/N10-1119

Parameters
  • lexicon (Dict[str, List[str]]) – curated lexicon from expert domain, {‘label1’: [str], ‘label2’: [str]}.

  • wordvector (object) – wordvector interface object.

  • pool_size (int, optional (default=10)) – pick the top pool_size words from each lexicon.

  • top_n (int, optional (default=20)) – the top_n nearest vectors for each word will be multiplied by similarity_power.

  • similarity_power (float, optional (default=10.0)) – extra score for top_n; a lower value induces less bias but has a higher chance of an unbalanced outcome.

  • normalization (bool, optional (default=True)) – normalize word vectors using L2 norm. L2 is good to penalize skewed vectors.

  • soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the nearest word by Jaro-Winkler ratio. If False, it will throw an exception if a word is not in the dictionary.

  • silent (bool, optional (default=False)) – if True, will not print any logs.

Returns

result

Return type

tuple(labels[argmax(scores, axis = 1)], scores, labels)

malaya.nsfw#

malaya.nsfw.lexicon(**kwargs)[source]#

Load Lexicon NSFW model.

Returns

result

Return type

malaya.text.lexicon.nsfw.Lexicon class

malaya.nsfw.multinomial(**kwargs)[source]#

Load multinomial NSFW model.

Returns

result

Return type

malaya.model.ml.BAYES class

malaya.num2word#

malaya.num2word.to_cardinal(number)[source]#

Translate from number input to cardinal text representation

Parameters

number (real number) –

Returns

result – cardinal representation

Return type

str

malaya.num2word.to_ordinal(number)[source]#

Translate from number input to ordinal text representation

Parameters

number (real number) –

Returns

result – ordinal representation

Return type

str

malaya.num2word.to_ordinal_num(number)[source]#

Translate from number input to ordinal numbering text representation

Parameters

number (int) –

Returns

result – ordinal numbering representation

Return type

str

malaya.num2word.to_currency(value)[source]#

Translate from number input to cardinal currency text representation

Parameters

value (int) –

Returns

result – cardinal currency representation

Return type

str

malaya.num2word.to_year(value)[source]#

Translate from number input to cardinal year text representation

Parameters

value (int) –

Returns

result – cardinal year representation

Return type

str

malaya.paraphrase#

malaya.paraphrase.available_transformer()[source]#

List available transformer models.

malaya.paraphrase.available_huggingface()[source]#

List available huggingface models.

malaya.paraphrase.transformer(model='small-t5', quantized=False, **kwargs)[source]#

Load Malaya transformer encoder-decoder model to paraphrase.

Parameters
  • model (str, optional (default='small-t5')) – Check available models at malaya.paraphrase.available_transformer().

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if t5 in model, will return malaya.model.t5.Paraphrase.

Return type

model

malaya.paraphrase.huggingface(model='mesolitica/finetune-paraphrase-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]#

Load HuggingFace model to paraphrase.

Parameters
  • model (str, optional (default='mesolitica/finetune-paraphrase-t5-small-standard-bahasa-cased')) – Check available models at malaya.paraphrase.available_huggingface().

  • force_check (bool, optional (default=True)) – Force check model one of malaya model. Set to False if you have your own huggingface model.

Returns

result

Return type

malaya.torch_model.huggingface.Paraphrase

malaya.phoneme#

malaya.phoneme.deep_model_dbp(quantized=False, **kwargs)[source]#

Load LSTM + Bahdanau Attention phonetic model, 256 filter size, 2 layers, character level. Original data from https://prpm.dbp.gov.my/ Glosari Dialek.

Original size 10.4MB, quantized size 2.77MB.

Parameters

quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result

Return type

malaya.model.tf.Seq2SeqLSTM class

malaya.phoneme.deep_model_ipa(quantized=False, **kwargs)[source]#

Load LSTM + Bahdanau Attention phonetic model, 256 filter size, 2 layers, character level. Original data from https://github.com/open-dict-data/ipa-dict/blob/master/data/ma.txt

Original size 10.4MB, quantized size 2.77MB.

Parameters

quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result

Return type

malaya.model.tf.Seq2SeqLSTM_Split class

malaya.pos#

malaya.pos.describe()[source]#

Describe Part-Of-Speech supported.

malaya.pos.available_transformer()[source]#

List available transformer Part-Of-Speech Tagging models.

malaya.pos.naive(string)[source]#

Recognize POS in a string using Regex.

Parameters

string (str) –

Returns

string

Return type

List[Tuple[str, str]]

malaya.pos.transformer(model='xlnet', quantized=False, **kwargs)[source]#

Load Transformer POS Tagging model, transfer learning Transformer + CRF.

Parameters
  • model (str, optional (default='xlnet')) – Check available models at malaya.pos.available_transformer().

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if bert in model, will return malaya.model.bert.TaggingBERT.

  • if xlnet in model, will return malaya.model.xlnet.TaggingXLNET.

Return type

model

malaya.preprocessing#

malaya.preprocessing.preprocessing(normalize=['url', 'email', 'percent', 'money', 'phone', 'user', 'time', 'date', 'number'], annotate=['allcaps', 'elongated', 'repeated', 'emphasis', 'censored', 'hashtag'], lowercase=True, fix_unidecode=True, expand_english_contractions=True, segmenter=None, demoji=None, **kwargs)[source]#

Load Preprocessing class.

Parameters
  • normalize (List[str], optional (default=['url', 'email', 'percent', 'money', 'phone', 'user', 'time', 'date', 'number'])) – normalizing tokens, can check all supported normalizing at malaya.preprocessing.get_normalize().

  • annotate (List[str], optional (default=['allcaps', 'elongated', 'repeated', 'emphasis', 'censored', 'hashtag'])) – annotate tokens with <open></open>, only accept [‘hashtag’, ‘allcaps’, ‘elongated’, ‘repeated’, ‘emphasis’, ‘censored’].

  • lowercase (bool, optional (default=True)) –

  • fix_unidecode (bool, optional (default=True)) – fix unidecode using ftfy.fix_text.

  • expand_english_contractions (bool, optional (default=True)) – expand english contractions.

  • segmenter (Callable, optional (default=None)) – function to segment words. If provided, it will expand hashtags, e.g. #mondayblues -> monday blues.

  • demoji (object) – demoji object, which needs to have a demoji method.

Returns

result

Return type

malaya.preprocessing.Preprocessing class

malaya.preprocessing.demoji()[source]#

Download latest emoji malay description from https://github.com/huseinzol05/malay-dataset/tree/master/dictionary/emoji

Returns

result

Return type

malaya.preprocessing.Demoji class

class malaya.preprocessing.Preprocessing[source]#
class malaya.preprocessing.Demoji[source]#
demoji(string)[source]#

Find emojis with string representation. 🔥 -> emoji api.

Parameters

string (str) –

Returns

result

Return type

Dict[str]

malaya.relevancy#

malaya.relevancy.available_transformer()[source]#

List available transformer relevancy analysis models.

malaya.relevancy.transformer(model='xlnet', quantized=False, **kwargs)[source]#

Load Transformer relevancy model.

Parameters
  • model (str, optional (default='xlnet')) – Check available models at malaya.relevancy.available_transformer().

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if bert in model, will return malaya.model.bert.MulticlassBERT.

  • if xlnet in model, will return malaya.model.xlnet.MulticlassXLNET.

  • if bigbird in model, will return malaya.model.xlnet.MulticlassBigBird.

  • if fastformer in model, will return malaya.model.fastformer.MulticlassFastFormer.

Return type

model

malaya.rumi_jawi#

malaya.rumi_jawi.available_transformer()[source]#

List available transformer models.

malaya.rumi_jawi.transformer(model='base', quantized=False, **kwargs)[source]#

Load transformer encoder-decoder model to convert rumi to jawi.

Parameters
  • model (str, optional (default='base')) – Check available models at malaya.rumi_jawi.available_transformer().

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result

Return type

malaya.model.tf.RumiJawi class

malaya.segmentation#

malaya.segmentation.viterbi(max_split_length=20, **kwargs)[source]#

Load Segmenter class using viterbi algorithm.

Parameters
  • max_split_length (int, (default=20)) – max length of words in a sentence to segment

  • validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.

Returns

result

Return type

malaya.segmentation.Segmenter class

malaya.segmentation.available_transformer()[source]#

List available transformer models.

malaya.segmentation.available_huggingface()[source]#

List available huggingface models.

malaya.segmentation.transformer(model='small', quantized=False, **kwargs)[source]#

Load transformer encoder-decoder model to segmentation.

Parameters
  • model (str, optional (default='small')) – Check available models at malaya.segmentation.available_transformer().

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result

Return type

malaya.model.tf.Segmentation class

malaya.segmentation.huggingface(model='mesolitica/finetune-segmentation-t5-tiny-standard-bahasa-cased', force_check=True, **kwargs)[source]#

Load HuggingFace model to segmentation.

Parameters
  • model (str, optional (default='mesolitica/finetune-segmentation-t5-tiny-standard-bahasa-cased')) – Check available models at malaya.segmentation.available_huggingface().

  • force_check (bool, optional (default=True)) – Force check model one of malaya model. Set to False if you have your own huggingface model.

Returns

result

Return type

malaya.torch_model.huggingface.Generator

class malaya.segmentation.Segmenter[source]#
segment(strings)[source]#

Segment strings. Example, “sayasygkan negarasaya” -> “saya sygkan negara saya”

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]
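To make the idea of Viterbi-style segmentation concrete, here is a minimal sketch that splits unspaced text into the most probable word sequence under a toy unigram table. The probabilities in `UNIGRAMS`, the unknown-word penalty, and the function names are all assumptions for illustration; they are not malaya's model or API, and `max_word_length` here is a cap on candidate word length, which may differ from the documented `max_split_length`.

```python
import math
from typing import Dict, List

# Toy unigram probabilities; purely illustrative, not malaya's model.
UNIGRAMS: Dict[str, float] = {
    'saya': 0.4, 'sygkan': 0.1, 'negara': 0.3, 'sayang': 0.1, 'kan': 0.1,
}
UNKNOWN_LOGP = math.log(1e-10)  # heavy penalty per unknown character


def viterbi_segment(text: str, max_word_length: int = 20) -> str:
    """Split text into the most probable word sequence under UNIGRAMS."""
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best log-prob of text[:i]
    back = [0] * (n + 1)            # back[i]: start of the last word ending at i
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_length), end):
            word = text[start:end]
            if word in UNIGRAMS:
                logp = math.log(UNIGRAMS[word])
            else:
                logp = UNKNOWN_LOGP * len(word)
            if best[start] + logp > best[end]:
                best[end] = best[start] + logp
                back[end] = start
    # Walk the back-pointers from the end to recover the words.
    words, end = [], n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return ' '.join(reversed(words))


def segment(strings: List[str]) -> List[str]:
    # Segment each whitespace-separated chunk independently.
    return [' '.join(viterbi_segment(chunk) for chunk in s.split()) for s in strings]


print(segment(['sayasygkan negarasaya']))
```

The dynamic program keeps, for every prefix, only the best-scoring split, so the cost grows with text length times the maximum word length rather than with the number of possible splits.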

malaya.sentiment#

malaya.sentiment.available_transformer()[source]#

List available transformer sentiment analysis models.

malaya.sentiment.multinomial(**kwargs)[source]#

Load multinomial sentiment model.

Returns

result

Return type

malaya.model.ml.Bayes class

malaya.sentiment.transformer(model='bert', quantized=False, **kwargs)[source]#

Load Transformer sentiment model.

Parameters
  • model (str, optional (default='bert')) – Check available models at malaya.sentiment.available_transformer().

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if bert in model, will return malaya.model.bert.MulticlassBERT.

  • if xlnet in model, will return malaya.model.xlnet.MulticlassXLNET.

  • if fastformer in model, will return malaya.model.fastformer.MulticlassFastFormer.

Return type

model

malaya.stack#

malaya.stack.voting_stack(models, text)[source]#

Stacking for POS, Entities and Dependency models.

Parameters
  • models (list) – list of models.

  • text (str) – string to predict.

Returns

result

Return type

list
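One common way to stack tagging models is a per-token majority vote. The sketch below assumes that interpretation, and assumes every model returns the same tokens in the same order; the function `majority_vote` and the sample tags are hypothetical, and the actual voting rule used by the library is not described in this document.

```python
from collections import Counter
from typing import List, Tuple

Tagged = List[Tuple[str, str]]


def majority_vote(predictions: List[Tagged]) -> Tagged:
    """Combine per-token tags from several models by majority vote."""
    voted = []
    for token_votes in zip(*predictions):
        token = token_votes[0][0]
        tags = Counter(tag for _, tag in token_votes)
        # most_common orders ties by first occurrence, so earlier models win ties.
        voted.append((token, tags.most_common(1)[0][0]))
    return voted


# Hypothetical outputs from three tagging models for the same string.
predictions = [
    [('Husein', 'PROPN'), ('makan', 'VERB'), ('ayam', 'NOUN')],
    [('Husein', 'PROPN'), ('makan', 'NOUN'), ('ayam', 'NOUN')],
    [('Husein', 'NOUN'), ('makan', 'VERB'), ('ayam', 'NOUN')],
]
print(majority_vote(predictions))
```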

malaya.stack.predict_stack(models, strings, aggregate=<function gmean>, **kwargs)[source]#

Stacking for predictive models.

Parameters
  • models (List[Callable]) – list of models.

  • strings (List[str]) –

  • aggregate (Callable, optional (default=scipy.stats.mstats.gmean)) – Aggregate function.

Returns

result

Return type

dict
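The default `aggregate` is a geometric mean, which pulls the combined probability towards the lower of the individual predictions. The sketch below illustrates that aggregation step only, using `statistics.geometric_mean` from the standard library as a stand-in for `scipy.stats.mstats.gmean`; the helper `aggregate_predictions` and the probabilities are hypothetical.

```python
from statistics import geometric_mean
from typing import Callable, Dict, List


def aggregate_predictions(
    per_model: List[Dict[str, float]],
    aggregate: Callable = geometric_mean,
) -> Dict[str, float]:
    """Aggregate each label's probability across models."""
    labels = per_model[0].keys()
    return {label: aggregate([p[label] for p in per_model]) for label in labels}


# Hypothetical probabilities from two models for one string.
per_model = [
    {'negative': 0.1, 'positive': 0.9},
    {'negative': 0.4, 'positive': 0.4},
]
combined = aggregate_predictions(per_model)
print(combined)
```

Passing a different callable, such as an arithmetic mean or `max`, changes how disagreement between models is resolved.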

malaya.stem#

malaya.stem.available_deep_model()[source]#

List available stemmer deep models.

malaya.stem.naive()[source]#

Load stemming model that naively strips prefixes and suffixes (startswith and endswith) using regex patterns.

Returns

result

Return type

malaya.stem.Naive class

malaya.stem.sastrawi()[source]#

Load stemming model using Sastrawi, this also includes lemmatization.

Returns

result

Return type

malaya.stem.Sastrawi class

malaya.stem.deep_model(model='base', quantized=False, **kwargs)[source]#

Load LSTM + Bahdanau Attention stemming model, BPE level (YouTokenToMe 1000 vocab size). This model also includes lemmatization.

Parameters
  • model (str, optional (default='base')) – Check available models at malaya.stem.available_deep_model().

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result

Return type

malaya.stem.DeepStemmer class

class malaya.stem.DeepStemmer[source]#
greedy_decoder(string)[source]#

Stem a string using greedy decoder, this also includes lemmatization.

Parameters

string (str) –

Returns

result

Return type

str

beam_decoder(string)[source]#

Stem a string using beam decoder, this also includes lemmatization.

Parameters

string (str) –

Returns

result

Return type

str

predict(string, beam_search=False)[source]#

Stem a string, this also includes lemmatization.

Parameters
  • string (str) –

  • beam_search (bool, optional (default=False)) – If True, use beam search decoder, else use greedy decoder.

Returns

result

Return type

str

stem_word(word, beam_search=False, **kwargs)[source]#

Stem a word, this also includes lemmatization.

Parameters

word (str) –

Returns

result

Return type

str

stem(string, beam_search=False)[source]#

Stem a string, this also includes lemmatization.

Parameters
  • string (str) –

  • beam_search (bool, optional (default=False)) – If True, use beam search decoder, else use greedy decoder.

Returns

result

Return type

str

class malaya.stem.Sastrawi[source]#
stem_word(word, **kwargs)[source]#

Stem a word using Sastrawi, this also includes lemmatization.

Parameters

word (str) –

Returns

result

Return type

str

stem(string)[source]#

Stem a string using Sastrawi, this also includes lemmatization.

Parameters

string (str) –

Returns

result

Return type

str

class malaya.stem.Naive[source]#
stem_word(word, **kwargs)[source]#

Stem a word using Regex pattern.

Parameters

word (str) –

Returns

result

Return type

str

stem(string)[source]#

Stem a string using Regex pattern.

Parameters

string (str) –

Returns

result

Return type

str

malaya.subjectivity#

malaya.subjectivity.available_transformer()[source]#

List available transformer subjective analysis models.

malaya.subjectivity.multinomial(**kwargs)[source]#

Load multinomial subjectivity model.

Parameters

validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.

Returns

result

Return type

malaya.model.ml.Bayes class

malaya.subjectivity.transformer(model='bert', quantized=False, **kwargs)[source]#

Load Transformer subjectivity model.

Parameters
  • model (str, optional (default='bert')) – Check available models at malaya.subjectivity.available_transformer().

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if bert in model, will return malaya.model.bert.BinaryBERT.

  • if xlnet in model, will return malaya.model.xlnet.BinaryXLNET.

  • if fastformer in model, will return malaya.model.fastformer.BinaryFastFormer.

Return type

model

malaya.syllable#

malaya.syllable.available_deep_model()[source]#

List available syllable tokenizer deep models.

malaya.syllable.rules(**kwargs)[source]#

Load rules based syllable tokenizer. Originally from https://github.com/fahadh4ilyas/syllable_splitter/blob/master/SyllableSplitter.py, with these improvements:

  • improved cuaca double vowel ua based on https://en.wikipedia.org/wiki/Comparison_of_Indonesian_and_Standard_Malay#Syllabification

  • improved rans double consonant ns based on https://www.semanticscholar.org/paper/Syllabification-algorithm-based-on-syllable-rules-Musa-Kadir/a819f255f066ae0fd7a30b3534de41da37d04ea1

  • improved au and ai double vowels.

Returns

result

Return type

malaya.syllable.Tokenizer class

malaya.syllable.deep_model(model='base', quantized=False, **kwargs)[source]#

Load LSTM + Bahdanau Attention syllable tokenizer model, BPE level (YouTokenToMe 300 vocab size).

Parameters
  • model (str, optional (default='base')) – Check available models at malaya.syllable.available_deep_model().

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result

Return type

malaya.syllable.DeepSyllable class

class malaya.syllable.Tokenizer[source]#
tokenize(string)[source]#

Tokenize string into multiple strings using syllable patterns. Examples from https://www.semanticscholar.org/paper/Syllabification-algorithm-based-on-syllable-rules-Musa-Kadir/a819f255f066ae0fd7a30b3534de41da37d04ea1/figure/0:

  • ‘cuaca’ -> [‘cua’, ‘ca’]

  • ‘insurans’ -> [‘in’, ‘su’, ‘rans’]

  • ‘praktikal’ -> [‘prak’, ‘ti’, ‘kal’]

  • ‘strategi’ -> [‘stra’, ‘te’, ‘gi’]

  • ‘ayam’ -> [‘a’, ‘yam’]

  • ‘anda’ -> [‘an’, ‘da’]

  • ‘hantu’ -> [‘han’, ‘tu’]

Parameters

string (str) –

Returns

result

Return type

List[str]

class malaya.syllable.DeepSyllable[source]#
tokenize(string, beam_search=False)[source]#

Tokenize string into multiple strings using deep learning.

Parameters
  • string (str) –

  • beam_search (bool, optional (default=False)) – If True, use beam search decoder, else use greedy decoder.

Returns

result

Return type

List[str]

malaya.tatabahasa#

malaya.tatabahasa.describe_tagging()[source]#

Describe kesalahan tatabahasa supported. Full description at https://tatabahasabm.tripod.com/tata/salahtata.htm

malaya.tatabahasa.available_transformer()[source]#

List available transformer tagging models.

malaya.tatabahasa.available_huggingface()[source]#

List available huggingface models.

malaya.tatabahasa.transformer(model='base', quantized=False, **kwargs)[source]#

Load Malaya transformer encoder-decoder + tagging model to correct a kesalahan tatabahasa text.

Parameters
  • model (str, optional (default='base')) – Check available models at malaya.tatabahasa.available_transformer().

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result

Return type

malaya.model.tf.Tatabahasa class

malaya.tatabahasa.huggingface(model='mesolitica/finetune-tatabahasa-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]#

Load HuggingFace model to fix kesalahan tatabahasa.

Parameters
  • model (str, optional (default='mesolitica/finetune-tatabahasa-t5-small-standard-bahasa-cased')) – Check available models at malaya.tatabahasa.available_huggingface().

  • force_check (bool, optional (default=True)) – Force check model one of malaya model. Set to False if you have your own huggingface model.

Returns

result

Return type

malaya.torch_model.huggingface.Tatabahasa

malaya.tokenizer#

class malaya.tokenizer.Tokenizer[source]#
tokenize(string, lowercase=False)[source]#

Tokenize string into words.

Parameters
  • string (str) –

  • lowercase (bool, optional (default=False)) –

Returns

result

Return type

List[str]

class malaya.tokenizer.SentenceTokenizer[source]#
tokenize(string, minimum_length=5)[source]#

Tokenize string into multiple strings.

Parameters
  • string (str) –

  • minimum_length (int, optional (default=5)) – minimum length for a string to be treated as a sentence, default 5 characters.

Returns

result

Return type

List[str]
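The effect of `minimum_length` can be pictured as a filter applied after splitting. The sketch below is a simplified stand-in under stated assumptions: it splits only on `.`, `!` or `?` followed by whitespace and keeps fragments with at least `minimum_length` characters. The function `split_sentences` is hypothetical and the library's actual splitting rules are likely more involved.

```python
import re
from typing import List


def split_sentences(string: str, minimum_length: int = 5) -> List[str]:
    """Split on sentence-ending punctuation and drop very short fragments."""
    # Split after '.', '!' or '?' when followed by whitespace.
    candidates = re.split(r'(?<=[.!?])\s+', string.strip())
    return [c for c in candidates if len(c) >= minimum_length]


print(split_sentences('Saya suka makan ayam. Ok. Dia pun suka makan daging.'))
```

In this sketch the three-character fragment `Ok.` is discarded because it falls below the five-character threshold.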

malaya.toxicity#

malaya.toxicity.available_transformer()[source]#

List available transformer toxicity analysis models.

malaya.toxicity.multinomial(**kwargs)[source]#

Load multinomial toxicity model.

Returns

result

Return type

malaya.model.ml.MultilabelBayes class

malaya.toxicity.transformer(model='xlnet', quantized=False, **kwargs)[source]#

Load Transformer toxicity model.

Parameters
  • model (str, optional (default='xlnet')) –

    Model architecture supported. Allowed values:

    • 'bert' - Google BERT BASE parameters.

    • 'tiny-bert' - Google BERT TINY parameters.

    • 'albert' - Google ALBERT BASE parameters.

    • 'tiny-albert' - Google ALBERT TINY parameters.

    • 'xlnet' - Google XLNET BASE parameters.

    • 'alxlnet' - Malaya ALXLNET BASE parameters.

    • 'fastformer' - FastFormer BASE parameters.

    • 'tiny-fastformer' - FastFormer TINY parameters.

  • quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if bert in model, will return malaya.model.bert.SigmoidBERT.

  • if xlnet in model, will return malaya.model.xlnet.SigmoidXLNET.

  • if fastformer in model, will return malaya.model.fastformer.SigmoidFastFormer.

Return type

model
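
The "if … in model" wording of the Returns list reads as a substring test on the model name. The sketch below mirrors that reading for the eight documented names; resolve_model_class is a hypothetical helper, and the library may dispatch differently internally.

```python
def resolve_model_class(model: str) -> str:
    """Illustrative only: map a model name to the class path listed under Returns."""
    # 'albert' and 'tiny-albert' contain 'bert'; 'alxlnet' contains 'xlnet'.
    if 'bert' in model:
        return 'malaya.model.bert.SigmoidBERT'
    if 'xlnet' in model:
        return 'malaya.model.xlnet.SigmoidXLNET'
    if 'fastformer' in model:
        return 'malaya.model.fastformer.SigmoidFastFormer'
    raise ValueError(f'unsupported model: {model}')


for name in ('tiny-albert', 'alxlnet', 'tiny-fastformer'):
    print(name, '->', resolve_model_class(name))
```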

malaya.transformer#

malaya.transformer.available_transformer()[source]#

List available transformer models.

malaya.transformer.available_huggingface()[source]#

List available huggingface models.

malaya.transformer.load(model='electra', pool_mode='last', **kwargs)[source]#

Load transformer model.

Parameters
  • model (str, optional (default='electra')) – Check available models at malaya.transformer.available_transformer().

  • pool_mode (str, optional (default='last')) –

    Pooling applied to the model logits. Only used when model is 'xlnet' or 'alxlnet'. Allowed values:

    • 'last' - last of the sequence.

    • 'first' - first of the sequence.

    • 'mean' - mean of the sequence.

    • 'attn' - attention of the sequence.

Returns

result – List of model classes:

  • if bert in model, will return malaya.transformers.bert.Model.

  • if xlnet in model, will return malaya.transformers.xlnet.Model.

  • if albert in model, will return malaya.transformers.albert.Model.

  • if electra in model, will return malaya.transformers.electra.Model.

Return type

model

malaya.transformer.huggingface(model='mesolitica/electra-base-generator-bahasa-cased', force_check=True, **kwargs)[source]#

Load transformer model.

Parameters
  • model (str, optional (default='mesolitica/electra-base-generator-bahasa-cased')) – Check available models at malaya.transformer.available_huggingface().

  • force_check (bool, optional (default=True)) – Check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.

malaya.true_case#

malaya.true_case.available_transformer()[source]#

List available transformer models.

malaya.true_case.available_huggingface()[source]#

List available huggingface models.

malaya.true_case.transformer(model='base', quantized=False, **kwargs)[source]#

Load transformer encoder-decoder model to True Case.

Parameters
  • model (str, optional (default='base')) – Check available models at malaya.true_case.available_transformer().

  • quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result

Return type

malaya.model.tf.TrueCase class

malaya.true_case.huggingface(model='mesolitica/finetune-true-case-t5-tiny-standard-bahasa-cased', force_check=True, **kwargs)[source]#

Load HuggingFace model to true case.

Parameters
  • model (str, optional (default='mesolitica/finetune-true-case-t5-tiny-standard-bahasa-cased')) – Check available models at malaya.true_case.available_huggingface().

  • force_check (bool, optional (default=True)) – Check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.

Returns

result

Return type

malaya.torch_model.huggingface.Generator

malaya.true_case.probability(language_model)[source]#

Use language model to True Case.

Parameters

language_model (Callable) – must be an object with a score method.

Returns

result

Return type

malaya.true_case.TrueCase_LM class

malaya.word2num#

malaya.word2num.word2num(string)[source]#

Translate from string to number, eg ‘kesepuluh’ -> 10.

Parameters

string (str) –

Returns

result

Return type

int / float
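
A toy version of the ‘kesepuluh’ -> 10 example, assuming only the cardinals one to ten and the ordinal prefix 'ke'. The lookup table and the name toy_word2num are illustrative; the real function handles far more vocabulary.

```python
# Tiny illustrative lookup table of Malay cardinals, one to ten.
CARDINALS = {
    'satu': 1, 'dua': 2, 'tiga': 3, 'empat': 4, 'lima': 5,
    'enam': 6, 'tujuh': 7, 'lapan': 8, 'sembilan': 9, 'sepuluh': 10,
}


def toy_word2num(string: str) -> int:
    """Hypothetical helper: convert a single Malay number word to an int."""
    word = string.strip().lower()
    if word in CARDINALS:
        return CARDINALS[word]
    # Ordinals such as 'kesepuluh' are formed with the prefix 'ke'.
    if word.startswith('ke') and word[2:] in CARDINALS:
        return CARDINALS[word[2:]]
    raise ValueError(f'cannot convert {string!r} to a number')


print(toy_word2num('kesepuluh'))  # prints 10
```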

malaya.wordvector#

malaya.wordvector.available_wordvector()[source]#

List available word vector models.

malaya.wordvector.load(model='wikipedia', **kwargs)[source]#

Return malaya.wordvector.WordVector object.

Parameters

model (str, optional (default='wikipedia')) – Check available models at malaya.wordvector.available_wordvector().

Returns

  • vocabulary (indices dictionary for vector.)

  • vector (np.array, 2D.)

class malaya.wordvector.WordVector(embed_matrix, dictionary, **kwargs)[source]#
get_vector_by_name(word, soft=False, topn_soft=5)[source]#

get vector based on string.

Parameters
  • word (str) –

  • soft (bool, (default=False)) – if True, a word not in the dictionary is replaced with the word that has the nearest Jaro-Winkler ratio. If False, an exception is raised when the word is not in the dictionary.

  • topn_soft (int, (default=5)) – if the word is not found in the dictionary, return the topn_soft most similar words using Jaro-Winkler.

Returns

vector

Return type

np.array, 1D
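
The soft lookup can be sketched with the standard library. This is an approximation under stated assumptions: difflib's similarity ratio is swapped in for Jaro-Winkler, plain lists stand in for the embedding matrix, and get_vector is a hypothetical helper rather than the method itself.

```python
import difflib
from typing import Dict, List


def get_vector(word: str, dictionary: Dict[str, int], vectors: List[List[float]],
               soft: bool = False, topn_soft: int = 5) -> List[float]:
    """Look up a vector, optionally falling back to the most similar known word."""
    if word in dictionary:
        return vectors[dictionary[word]]
    # difflib ratio is a stand-in here; the documented method uses Jaro-Winkler.
    similar = difflib.get_close_matches(word, list(dictionary), n=topn_soft, cutoff=0.0)
    if not soft or not similar:
        raise KeyError(f'{word!r} not in dictionary, similar words: {similar}')
    # soft=True: replace the unknown word with its closest match.
    return vectors[dictionary[similar[0]]]


dictionary = {'makan': 0, 'minum': 1}
vectors = [[1.0, 0.0], [0.0, 1.0]]
print(get_vector('makn', dictionary, vectors, soft=True))  # vector of 'makan'
```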

tree_plot(labels, figsize=(7, 7), annotate=True)[source]#

plot a tree plot based on output from calculator / n_closest / analogy.

Parameters
  • labels (list) – output from calculator / n_closest / analogy.

  • visualize (bool) – if True, it will render plt.show, else return data.

  • figsize (tuple, (default=(7, 7))) – figure size for plot.

Returns

  • embed (np.array, 2D.)

  • labelled (labels for X / Y axis.)

scatter_plot(labels, centre=None, figsize=(7, 7), plus_minus=25, handoff=5e-05)[source]#

plot a scatter plot based on output from calculator / n_closest / analogy.

Parameters
  • labels (list) – output from calculator / n_closest / analogy

  • centre (str, (default=None)) – centre label; if a str, it is annotated in red.

  • figsize (tuple, (default=(7, 7))) – figure size for plot.

Returns

tsne

Return type

np.array, 2D.

batch_calculator(equations, num_closest=5, return_similarity=False)[source]#

batch calculator parser for word2vec using tensorflow.

Parameters
  • equations (list of str) – Eg, ‘[(mahathir + najib) - rosmah]’

  • num_closest (int, (default=5)) – number of words closest to the result.

Returns

word_list

Return type

list of nearest words

calculator(equation, num_closest=5, metric='cosine', return_similarity=True)[source]#

calculator parser for word2vec.

Parameters
  • equation (str) – Eg, ‘(mahathir + najib) - rosmah’

  • num_closest (int, (default=5)) – number of words closest to the result.

  • metric (str, (default='cosine')) – vector distance algorithm.

  • return_similarity (bool, (default=True)) – if True, also return a value between 0 and 1 representing the distance.

Returns

word_list

Return type

list of nearest words

batch_n_closest(words, num_closest=5, return_similarity=False, soft=True)[source]#

find nearest words based on a batch of words using Tensorflow.

Parameters
  • words (list) – Eg, [‘najib’,’anwar’]

  • num_closest (int, (default=5)) – number of words closest to the result.

  • return_similarity (bool, (default=False)) – if True, also return a value between 0 and 1 representing the distance.

  • soft (bool, (default=True)) – if True, a word not in the dictionary will be replaced with nearest JaroWinkler ratio. if False, it will throw an exception if a word not in the dictionary.

Returns

word_list

Return type

list of nearest words

n_closest(word, num_closest=5, metric='cosine', return_similarity=True)[source]#

find nearest words based on a word.

Parameters
  • word (str) – Eg, ‘najib’

  • num_closest (int, (default=5)) – number of words closest to the result.

  • metric (str, (default='cosine')) – vector distance algorithm.

  • return_similarity (bool, (default=True)) – if True, also return a value between 0 and 1 representing the distance.

Returns

word_list

Return type

list of nearest words
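
A minimal sketch of nearest-word search with the 'cosine' metric, assuming a plain dict of toy 2D embeddings. The function mirrors the documented parameters but is not the library's code; the vectors are invented.

```python
import math
from typing import Dict, List


def cosine_similarity(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def n_closest(word: str, embeddings: Dict[str, List[float]],
              num_closest: int = 5, return_similarity: bool = True):
    """Rank every other word by cosine similarity to `word`."""
    query = embeddings[word]
    scored = [(other, cosine_similarity(query, vector))
              for other, vector in embeddings.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    top = scored[:num_closest]
    return top if return_similarity else [other for other, _ in top]


toy = {'raja': [1.0, 0.0], 'sultan': [0.9, 0.1], 'nasi': [0.0, 1.0]}
print(n_closest('raja', toy, num_closest=1, return_similarity=False))  # ['sultan']
```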

analogy(a, b, c, num=1, metric='cosine')[source]#

analogy calculation, vb - va + vc.

Parameters
  • a (str) –

  • b (str) –

  • c (str) –

  • num (int, (default=1)) –

  • metric (str, (default='cosine')) – vector distance algorithm.

Returns

word_list

Return type

list of nearest words.
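
The vb - va + vc arithmetic can be worked through on toy vectors. The embeddings below are invented so that the analogy resolves cleanly, and excluding a, b and c from the candidates is an assumption of this sketch.

```python
import math
from typing import Dict, List


def cosine_similarity(u: List[float], v: List[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def analogy(a: str, b: str, c: str, embeddings: Dict[str, List[float]],
            num: int = 1) -> List[str]:
    """Return the `num` words closest to vb - va + vc."""
    va, vb, vc = embeddings[a], embeddings[b], embeddings[c]
    target = [y - x + z for x, y, z in zip(va, vb, vc)]
    # Assumption: the three input words are not valid answers.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    candidates.sort(key=lambda w: cosine_similarity(target, embeddings[w]), reverse=True)
    return candidates[:num]


toy = {
    'lelaki': [1.0, 0.0], 'perempuan': [1.0, 1.0],
    'raja': [3.0, 0.0], 'permaisuri': [3.0, 1.0], 'nasi': [0.0, 1.0],
}
print(analogy('lelaki', 'perempuan', 'raja', toy))  # ['permaisuri']
```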

project_2d(start, end)[source]#

project word2vec into 2d dimension.

Parameters
  • start (int) –

  • end (int) –

Returns

  • embed_2d (TSNE decomposition)

  • word_list (words in between start and end.)

network(word, num_closest=8, depth=4, min_distance=0.5, iteration=300, figsize=(15, 15), node_color='#72bbd0', node_factor=50)[source]#

plot a social network based on word given

Parameters
  • word (str) – centre of social network.

  • num_closest (int, (default=8)) – number of words closest to the node.

  • depth (int, (default=4)) – depth of social network. A deeper network is more expensive to calculate, O(num_closest ** depth).

  • min_distance (float, (default=0.5)) – minimum distance among nodes. Increase the value to increase the distance among nodes.

  • iteration (int, (default=300)) – number of loops to train the social network to fit min_distance.

  • figsize (tuple, (default=(15, 15))) – figure size for plot.

  • node_color (str, (default='#72bbd0')) – color for nodes.

  • node_factor (int, (default=50)) – size factor for depth nodes. Increasing this value increases node sizes based on depth.

Returns

G

Return type

networkx graph object
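
To make the O(num_closest ** depth) cost concrete, the sketch below counts nodes in the worst case, assuming every node expands into num_closest new neighbours with no duplicates. expanded_nodes is a hypothetical helper, and real graphs are usually smaller because neighbours repeat.

```python
def expanded_nodes(num_closest: int = 8, depth: int = 4) -> int:
    """Upper bound on nodes visited: 1 + k + k**2 + ... + k**depth."""
    return sum(num_closest ** level for level in range(depth + 1))


print(expanded_nodes(8, 4))  # 1 + 8 + 64 + 512 + 4096 = 4681
print(expanded_nodes(8, 5))  # one more level adds 8**5 = 32768 nodes
```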

malaya.model.alignment#

class malaya.model.alignment.Eflomal[source]#
align(source, target, model=3, score_model=0, n_samplers=3, length=1.0, null_prior=0.2, lowercase=True, debug=False, **kwargs)[source]#

align text using eflomal, https://github.com/robertostling/eflomal/blob/master/align.py

Parameters
  • source (List[str]) –

  • target (List[str]) –

  • model (int, optional (default=3)) – Model (1 = IBM1, 2 = IBM1+HMM, 3 = IBM1+HMM+fertility).

  • score_model (int, optional (default=0)) – (1 = IBM1, 2 = IBM1+HMM, 3 = IBM1+HMM+fertility).

  • n_samplers (int, optional (default=3)) – Number of independent samplers to run.

  • length (float, optional (default=1.0)) – Relative number of sampling iterations.

  • null_prior (float, optional (default=0.2)) – Prior probability of NULL alignment.

  • lowercase (bool, optional (default=True)) – lowercase during searching priors.

  • debug (bool, optional (default=False)) – debug eflomal binary.

Returns

result

Return type

Dict[List[List[Tuple]]]

class malaya.model.alignment.HuggingFace[source]#
align(source, target, align_layer=8, threshold=0.001)[source]#

align text using softmax output layers.

Parameters
  • source (List[str]) –

  • target (List[str]) –

  • align_layer (int, optional (default=8)) – transformer layer-k to choose for embedding output.

  • threshold (float, optional (default=1e-3)) – minimum probability to assume as alignment.

Returns

result

Return type

List[List[Tuple]]
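
One common way to turn token similarities into alignments is to apply softmax in both directions and keep pairs whose probability exceeds threshold both ways. Whether this class follows exactly that recipe is an assumption; the sketch works on a hand-made similarity matrix instead of real layer embeddings.

```python
import math
from typing import List, Tuple


def softmax(row: List[float]) -> List[float]:
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def align(similarity: List[List[float]], threshold: float = 1e-3) -> List[Tuple[int, int]]:
    """similarity[i][j] scores source token i against target token j."""
    source_to_target = [softmax(row) for row in similarity]
    # Transpose, then normalise over source tokens for each target token.
    target_to_source = [softmax(list(column)) for column in zip(*similarity)]
    pairs = []
    for i, row in enumerate(source_to_target):
        for j, probability in enumerate(row):
            if probability > threshold and target_to_source[j][i] > threshold:
                pairs.append((i, j))
    return pairs


print(align([[10.0, 0.0], [0.0, 10.0]]))  # [(0, 0), (1, 1)]
```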

malaya.model.bert#

class malaya.model.bert.BinaryBERT[source]#
vectorize(strings, method='first')[source]#

vectorize list of strings.

Parameters
  • strings (List[str]) –

  • method (str, optional (default='first')) –

    Vectorization layer supported. Allowed values:

    • 'last' - vector from last sequence.

    • 'first' - vector from first sequence.

    • 'mean' - average vectors from all sequences.

    • 'word' - average vectors based on tokens.

Returns

result

Return type

np.array
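
The 'first', 'last' and 'mean' options describe how per-token vectors are reduced to one vector per string. A minimal sketch on nested lists follows; 'word' is omitted because it depends on how the tokenizer splits words into sub-tokens, and pool is a hypothetical helper.

```python
from typing import List


def pool(token_vectors: List[List[float]], method: str = 'first') -> List[float]:
    """Reduce a sequence of token vectors to a single vector."""
    if method == 'first':
        return list(token_vectors[0])
    if method == 'last':
        return list(token_vectors[-1])
    if method == 'mean':
        count = len(token_vectors)
        # Average each dimension across all tokens.
        return [sum(column) / count for column in zip(*token_vectors)]
    raise ValueError(f'unsupported method: {method}')


tokens = [[1.0, 2.0], [3.0, 4.0]]
print(pool(tokens, 'first'), pool(tokens, 'last'), pool(tokens, 'mean'))
```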

predict(strings, add_neutral=True)[source]#

classify list of strings.

Parameters
  • strings (List[str]) –

  • add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.

Returns

result

Return type

List[str]

predict_proba(strings, add_neutral=True)[source]#

classify list of strings and return probability.

Parameters
  • strings (List[str]) –

  • add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.

Returns

result

Return type

List[dict[str, float]]

predict_words(string, method='last', bins_size=0.05, visualization=True, **kwargs)[source]#

classify words.

Parameters
  • string (str) –

  • method (str, optional (default='last')) –

    Attention layer supported. Allowed values:

    • 'last' - attention from last layer.

    • 'first' - attention from first layer.

    • 'mean' - average attentions from all layers.

  • bins_size (float, optional (default=0.05)) – default bins size for word distribution histogram.

  • visualization (bool, optional (default=True)) – If True, it will open the visualization dashboard.

Returns

result

Return type

dict

class malaya.model.bert.MulticlassBERT[source]#
vectorize(strings, method='first')[source]#

vectorize list of strings.

Parameters
  • strings (List[str]) –

  • method (str, optional (default='first')) –

    Vectorization layer supported. Allowed values:

    • 'last' - vector from last sequence.

    • 'first' - vector from first sequence.

    • 'mean' - average vectors from all sequences.

    • 'word' - average vectors based on tokens.

Returns

result

Return type

np.array

predict(strings)[source]#

classify list of strings.

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

predict_proba(strings)[source]#

classify list of strings and return probability.

Parameters

strings (List[str]) –

Returns

result

Return type

List[dict[str, float]]

predict_words(string, method='last', bins_size=0.05, visualization=True, **kwargs)[source]#

classify words.

Parameters
  • string (str) –

  • method (str, optional (default='last')) –

    Attention layer supported. Allowed values:

    • 'last' - attention from last layer.

    • 'first' - attention from first layer.

    • 'mean' - average attentions from all layers.

  • bins_size (float, optional (default=0.05)) – default bins size for word distribution histogram.

  • visualization (bool, optional (default=True)) – If True, it will open the visualization dashboard.

Returns

result

Return type

dict

class malaya.model.bert.SigmoidBERT[source]#
vectorize(strings, method='first')[source]#

vectorize list of strings.

Parameters
  • strings (List[str]) –

  • method (str, optional (default='first')) –

    Vectorization layer supported. Allowed values:

    • 'last' - vector from last sequence.

    • 'first' - vector from first sequence.

    • 'mean' - average vectors from all sequences.

    • 'word' - average vectors based on tokens.

Returns

result

Return type

np.array

predict(strings)[source]#

classify list of strings.

Parameters

strings (List[str]) –

Returns

result

Return type

List[List[str]]

predict_proba(strings)[source]#

classify list of strings and return probability.

Parameters

strings (List[str]) –

Returns

result

Return type

List[dict[str, float]]

class malaya.model.bert.SiameseBERT[source]#
vectorize(strings)[source]#

Vectorize list of strings.

Parameters

strings (List[str]) –

Returns

result

Return type

np.array

predict_proba(strings_left, strings_right)[source]#

calculate similarity for two different batches of texts.

Parameters
  • strings_left (List[str]) –

  • strings_right (List[str]) –

Returns

list

Return type

list of float

heatmap(strings, visualize=True, annotate=True, figsize=(7, 7))[source]#

plot a heatmap based on output from similarity

Parameters
  • strings (list of str) – list of strings.

  • visualize (bool) – if True, it will render plt.show, else return data.

  • figsize (tuple, (default=(7, 7))) – figure size for plot.

Returns

result – list of results

Return type

list

class malaya.model.bert.TaggingBERT[source]#
vectorize(string)[source]#

vectorize a string.

Parameters

string (List[str]) –

Returns

result

Return type

np.array

analyze(string)[source]#

Analyze a string.

Parameters

string (str) –

Returns

result

Return type

{‘words’: List[str], ‘tags’: [{‘text’: ‘text’, ‘type’: ‘location’, ‘score’: 1.0, ‘beginOffset’: 0, ‘endOffset’: 1}]}

predict(string)[source]#

Tag a string.

Parameters

string (str) –

Returns

result

Return type

Tuple[str, str]

class malaya.model.bert.DependencyBERT[source]#
vectorize(string)[source]#

vectorize a string.

Parameters

string (List[str]) –

Returns

result

Return type

np.array

predict(string)[source]#

Tag a string.

Parameters

string (str) –

Returns

result

Return type

Tuple

class malaya.model.bert.ZeroshotBERT[source]#
vectorize(strings, labels, method='first')[source]#

vectorize a string.

Parameters
  • strings (List[str]) –

  • labels (List[str]) –

  • method (str, optional (default='first')) –

    Vectorization layer supported. Allowed values:

    • 'last' - vector from last sequence.

    • 'first' - vector from first sequence.

    • 'mean' - average vectors from all sequences.

    • 'word' - average vectors based on tokens.

Returns

result

Return type

np.array

predict_proba(strings, labels)[source]#

classify list of strings and return probability.

Parameters
  • strings (List[str]) –

  • labels (List[str]) –

Returns

list

Return type

list of float

malaya.model.bigbird#

class malaya.model.bigbird.MulticlassBigBird[source]#
vectorize(strings, method='first')[source]#

vectorize list of strings.

Parameters
  • strings (List[str]) –

  • method (str, optional (default='first')) –

    Vectorization layer supported. Allowed values:

    • 'last' - vector from last sequence.

    • 'first' - vector from first sequence.

    • 'mean' - average vectors from all sequences.

    • 'word' - average vectors based on tokens.

Returns

result

Return type

np.array

predict(strings)[source]#

classify list of strings.

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

predict_proba(strings)[source]#

classify list of strings and return probability.

Parameters

strings (List[str]) –

Returns

result

Return type

List[dict[str, float]]

class malaya.model.bigbird.Translation[source]#
greedy_decoder(strings)[source]#

translate list of strings.

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

class malaya.model.bigbird.Summarization[source]#
greedy_decoder(strings, temperature=0.0, postprocess=False, **kwargs)[source]#

Summarize strings using greedy decoder.

Parameters
  • strings (List[str]) –

  • temperature (float, (default=0.0)) – logits * -log(random.uniform) * temperature.

  • postprocess (bool, optional (default=False)) – If True, will filter generated sentences using ROUGE score and remove international news publishers.

Returns

result

Return type

List[str]

nucleus_decoder(strings, top_p=0.7, temperature=0.1, postprocess=False, **kwargs)[source]#

Summarize strings using nucleus decoder.

Parameters
  • strings (List[str]) –

  • top_p (float, (default=0.7)) – cumulative distribution and cut off as soon as the CDF exceeds top_p.

  • temperature (float, (default=0.1)) – logits * -log(random.uniform) * temperature.

  • postprocess (bool, optional (default=False)) – If True, will filter generated sentences using ROUGE score and remove international news publishers.

Returns

result

Return type

List[str]
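
The top_p description ("cut off as soon as the CDF exceeds top_p") can be sketched on a plain probability list. This shows only the filtering step of nucleus sampling, not the decoder; nucleus_filter is a hypothetical helper.

```python
from typing import Dict, List


def nucleus_filter(probabilities: List[float], top_p: float = 0.7) -> Dict[int, float]:
    """Keep the most likely tokens until their cumulative mass exceeds top_p."""
    order = sorted(range(len(probabilities)), key=lambda i: probabilities[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probabilities[index]
        if cumulative > top_p:  # cut off as soon as the CDF exceeds top_p
            break
    # Renormalise so the kept tokens form a distribution to sample from.
    total = sum(probabilities[i] for i in kept)
    return {i: probabilities[i] / total for i in kept}


print(nucleus_filter([0.5, 0.3, 0.15, 0.05], top_p=0.7))  # keeps tokens 0 and 1
```

A lower top_p keeps fewer candidate tokens, which makes the output less varied.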

malaya.model.extractive_summarization#

class malaya.model.extractive_summarization.SKLearn[source]#
word_level(corpus, isi_penting=None, window_size=10, important_words=10, **kwargs)[source]#

Summarize list of strings / string on word level.

Parameters
  • corpus (str / List[str]) –

  • isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.

  • window_size (int, (default=10)) – window size for each word.

  • important_words (int, (default=10)) – number of important words.

Returns

dict

Return type

{‘top-words’, ‘cluster-top-words’, ‘score’}

sentence_level(corpus, isi_penting=None, top_k=3, important_words=10, **kwargs)[source]#

Summarize list of strings / string on sentence level.

Parameters
  • corpus (str / List[str]) –

  • isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.

  • top_k (int, (default=3)) – number of summarized strings.

  • important_words (int, (default=10)) – number of important words.

Returns

dict

Return type

{‘summary’, ‘top-words’, ‘cluster-top-words’, ‘score’}

class malaya.model.extractive_summarization.Doc2Vec[source]#
word_level(corpus, isi_penting=None, window_size=10, aggregation=<function mean>, soft=False, **kwargs)[source]#

Summarize list of strings / string on word level.

Parameters
  • corpus (str / List[str]) –

  • isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.

  • window_size (int, (default=10)) – window size for each word.

  • aggregation (Callable, optional (default=numpy.mean)) – Aggregation method for Doc2Vec.

  • soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the word that has the nearest Jaro-Winkler ratio. If False, an embedding full of zeros is returned for it.

Returns

dict

Return type

{‘score’}

sentence_level(corpus, isi_penting=None, top_k=3, aggregation=<function mean>, soft=False, **kwargs)[source]#

Summarize list of strings / string on sentence level.

Parameters
  • corpus (str / List[str]) –

  • isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.

  • top_k (int, (default=3)) – number of summarized strings.

  • aggregation (Callable, optional (default=numpy.mean)) – Aggregation method for Doc2Vec.

  • soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the word that has the nearest Jaro-Winkler ratio. If False, an embedding full of zeros is returned for it.

Returns

dict

Return type

{‘summary’, ‘score’}

class malaya.model.extractive_summarization.Encoder[source]#

malaya.model.huggingface#

class malaya.model.huggingface.Generator[source]#
generate(strings, **kwargs)[source]#

Generate texts from the input.

Parameters
  • strings (List[str]) –

  • **kwargs – keyword arguments passed to the HuggingFace generate method.

Returns

result

Return type

List[str]

malaya.model.ml#

class malaya.model.ml.MulticlassBayes[source]#
predict(strings)[source]#

classify list of strings.

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

predict_proba(strings)[source]#

classify list of strings and return probability.

Parameters

strings (List[str]) –

Returns

result

Return type

List[dict[str, float]]

class malaya.model.ml.BinaryBayes[source]#
predict(strings, add_neutral=True)[source]#

classify list of strings.

Parameters
  • strings (List[str]) –

  • add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.

Returns

result

Return type

List[str]

predict_proba(strings, add_neutral=True)[source]#

classify list of strings and return probability.

Parameters
  • strings (List[str]) –

  • add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.

Returns

result

Return type

List[dict[str, float]]

class malaya.model.ml.MultilabelBayes[source]#
predict(strings)[source]#

classify list of strings.

Parameters

strings (List[str]) –

Returns

result

Return type

List[List[str]]

predict_proba(strings)[source]#

classify list of strings and return probability.

Parameters

strings (list) –

Returns

result

Return type

List[dict[str, float]]

malaya.model.pegasus#

class malaya.model.pegasus.Summarization[source]#
greedy_decoder(strings, temperature=0.0, postprocess=False, **kwargs)[source]#

Summarize strings using greedy decoder.

Parameters
  • strings (List[str]) –

  • temperature (float, (default=0.0)) – logits * -log(random.uniform) * temperature.

  • postprocess (bool, optional (default=False)) – If True, will filter generated sentences using ROUGE score and remove international news publishers.

Returns

result

Return type

List[str]

nucleus_decoder(strings, top_p=0.7, temperature=0.2, postprocess=False, **kwargs)[source]#

Summarize strings using nucleus decoder.

Parameters
  • strings (List[str]) –

  • top_p (float, (default=0.7)) – cumulative distribution and cut off as soon as the CDF exceeds top_p.

  • temperature (float, (default=0.2)) – logits * -log(random.uniform) * temperature.

  • postprocess (bool, optional (default=False)) – If True, will filter generated sentences using ROUGE score and remove international news publishers.

Returns

result

Return type

List[str]

malaya.model.rules#

class malaya.model.rules.LanguageDict[source]#
predict(words, acceptable_ms_label=['malay', 'ind'], acceptable_en_label=['eng', 'manglish'], ignore_capital=False, use_is_malay=True, predict_mandarin=False)[source]#

Predict [EN, MS, OTHERS, CAPITAL, NOT_LANG] on word level. This method assumes the string is already tokenized.

Parameters
  • words (List[str]) –

  • acceptable_ms_label (List[str], optional (default = ['malay', 'ind'])) – accept labels from language detection model to assume a word is MS.

  • acceptable_en_label (List[str], optional (default = ['eng', 'manglish'])) – accept labels from language detection model to assume a word is EN.

  • ignore_capital (bool, optional (default=False)) – if True, will predict language for capital word.

  • use_is_malay (bool, optional (default=True)) – if True, will predict MS word using malaya.dictionary.is_malay, else use language detection model.

  • predict_mandarin (bool, optional (default=False)) – if True, will slide the string to match pinyin dict.

Returns

result

Return type

List[str]

malaya.model.t5#

class malaya.model.t5.Summarization[source]#
greedy_decoder(strings, postprocess=False, **kwargs)[source]#

Summarize strings.

Parameters
  • strings (List[str]) –

  • postprocess (bool, optional (default=False)) – If True, will filter generated sentences using ROUGE score and remove international news publishers.

Returns

result

Return type

List[str]

class malaya.model.t5.Generator[source]#
greedy_decoder(strings)[source]#

Generate a long text from the given isi penting (key points). The decoder is a greedy decoder with beam width 1 and alpha 0.5.

Parameters

strings (List[str]) –

Returns

result

Return type

str

class malaya.model.t5.Paraphrase[source]#
greedy_decoder(strings)[source]#

paraphrase strings.

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

class malaya.model.t5.Spell[source]#
greedy_decoder(strings)[source]#

spelling correction for strings.

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

class malaya.model.t5.Segmentation[source]#
greedy_decoder(strings)[source]#

text segmentation.

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

malaya.model.tf#

class malaya.model.tf.DeepLang[source]#
predict(strings)[source]#

classify list of strings.

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

predict_proba(strings)[source]#

classify list of strings and return probability.

Parameters

strings (List[str]) –

Returns

result

Return type

List[dict[str, float]]

class malaya.model.tf.Translation[source]#
greedy_decoder(strings)[source]#

translate list of strings.

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

beam_decoder(strings, beam_size=3, temperature=0.5)[source]#

translate list of strings using beam decoder. Currently only the noisy models support the beam_size and temperature parameters.

Parameters
  • strings (List[str]) –

  • beam_size (int, optional (default=3)) –

  • temperature (float, optional (default=0.5)) –

Returns

result

Return type

List[str]

class malaya.model.tf.Constituency[source]#
vectorize(string)[source]#

vectorize a string.

Parameters

string (List[str]) –

Returns

result

Return type

np.array

parse_nltk_tree(string)[source]#

Parse a string into an NLTK Tree. To make it useful, make sure you have already installed tkinter.

Parameters

string (str) –

Returns

result

Return type

nltk.Tree object

parse_tree(string)[source]#

Parse a string into string treebank format.

Parameters

string (str) –

Returns

result

Return type

malaya.text.trees.InternalTreebankNode class

class malaya.model.tf.TrueCase[source]#
greedy_decoder(strings)[source]#

True case strings using greedy decoder. Example, “saya nak makan di us makanan di sana sedap” -> “Saya nak makan di US, makanan di sana sedap.”

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

beam_decoder(strings)[source]#

True case strings using beam decoder, beam width size 3, alpha 0.5 . Example, “saya nak makan di us makanan di sana sedap” -> “Saya nak makan di US, makanan di sana sedap.”

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

class malaya.model.tf.Segmentation[source]#
greedy_decoder(strings)[source]#

Segment strings using greedy decoder. Example, “sayasygkan negarasaya” -> “saya sygkan negara saya”

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

beam_decoder(strings)[source]#

Segment strings using beam decoder, beam width size 3, alpha 0.5 . Example, “sayasygkan negarasaya” -> “saya sygkan negara saya”

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

class malaya.model.tf.Paraphrase[source]#
greedy_decoder(strings, **kwargs)[source]#

Paraphrase strings using greedy decoder.

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

beam_decoder(strings, **kwargs)[source]#

Paraphrase strings using beam decoder, beam width size 3, alpha 0.5 .

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

nucleus_decoder(strings, top_p=0.7, **kwargs)[source]#

Paraphrase strings using nucleus sampling.

Parameters
  • strings (List[str]) –

  • top_p (float, (default=0.7)) – cumulative distribution and cut off as soon as the CDF exceeds top_p.

Returns

result

Return type

List[str]

class malaya.model.tf.Tatabahasa[source]#
greedy_decoder(strings)[source]#

Fix grammatical errors (kesalahan tatabahasa).

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

class malaya.model.tf.SQUAD[source]#
predict(paragraph_text, question_texts, doc_stride=128, max_query_length=64, max_answer_length=64, n_best_size=20)[source]#

Predict Span from questions given a paragraph.

Parameters
  • paragraph_text (str) –

  • question_texts (List[str]) – List of questions. Results depend heavily on the casing of the questions.

  • doc_stride (int, optional (default=128)) – striding size to split a paragraph into multiple texts.

  • max_query_length (int, optional (default=64)) – Maximum length of question tokens.

  • max_answer_length (int, optional (default=64)) – Maximum length of answer tokens.

Returns

result

Return type

List[{‘text’: ‘text’, ‘start’: 0, ‘end’: 1}]
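
doc_stride controls how a long paragraph is cut into overlapping pieces. A minimal sketch of such a sliding window follows; the max_length parameter and the function name are assumptions for illustration and are not part of the documented signature.

```python
from typing import List


def sliding_windows(tokens: List[str], max_length: int, doc_stride: int = 128) -> List[List[str]]:
    """Split tokens into windows of up to max_length, advancing doc_stride each time."""
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + max_length])
        if start + max_length >= len(tokens):
            break  # the last window already reaches the end of the paragraph
        start += min(doc_stride, max_length)
    return windows


tokens = [str(i) for i in range(10)]
for window in sliding_windows(tokens, max_length=4, doc_stride=2):
    print(window)
```

With doc_stride smaller than max_length, consecutive windows overlap, so an answer span cut by one window boundary can still appear whole in the next window.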

vectorize(strings, method='first')[source]#

vectorize list of strings.

Parameters
  • strings (List[str]) –

  • method (str, optional (default='first')) –

    Vectorization layer supported. Allowed values:

    • 'last' - vector from last sequence.

    • 'first' - vector from first sequence.

    • 'mean' - average vectors from all sequences.

    • 'word' - average vectors based on tokens.

Returns

result

Return type

np.array

class malaya.model.tf.GPT2[source]#
generate(string, maxlen=256, n_samples=1, temperature=1.0, top_k=0, top_p=0.0)[source]#

generate a text given an initial string.

Parameters
  • string (str) –

  • maxlen (int, optional (default=256)) – length of sentence to generate.

  • n_samples (int, optional (default=1)) – size of output.

  • temperature (float, optional (default=1.0)) – temperature value, should be between 0 and 1.

  • top_k (int, optional (default=0)) – top-k in nucleus sampling selection.

  • top_p (float, optional (default=0.0)) – top-p in nucleus sampling selection, should be between 0 and 1. If top_p == 0, top_k is used. If top_p == 0 and top_k == 0, a greedy decoder is used.

Returns

result

Return type

List[str]
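
The precedence between top_p and top_k described above can be written as a small decision function. choose_strategy is a hypothetical helper that only names the branch taken; it does no decoding.

```python
def choose_strategy(top_k: int = 0, top_p: float = 0.0) -> str:
    """Return which decoding branch the documented rules select."""
    if top_p > 0:
        return 'top-p'   # nucleus sampling takes precedence when top_p is set
    if top_k > 0:
        return 'top-k'   # used only when top_p == 0
    return 'greedy'      # top_p == 0 and top_k == 0


print(choose_strategy())                     # greedy
print(choose_strategy(top_k=40))             # top-k
print(choose_strategy(top_k=40, top_p=0.9))  # top-p
```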

class malaya.model.tf.Seq2SeqLSTM[source]#
greedy_decoder(strings)[source]#

Convert to target strings using greedy decoder.

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

beam_decoder(strings)[source]#

Convert to target strings using beam decoder.

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

predict(strings, beam_search=False)[source]#

Convert to target strings.

Parameters
  • strings (List[str]) –

  • beam_search (bool, optional (default=False)) – If True, use beam search decoder, else use greedy decoder.

Returns

result

Return type

List[str]
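To illustrate why the two decoders can disagree, here is a toy sketch. MODEL is a made-up table of next-token probabilities, and greedy_decode and beam_decode are hypothetical helpers, not the Seq2SeqLSTM internals.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical toy model: maps a prefix to next-token probabilities.
MODEL: Dict[Tuple[str, ...], Dict[str, float]] = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}


def greedy_decode(steps: int) -> Tuple[str, ...]:
    """Pick the single most likely token at every step."""
    prefix: Tuple[str, ...] = ()
    for _ in range(steps):
        dist = MODEL[prefix]
        prefix += (max(dist, key=lambda token: dist[token]),)
    return prefix


def beam_decode(steps: int, beam_size: int) -> Tuple[str, ...]:
    """Keep the `beam_size` best prefixes by cumulative log-probability."""
    beams: List[Tuple[float, Tuple[str, ...]]] = [(0.0, ())]
    for _ in range(steps):
        expanded = [
            (score + math.log(p), prefix + (token,))
            for score, prefix in beams
            for token, p in MODEL[prefix].items()
        ]
        beams = sorted(expanded, key=lambda item: item[0], reverse=True)[:beam_size]
    return beams[0][1]


greedy_result = greedy_decode(2)
beam_result = beam_decode(2, beam_size=2)
```

In this toy table greedy decoding commits to "a" first and ends with probability 0.33, while a beam of two keeps "b" alive and finds the sequence with probability 0.36.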

class malaya.model.tf.JawiRumi[source]#
greedy_decoder(strings)[source]#

Convert list of jawi strings to rumi strings. ‘ايسو بيل تنب دباوا ك كابينيت - صيفالدين’ -> ‘isu bil tnb dibawa ke kabinet - saifuddin’

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

beam_decoder(strings, beam_size=3, temperature=0.5)[source]#

Convert list of jawi strings to rumi strings. ‘ايسو بيل تنب دباوا ك كابينيت - صيفالدين’ -> ‘isu bil tnb dibawa ke kabinet - saifuddin’

Parameters
  • strings (List[str]) –

  • beam_size (int, optional (default=3)) –

  • temperature (float, optional (default=0.5)) –

Returns

result

Return type

List[str]

class malaya.model.tf.RumiJawi[source]#
greedy_decoder(strings)[source]#

Convert list of rumi strings to jawi strings. ‘isu bil tnb dibawa ke kabinet - saifuddin’ -> ‘ايسو بيل تنب دباوا ك كابينيت - صيفالدين’

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

beam_decoder(strings, beam_size=3, temperature=0.5)[source]#

Convert list of rumi strings to jawi strings. ‘isu bil tnb dibawa ke kabinet - saifuddin’ -> ‘ايسو بيل تنب دباوا ك كابينيت - صيفالدين’

Parameters
  • strings (List[str]) –

  • beam_size (int, optional (default=3)) –

  • temperature (float, optional (default=0.5)) –

Returns

result

Return type

List[str]

malaya.model.xlnet#

class malaya.model.xlnet.BinaryXLNET[source]#
vectorize(strings, method='first')[source]#

vectorize list of strings.

Parameters
  • strings (List[str]) –

  • method (str, optional (default='first')) –

    Vectorization layer supported. Allowed values:

    • 'last' - vector from last sequence.

    • 'first' - vector from first sequence.

    • 'mean' - average vectors from all sequences.

    • 'word' - average vectors based on tokens.

Returns

result

Return type

np.array

predict(strings, add_neutral=True)[source]#

classify list of strings.

Parameters
  • strings (List[str]) –

  • add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.

Returns

result

Return type

List[str]

predict_proba(strings, add_neutral=True)[source]#

classify list of strings and return probability.

Parameters
  • strings (List[str]) –

  • add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.

Returns

result

Return type

List[dict[str, float]]
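A short sketch of consuming a List[dict[str, float]] result. The label names and scores below are invented for illustration, and top_label with its minimum argument is a hypothetical helper rather than a Malaya function.

```python
from typing import Dict, List, Optional

# Made-up probabilities shaped like the documented return type.
probabilities: List[Dict[str, float]] = [
    {"negative": 0.1, "positive": 0.7, "neutral": 0.2},
    {"negative": 0.55, "positive": 0.15, "neutral": 0.3},
]


def top_label(scores: Dict[str, float], minimum: float = 0.0) -> Optional[str]:
    """Return the highest-scoring label, or None if it scores below `minimum`."""
    label = max(scores, key=lambda name: scores[name])
    return label if scores[label] >= minimum else None


labels = [top_label(scores) for scores in probabilities]
strict = [top_label(scores, minimum=0.6) for scores in probabilities]
```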

predict_words(string, method='last', bins_size=0.05, visualization=True, **kwargs)[source]#

classify words.

Parameters
  • string (str) –

  • method (str, optional (default='last')) –

    Attention layer supported. Allowed values:

    • 'last' - attention from last layer.

    • 'first' - attention from first layer.

    • 'mean' - average attentions from all layers.

  • bins_size (float, optional (default=0.05)) – default bins size for word distribution histogram.

  • visualization (bool, optional (default=True)) – If True, it will open the visualization dashboard.

Returns

result

Return type

dict

class malaya.model.xlnet.MulticlassXLNET[source]#
vectorize(strings, method='first')[source]#

vectorize list of strings.

Parameters
  • strings (List[str]) –

  • method (str, optional (default='first')) –

    Vectorization layer supported. Allowed values:

    • 'last' - vector from last sequence.

    • 'first' - vector from first sequence.

    • 'mean' - average vectors from all sequences.

    • 'word' - average vectors based on tokens.

Returns

result

Return type

np.array

predict(strings)[source]#

classify list of strings.

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

predict_proba(strings)[source]#

classify list of strings and return probability.

Parameters

strings (List[str]) –

Returns

result

Return type

List[dict[str, float]]

predict_words(string, method='last', bins_size=0.05, visualization=True, **kwargs)[source]#

classify words.

Parameters
  • string (str) –

  • method (str, optional (default='last')) –

    Attention layer supported. Allowed values:

    • 'last' - attention from last layer.

    • 'first' - attention from first layer.

    • 'mean' - average attentions from all layers.

  • bins_size (float, optional (default=0.05)) – default bins size for word distribution histogram.

  • visualization (bool, optional (default=True)) – If True, it will open the visualization dashboard.

Returns

result

Return type

dict

class malaya.model.xlnet.SigmoidXLNET[source]#
vectorize(strings, method='first')[source]#

vectorize list of strings.

Parameters
  • strings (List[str]) –

  • method (str, optional (default='first')) –

    Vectorization layer supported. Allowed values:

    • 'last' - vector from last sequence.

    • 'first' - vector from first sequence.

    • 'mean' - average vectors from all sequences.

    • 'word' - average vectors based on tokens.

Returns

result

Return type

np.array

predict(strings)[source]#

classify list of strings.

Parameters

strings (List[str]) –

Returns

result

Return type

List[List[str]]

predict_proba(strings)[source]#

classify list of strings and return probability.

Parameters

strings (List[str]) –

Returns

result

Return type

List[dict[str, float]]

predict_words(string, method='last', bins_size=0.05, visualization=True, **kwargs)[source]#

classify words.

Parameters
  • string (str) –

  • method (str, optional (default='last')) –

    Attention layer supported. Allowed values:

    • 'last' - attention from last layer.

    • 'first' - attention from first layer.

    • 'mean' - average attentions from all layers.

  • bins_size (float, optional (default=0.05)) – default bins size for word distribution histogram.

  • visualization (bool, optional (default=True)) – If True, it will open the visualization dashboard.

Returns

result

Return type

dict

class malaya.model.xlnet.SiameseXLNET[source]#
vectorize(strings)[source]#

Vectorize list of strings.

Parameters

strings (List[str]) –

Returns

result

Return type

np.array

predict_proba(strings_left, strings_right)[source]#

calculate similarity for two different batch of texts.

Parameters
  • strings_left (List[str]) –

  • strings_right (List[str]) –

Returns

result

Return type

List[float]

heatmap(strings, visualize=True, annotate=True, figsize=(7, 7))[source]#

plot a heatmap based on output from similarity

Parameters
  • strings (list of str) – list of strings.

  • visualize (bool) – if True, it will render plt.show, else return data.

  • figsize (tuple, (default=(7, 7))) – figure size for plot.

Returns

result – list of results

Return type

list

class malaya.model.xlnet.TaggingXLNET[source]#
vectorize(string)[source]#

vectorize a string.

Parameters

string (str) –

Returns

result

Return type

np.array

analyze(string)[source]#

Analyze a string.

Parameters

string (str) –

Returns

result

Return type

{‘words’: List[str], ‘tags’: [{‘text’: ‘text’, ‘type’: ‘location’, ‘score’: 1.0, ‘beginOffset’: 0, ‘endOffset’: 1}]}
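A minimal sketch of reading the documented analyze structure, for example to collect entity texts per type. The sample dictionary is invented to match the shape above, and group_by_type is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List

# Made-up sample shaped like the documented return type.
analysis = {
    "words": ["Kuala", "Lumpur", "hujan"],
    "tags": [
        {"text": "Kuala Lumpur", "type": "location", "score": 1.0, "beginOffset": 0, "endOffset": 1},
        {"text": "hujan", "type": "OTHER", "score": 1.0, "beginOffset": 2, "endOffset": 2},
    ],
}


def group_by_type(result: dict) -> Dict[str, List[str]]:
    """Collect the text of every tag under its type, keeping the original order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for tag in result["tags"]:
        grouped[tag["type"]].append(tag["text"])
    return dict(grouped)


entities = group_by_type(analysis)
```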

predict(string)[source]#

Tag a string.

Parameters

string (str) –

Returns

result

Return type

Tuple[str, str]

class malaya.model.xlnet.DependencyXLNET[source]#
vectorize(string)[source]#

vectorize a string.

Parameters

string (str) –

Returns

result

Return type

np.array

predict(string)[source]#

Tag a string.

Parameters

string (str) –

Returns

result

Return type

Tuple

class malaya.model.xlnet.ZeroshotXLNET[source]#
vectorize(strings, labels, method='first')[source]#

vectorize list of strings.

Parameters
  • strings (List[str]) –

  • labels (List[str]) –

  • method (str, optional (default='first')) –

    Vectorization layer supported. Allowed values:

    • 'last' - vector from last sequence.

    • 'first' - vector from first sequence.

    • 'mean' - average vectors from all sequences.

    • 'word' - average vectors based on tokens.

Returns

result

Return type

np.array

predict_proba(strings, labels)[source]#

classify list of strings and return probability.

Parameters
  • strings (List[str]) –

  • labels (List[str]) –

Returns

list

Return type

list of float

malaya.torch_model.gpt2_lm#

class malaya.torch_model.gpt2_lm.LM[source]#
score(string, log=True, reduce='prod')[source]#

score a string.

Parameters
  • string (str) –

  • log (bool, optional (default=True)) – if True, return the score in log space, else return its exponent.

  • reduce (str, optional (default='prod')) – aggregate function.

Returns

result

Return type

float
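One way to picture log and reduce is as an aggregation over per-token log-probabilities. This is a sketch under assumptions: the aggregate helper is hypothetical, and the 'mean' branch is an illustrative extra, since this reference only names the default 'prod'.

```python
import math
from typing import List


def aggregate(token_log_probs: List[float], log: bool = True, reduce: str = "prod") -> float:
    """Combine per-token log-probabilities into one sentence score."""
    if not token_log_probs:
        raise ValueError("need at least one token score")
    if reduce == "prod":
        # A product of probabilities is a sum of log-probabilities.
        total = sum(token_log_probs)
    elif reduce == "mean":
        # Assumed option for illustration: average log-probability per token.
        total = sum(token_log_probs) / len(token_log_probs)
    else:
        raise ValueError(f"unsupported reduce: {reduce}")
    return total if log else math.exp(total)


token_log_probs = [math.log(0.5), math.log(0.25)]
log_score = aggregate(token_log_probs)
plain_score = aggregate(token_log_probs, log=False)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.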

malaya.torch_model.huggingface#

class malaya.torch_model.huggingface.Generator[source]#
generate(strings, return_generate=False, **kwargs)[source]#

Generate texts from the input.

Parameters
Returns

result

Return type

List[str]

alignment(source, target)[source]#

align texts using cross attention and dtw-python.

Parameters
  • source (List[str]) –

  • target (List[str]) –

Returns

result

Return type

Dict

class malaya.torch_model.huggingface.Prefix[source]#
generate(string, **kwargs)[source]#

Generate texts from the input.

Parameters
Returns

result

Return type

List[str]

class malaya.torch_model.huggingface.Paraphrase[source]#
generate(strings, postprocess=True, **kwargs)[source]#

Generate texts from the input.

Parameters
Returns

result

Return type

List[str]

class malaya.torch_model.huggingface.Summarization[source]#
generate(strings, postprocess=True, n=2, threshold=0.1, reject_similarity=0.85, **kwargs)[source]#

Generate texts from the input.

Parameters
  • strings (List[str]) –

  • postprocess (bool, optional (default=True)) – If True, will filter generated sentences using ROUGE score and remove biased mentions of international news publishers from the output.

  • n (int, optional (default=2)) – N-gram size of the ROUGE score used for filtering.

  • threshold (float, optional (default=0.1)) – minimum ROUGE-N score required to select a sentence.

  • reject_similarity (float, optional (default=0.85)) – reject similar sentences while maintaining position.

  • **kwargs (vector arguments passed to huggingface generate method.) – Read more at https://huggingface.co/docs/transformers/main_classes/text_generation

Returns

result

Return type

List[str]
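The n and threshold parameters can be pictured with a simplified n-gram overlap filter. This sketch is a set-based stand-in for ROUGE-N, not the scoring Malaya uses, and the helper names are hypothetical.

```python
from typing import List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_score(candidate: str, source: str, n: int = 2) -> float:
    """Fraction of the candidate's n-grams that also occur in the source."""
    candidate_grams = ngrams(candidate, n)
    if not candidate_grams:
        return 0.0
    return len(candidate_grams & ngrams(source, n)) / len(candidate_grams)


def filter_sentences(sentences: List[str], source: str, n: int = 2, threshold: float = 0.1) -> List[str]:
    """Keep sentences whose overlap with the source reaches `threshold`."""
    return [s for s in sentences if overlap_score(s, source, n) >= threshold]


source = "kerajaan mengumumkan bantuan baharu untuk petani"
generated = ["kerajaan mengumumkan bantuan", "baca lagi di laman web"]
kept = filter_sentences(generated, source)
```

A sentence with no bigram in common with the input, such as a hallucinated call to action, scores 0.0 and is dropped.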

class malaya.torch_model.huggingface.Similarity[source]#
predict_proba(strings_left, strings_right)[source]#

calculate similarity for two different batch of texts.

Parameters
  • strings_left (List[str]) –

  • strings_right (List[str]) –

Returns

list

Return type

List[float]

class malaya.torch_model.huggingface.ZeroShotClassification[source]#
predict_proba(strings, labels, prefix='ayat ini berkaitan tentang', multilabel=True)[source]#

classify list of strings and return probability.

Parameters
  • strings (List[str]) –

  • labels (List[str]) –

  • prefix (str, optional (default='ayat ini berkaitan tentang')) – prefix of labels to zero shot. Playing around with prefix can get better results.

  • multilabel (bool, optional (default=True)) – if True, labels are scored independently, so the sum of label probabilities can be more than 1.0.

Returns

list

Return type

List[Dict[str, float]]
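The effect of multilabel on the returned probabilities can be sketched with independent sigmoid scores versus a softmax over labels. This is an assumption made for illustration; the reference does not state how Malaya computes the scores, and to_probabilities and the logits are hypothetical.

```python
import math
from typing import Dict


def to_probabilities(logits: Dict[str, float], multilabel: bool = True) -> Dict[str, float]:
    """Turn per-label logits into probabilities."""
    if multilabel:
        # Each label is scored on its own, so the values need not sum to 1.
        return {label: 1.0 / (1.0 + math.exp(-z)) for label, z in logits.items()}
    # Single-label case: softmax makes the labels compete for a total of 1.
    peak = max(logits.values())
    exps = {label: math.exp(z - peak) for label, z in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


logits = {"sukan": 2.0, "politik": 1.0}
independent = to_probabilities(logits, multilabel=True)
exclusive = to_probabilities(logits, multilabel=False)
```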

class malaya.torch_model.huggingface.ZeroShotNER[source]#
predict(string, tags, minimum_length=2, **kwargs)[source]#

classify entities in a string.

Parameters
Returns

list

Return type

Dict[str, List[str]]

class malaya.torch_model.huggingface.ExtractiveQA[source]#
predict(paragraph_text, question_texts, validate_answers=True, validate_questions=False, minimum_threshold_question=0.05, **kwargs)[source]#

Predict extractive answers from questions given a paragraph.

Parameters
  • paragraph_text (str) –

  • question_texts (List[str]) – List of questions; results depend heavily on the casing of the questions.

  • validate_answers (bool, optional (default=True)) – if True, will check the answer is inside the paragraph.

  • validate_questions (bool, optional (default=False)) – if True, validate that the question is a subset of the paragraph using sklearn.feature_extraction.text.CountVectorizer. It is only useful if paragraph_text and question_texts are in the same language.

  • minimum_threshold_question (float, optional (default=0.05)) – minimum score from cosine_similarity, only useful if validate_questions = True.

  • **kwargs (vector arguments passed to huggingface generate method.) – Read more at https://huggingface.co/docs/transformers/main_classes/text_generation

Returns

result

Return type

List[str]
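The question validation can be pictured as a bag-of-words cosine similarity between the paragraph and the question. The sketch below is a standard-library stand-in for CountVectorizer and cosine_similarity; the helper names are hypothetical and the tokenization only approximates scikit-learn's default.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase word tokens of two or more characters."""
    return Counter(re.findall(r"\w\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def question_is_related(paragraph: str, question: str, minimum_threshold_question: float = 0.05) -> bool:
    """True when the question shares enough vocabulary with the paragraph."""
    score = cosine(bag_of_words(paragraph), bag_of_words(question))
    return score >= minimum_threshold_question


paragraph = "Kuala Lumpur ialah ibu negara Malaysia"
related = question_is_related(paragraph, "Apakah ibu negara Malaysia?")
unrelated = question_is_related(paragraph, "What time is it?")
```

Because the check relies on shared words, a question in a different language from the paragraph scores close to zero, which is why the option is only useful when both are in the same language.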

class malaya.torch_model.huggingface.Transformer[source]#
vectorize(strings, method='last', method_token='first', t5_head_logits=True, **kwargs)[source]#

Vectorize string inputs.

Parameters
  • strings (List[str]) –

  • method (str, optional (default='last')) –

    hidden layers supported. Allowed values:

    • 'last' - last layer.

    • 'first' - first layer.

    • 'mean' - average all layers.

    This is only applicable to non-T5 models.

  • method_token (str, optional (default='first')) –

    token layers supported. Allowed values:

    • 'last' - last token.

    • 'first' - first token.

    • 'mean' - average all tokens.

    Pretrained models are usually trained on the first token for classification tasks. This is only applicable to non-T5 models.

  • t5_head_logits (bool, optional (default=True)) – if True, will take head logits, else, last token. This is only applicable to T5 models.

Returns

result

Return type

np.array

attention(strings, method='last', method_head='mean', t5_attention='cross_attentions', **kwargs)[source]#

Get attention string inputs.

Parameters
  • strings (List[str]) –

  • method (str, optional (default='last')) –

    Attention layer supported. Allowed values:

    • 'last' - attention from last layer.

    • 'first' - attention from first layer.

    • 'mean' - average attentions from all layers.

  • method_head (str, optional (default='mean')) –

    Attention head supported. Allowed values:

    • 'last' - attention from last head.

    • 'first' - attention from first head.

    • 'mean' - average attentions from all heads.

  • t5_attention (str, optional (default='cross_attentions')) –

    attention type for T5 models. Allowed values:

    • 'cross_attentions' - cross attention.

    • 'encoder_attentions' - encoder attention.

    • 'decoder_attentions' - decoder attention.

    This is only applicable to T5 models.

Returns

result

Return type

List[List[Tuple[str, float]]]
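A short sketch of reading the List[List[Tuple[str, float]]] result, here to list the most attended tokens of each input string. The attention values are invented for illustration and top_tokens is a hypothetical helper.

```python
from typing import List, Tuple

# Made-up values shaped like the documented return type:
# one list per input string, one (token, weight) pair per token.
attentions: List[List[Tuple[str, float]]] = [
    [("saya", 0.1), ("suka", 0.6), ("makan", 0.3)],
    [("cuaca", 0.7), ("panas", 0.3)],
]


def top_tokens(result: List[List[Tuple[str, float]]], k: int = 1) -> List[List[str]]:
    """Return the `k` tokens with the highest attention weight for each string."""
    return [
        [token for token, _ in sorted(pairs, key=lambda pair: pair[1], reverse=True)[:k]]
        for pairs in result
    ]


strongest = top_tokens(attentions)
top_two = top_tokens(attentions, k=2)
```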

class malaya.torch_model.huggingface.IsiPentingGenerator[source]#
generate(strings, mode='surat-khabar', remove_html_tags=True, **kwargs)[source]#

Generate a long text given isi penting (key points).

Parameters
  • strings (List[str]) –

  • mode (str, optional (default='surat-khabar')) –

    Mode supported. Allowed values:

    • 'surat-khabar' - news style writing.

    • 'tajuk-surat-khabar' - headline news style writing.

    • 'artikel' - article style writing.

    • 'penerangan-produk' - product description style writing.

    • 'karangan' - school essay (karangan sekolah) style writing.

  • remove_html_tags (bool, optional (default=True)) – Will remove html tags using malaya.text.function.remove_html_tags.

  • **kwargs (vector arguments passed to huggingface generate method.) – Read more at https://huggingface.co/docs/transformers/main_classes/text_generation

Returns

result

Return type

List[str]

class malaya.torch_model.huggingface.Tatabahasa[source]#
generate(strings, **kwargs)[source]#

Fix grammatical errors (kesalahan tatabahasa).

Parameters
Returns

result

Return type

List[Tuple[str, int]]

class malaya.torch_model.huggingface.Normalizer[source]#
generate(strings, **kwargs)[source]#

abstractive text normalization.

Parameters
Returns

result

Return type

List[str]

class malaya.torch_model.huggingface.Keyword[source]#
generate(strings, top_keywords=5, **kwargs)[source]#

Generate texts from the input.

Parameters
Returns

result

Return type

List[str]

malaya.torch_model.mask_lm#

class malaya.torch_model.mask_lm.MLMScorer[source]#
score(string)[source]#

score a string.

Parameters

string (str) –

Returns

result

Return type

float

malaya.transformers.albert#

malaya.transformers.albert.load(model='albert', **kwargs)[source]#

Load albert model.

Parameters

model (str, optional (default='albert')) –

Model architecture supported. Allowed values:

  • 'albert' - base albert-bahasa released by Malaya.

  • 'tiny-albert' - tiny albert-bahasa released by Malaya.

Returns

result

Return type

malaya.transformers.albert.Model class

class malaya.transformers.albert.Model[source]#
vectorize(strings, **kwargs)[source]#

Vectorize string inputs.

Parameters

strings (List[str]) –

Returns

result

Return type

np.array

attention(strings, method='last', **kwargs)[source]#

Get attention string inputs.

Parameters
  • strings (List[str]) –

  • method (str, optional (default='last')) –

    Attention layer supported. Allowed values:

    • 'last' - attention from last layer.

    • 'first' - attention from first layer.

    • 'mean' - average attentions from all layers.

Returns

result

Return type

List[List[Tuple[str, float]]]

visualize_attention(string)[source]#

Visualize attention.

Parameters

string (str) –

malaya.transformers.alxlnet#

malaya.transformers.alxlnet.load(model='alxlnet', pool_mode='last', **kwargs)[source]#

Load alxlnet model.

Parameters
  • model (str, optional (default='alxlnet')) –

    Model architecture supported. Allowed values:

    • 'alxlnet' - XLNET architecture from google + Malaya.

  • pool_mode (str, optional (default='last')) –

    Model logits architecture supported. Allowed values:

    • 'last' - last of the sequence.

    • 'first' - first of the sequence.

    • 'mean' - mean of the sequence.

    • 'attn' - attention of the sequence.

Returns

result

Return type

malaya.transformers.alxlnet.Model class

class malaya.transformers.alxlnet.Model[source]#
vectorize(strings, **kwargs)[source]#

Vectorize string inputs.

Parameters

strings (List[str]) –

Returns

result

Return type

np.array

attention(strings, method='last', **kwargs)[source]#

Get attention string inputs.

Parameters
  • strings (List[str]) –

  • method (str, optional (default='last')) –

    Attention layer supported. Allowed values:

    • 'last' - attention from last layer.

    • 'first' - attention from first layer.

    • 'mean' - average attentions from all layers.

Returns

result

Return type

List[List[Tuple[str, float]]]

visualize_attention(string)[source]#

Visualize attention.

Parameters

string (str) –

malaya.transformers.bert#

malaya.transformers.bert.load(model='base', **kwargs)[source]#

Load bert model.

Parameters

model (str, optional (default='base')) –

Model architecture supported. Allowed values:

  • 'bert' - base bert-bahasa released by Malaya.

  • 'tiny-bert' - tiny bert-bahasa released by Malaya.

Returns

result

Return type

malaya.transformers.bert.Model class

class malaya.transformers.bert.Model[source]#
vectorize(strings, **kwargs)[source]#

Vectorize string inputs.

Parameters

strings (List[str]) –

Returns

result

Return type

np.array

attention(strings, method='last', **kwargs)[source]#

Get attention string inputs.

Parameters
  • strings (List[str]) –

  • method (str, optional (default='last')) –

    Attention layer supported. Allowed values:

    • 'last' - attention from last layer.

    • 'first' - attention from first layer.

    • 'mean' - average attentions from all layers.

Returns

result

Return type

List[List[Tuple[str, float]]]

visualize_attention(string)[source]#

Visualize attention.

Parameters

string (str) –

malaya.transformers.electra#

malaya.transformers.electra.load(model='electra', **kwargs)[source]#

Load electra model.

Parameters

model (str, optional (default='electra')) –

Model architecture supported. Allowed values:

  • 'electra' - base electra-bahasa released by Malaya.

  • 'small-electra' - small electra-bahasa released by Malaya.

Returns

result

Return type

malaya.transformers.electra.Model class

class malaya.transformers.electra.Model[source]#
vectorize(strings, **kwargs)[source]#

Vectorize string inputs.

Parameters

strings (List[str]) –

Returns

result

Return type

np.array

attention(strings, method='last', **kwargs)[source]#

Get attention string inputs.

Parameters
  • strings (List[str]) –

  • method (str, optional (default='last')) –

    Attention layer supported. Allowed values:

    • 'last' - attention from last layer.

    • 'first' - attention from first layer.

    • 'mean' - average attentions from all layers.

Returns

result

Return type

List[List[Tuple[str, float]]]

visualize_attention(string)[source]#

Visualize attention.

Parameters

string (str) –

malaya.transformers.xlnet#

malaya.transformers.xlnet.load(model='xlnet', pool_mode='last', **kwargs)[source]#

Load xlnet model.

Parameters
  • model (str, optional (default='xlnet')) –

    Model architecture supported. Allowed values:

    • 'xlnet' - XLNET architecture from google.

  • pool_mode (str, optional (default='last')) –

    Model logits architecture supported. Allowed values:

    • 'last' - last of the sequence.

    • 'first' - first of the sequence.

    • 'mean' - mean of the sequence.

    • 'attn' - attention of the sequence.

Returns

result

Return type

malaya.transformers.xlnet.Model class

class malaya.transformers.xlnet.Model[source]#
vectorize(strings, **kwargs)[source]#

Vectorize string inputs.

Parameters

strings (List[str]) –

Returns

result

Return type

np.array

attention(strings, method='last', **kwargs)[source]#

Get attention string inputs.

Parameters
  • strings (List[str]) –

  • method (str, optional (default='last')) –

    Attention layer supported. Allowed values:

    • 'last' - attention from last layer.

    • 'first' - attention from first layer.

    • 'mean' - average attentions from all layers.

Returns

result

Return type

List[List[Tuple[str, float]]]

visualize_attention(string)[source]#

Visualize attention.

Parameters

string (str) –