API#

malaya#

malaya.augmentation.abstractive#

malaya.augmentation.abstractive.huggingface(model='mesolitica/translation-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]#

Load HuggingFace model for abstractive text augmentation.

Parameters
  • model (str, optional (default='mesolitica/translation-t5-small-standard-bahasa-cased')) – Check available models at malaya.augmentation.abstractive.available_huggingface.

  • force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.

Returns

result

Return type

malaya.torch_model.huggingface.Translation

malaya.augmentation.rules#

malaya.augmentation.rules.synonym(string, threshold=0.5, top_n=5, **kwargs)[source]#

Augment a string using synonyms, https://github.com/huseinzol05/Malaya-Dataset#90k-synonym

Parameters
  • string (str) – this string input is assumed to be properly tokenized and cleaned.

  • threshold (float, optional (default=0.5)) – probability of randomly selecting a word for replacement.

  • top_n (int, (default=5)) – number of nearest neighbors returned. The length of the returned result should be top_n.

Returns

result

Return type

List[str]
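
A minimal usage sketch of the rule-based synonym augmentation; the sentence is illustrative and the output, a list of augmented strings, is omitted:

>>> import malaya
>>> string = 'saya suka makan ayam dan ikan'
>>> malaya.augmentation.rules.synonym(string, threshold=0.5, top_n=5)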

malaya.augmentation.rules.replace_similar_consonants(word, threshold=0.5, replace_consonants={'b': ['n'], 'd': ['s', 'f'], 'f': ['p'], 'g': ['f', 'h'], 'j': ['k'], 'k': ['l'], 'n': ['m'], 'r': ['t', 'q']})[source]#

Naively replace consonants with other consonants to simulate typos or slang, if the consonant is followed by a vowel.

Parameters
  • word (str) –

  • threshold (float, optional (default=0.5)) –

Returns

result

Return type

List[str]

malaya.augmentation.rules.replace_similar_vowels(word, threshold=0.5, replace_vowels={'a': ['o'], 'i': ['o'], 'o': ['u'], 'u': ['o']})[source]#

Naively replace vowels with other vowels to simulate typos or slang, if the vowel is followed by a consonant.

Parameters
  • word (str) –

  • threshold (float, optional (default=0.5)) –

Returns

result

Return type

str

malaya.augmentation.rules.socialmedia_form(word)[source]#

Augment a word into social media form.

Parameters

word (str) –

Returns

result

Return type

List[str]

malaya.augmentation.rules.vowel_alternate(word, threshold=0.5)[source]#

Augment a word into vowel-alternate form (dropping some vowels).

vowel_alternate(‘singapore’) -> sngpore

vowel_alternate(‘kampung’) -> kmpng

vowel_alternate(‘ayam’) -> aym

Parameters
  • word (str) –

  • threshold (float, optional (default=0.5)) –

Returns

result

Return type

str

malaya.augmentation.rules.kelantanese_form(word)[source]#

Augment a word into Kelantanese form. Examples: ayam -> ayom, otak -> otok, kakak -> kakok, barang -> bare, kembang -> kembe, nyarang -> nyare.

Parameters

word (str) –

Returns

result

Return type

List[str]

malaya.dictionary#

malaya.dictionary.keyword_wiktionary(word, acceptable_lang=['brunei malay', 'malay'])[source]#

Crawl https://en.wiktionary.org/wiki/ to check whether a word is a Malay word.

Parameters
  • word (str) –

  • acceptable_lang (List[str], optional (default=['brunei malay', 'malay'])) – acceptable languages in wiktionary section.

Returns

result

Return type

Dict

malaya.dictionary.keyword_dbp(word, parse=False)[source]#

Crawl https://prpm.dbp.gov.my/cari1?keyword= to check whether a word is a Malay word.

Parameters
  • word (str) –

  • parse (bool, optional (default=False)) – if True, will parse using BeautifulSoup.

Returns

result

Return type

Dict

malaya.dictionary.corpus_dbp(word)[source]#

Crawl http://sbmb.dbp.gov.my/korpusdbp/Search2.aspx to search the corpus for a word.

Parameters

word (str) –

Returns

result

Return type

pandas.core.frame.DataFrame

malaya.dictionary.is_english(word)[source]#

Check whether a word is an English word.

Parameters

word (str) –

Returns

result

Return type

bool

malaya.dictionary.is_malay(word, stemmer=None)[source]#

Check whether a word is a Malay word.

Parameters
  • word (str) –

  • stemmer (Callable, optional (default=None)) – a Callable object, must have stem_word method.

Returns

result

Return type

bool

malaya.dictionary.convert_pinyin(string)[source]#

Convert Mandarin characters to Pinyin form. Original vocab from https://github.com/lxyu/pinyin. 你好 -> ni hao

Parameters

string (str) –

Returns

result

Return type

str
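
A few usage sketches for the dictionary helpers above; the words are illustrative, and only the Pinyin output is shown since it is given in the docstring:

>>> import malaya
>>> malaya.dictionary.is_english('chicken')
>>> malaya.dictionary.is_malay('ayam')
>>> malaya.dictionary.convert_pinyin('你好')
'ni hao'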

malaya.generator.isi_penting#

malaya.generator.isi_penting.huggingface(model='mesolitica/finetune-isi-penting-generator-t5-base-standard-bahasa-cased', force_check=True, **kwargs)[source]#

Load HuggingFace model to generate text based on isi penting (important points).

Parameters
  • model (str, optional (default='mesolitica/finetune-isi-penting-generator-t5-base-standard-bahasa-cased')) – Check available models at malaya.generator.isi_penting.available_huggingface.

  • force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.

Returns

result

Return type

malaya.torch_model.huggingface.IsiPentingGenerator

malaya.keyword.abstractive#

malaya.keyword.abstractive.huggingface(model='mesolitica/finetune-keyword-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]#

Load HuggingFace model for abstractive keyword extraction.

Parameters
  • model (str, optional (default='mesolitica/finetune-keyword-t5-small-standard-bahasa-cased')) – Check available models at malaya.keyword.abstractive.available_huggingface().

  • force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.

Returns

result

Return type

malaya.torch_model.huggingface.Keyword

malaya.keyword.extractive#

malaya.keyword.extractive.rake(string, vocab=None, model=None, vectorizer=None, top_k=5, atleast=1, stopwords=<function get_stopwords>, **kwargs)[source]#

Extract keywords using Rake algorithm.

Parameters
  • string (str) –

  • vocab (List[str], optional (default=None)) – List of important substrings. This will override vectorizer parameter.

  • model (Object, optional (default=None)) – model must have an attention method.

  • vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or, malaya.text.vectorizer.SkipGramCountVectorizer. If None, will generate ngram automatically based on stopwords.

  • top_k (int, optional (default=5)) – return top-k results.

  • atleast (int, optional (default=1)) – at least count appeared in the string to accept as candidate.

  • stopwords (List[str], (default=malaya.text.function.get_stopwords)) – A callable that returns a List[str], or a List[str], or a Tuple[str], for the automatic ngram generator.

Returns

result

Return type

Tuple[float, str]
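
A minimal sketch of RAKE keyword extraction using only the documented parameters; the sentence is illustrative and the scored output is omitted:

>>> import malaya
>>> string = 'Perdana Menteri berkata kerajaan akan terus fokus kepada pemulihan ekonomi rakyat'
>>> malaya.keyword.extractive.rake(string, top_k=5)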

malaya.keyword.extractive.textrank(string, vocab=None, model=None, vectorizer=None, top_k=5, atleast=1, stopwords=<function get_stopwords>, **kwargs)[source]#

Extract keywords using Textrank algorithm.

Parameters
  • string (str) –

  • vocab (List[str], optional (default=None)) – List of important substrings. This will override vectorizer parameter.

  • model (Object, optional (default=None)) – model must have a fit_transform or vectorize method.

  • vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or, malaya.text.vectorizer.SkipGramCountVectorizer. If None, will generate ngram automatically based on stopwords.

  • top_k (int, optional (default=5)) – return top-k results.

  • atleast (int, optional (default=1)) – at least count appeared in the string to accept as candidate.

  • stopwords (List[str], (default=malaya.text.function.get_stopwords)) – A callable that returns a List[str], or a List[str], or a Tuple[str].

Returns

result

Return type

Tuple[float, str]

malaya.keyword.extractive.attention(string, model, vocab=None, vectorizer=None, top_k=5, atleast=1, stopwords=<function get_stopwords>, **kwargs)[source]#

Extract keywords using Attention mechanism.

Parameters
  • string (str) –

  • model (Object) – model must have an attention method.

  • vocab (List[str], optional (default=None)) – List of important substrings. This will override vectorizer parameter.

  • vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or, malaya.text.vectorizer.SkipGramCountVectorizer. If None, will generate ngram automatically based on stopwords.

  • top_k (int, optional (default=5)) – return top-k results.

  • atleast (int, optional (default=1)) – at least count appeared in the string to accept as candidate.

  • stopwords (List[str], (default=malaya.text.function.get_stopwords)) – A callable that returns a List[str], or a List[str], or a Tuple[str].

Returns

result

Return type

Tuple[float, str]

malaya.keyword.extractive.similarity(string, model, vocab=None, vectorizer=None, top_k=5, atleast=1, stopwords=<function get_stopwords>, **kwargs)[source]#

Extract keywords using Sentence embedding VS keyword embedding similarity.

Parameters
  • string (str) –

  • model (Object) – Transformer model or any model that has a vectorize method.

  • vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or, malaya.text.vectorizer.SkipGramCountVectorizer. If None, will generate ngram automatically based on stopwords.

  • top_k (int, optional (default=5)) – return top-k results.

  • atleast (int, optional (default=1)) – at least count appeared in the string to accept as candidate.

  • stopwords (List[str], (default=malaya.text.function.get_stopwords)) – A callable that returns a List[str], or a List[str], or a Tuple[str].

Returns

result

Return type

Tuple[float, str]

malaya.normalizer.rules#

malaya.normalizer.rules.load(speller=None, stemmer=None, **kwargs)[source]#

Load a Normalizer using any spelling correction model.

Parameters
  • speller (Callable, optional (default=None)) – function to correct spelling; must have a correct or normalize_elongated method.

  • stemmer (Callable, optional (default=None)) – function to stem; must have a stem_word method. If a stemmer is provided, kata imbuhan akhir (words with suffixes) will be stemmed more accurately.

Returns

result

Return type

malaya.normalizer.rules.Normalizer class

class malaya.normalizer.rules.Normalizer[source]#
normalize(string, normalize_text=True, normalize_url=False, normalize_email=False, normalize_year=True, normalize_telephone=True, normalize_date=True, normalize_time=True, normalize_emoji=True, normalize_elongated=True, normalize_hingga=True, normalize_pada_hari_bulan=True, normalize_fraction=True, normalize_money=True, normalize_units=True, normalize_percent=True, normalize_ic=True, normalize_number=True, normalize_x_kali=True, normalize_cardinal=True, normalize_ordinal=True, normalize_entity=True, expand_contractions=True, check_english_func=<function is_english>, check_malay_func=<function is_malay>, translator=None, language_detection_word=None, acceptable_language_detection=['EN', 'CAPITAL', 'NOT_LANG'], segmenter=None, text_scorer=None, text_scorer_window=2, not_a_word_threshold=0.0001, dateparser_settings={'TIMEZONE': 'GMT+8'}, **kwargs)[source]#

Normalize a string.

Parameters
  • string (str) –

  • normalize_text (bool, optional (default=True)) – if True, will try to replace shortforms with internal corpus.

  • normalize_url (bool, optional (default=False)) – if True, replace :// with empty and . with dot. https://huseinhouse.com -> https huseinhouse dot com.

  • normalize_email (bool, optional (default=False)) – if True, replace @ with di, . with dot. husein.zol05@gmail.com -> husein dot zol kosong lima di gmail dot com.

  • normalize_year (bool, optional (default=True)) – if True, tahun 1987 -> tahun sembilan belas lapan puluh tujuh. if True, 1970-an -> sembilan belas tujuh puluh an. if False, tahun 1987 -> tahun seribu sembilan ratus lapan puluh tujuh.

  • normalize_telephone (bool, optional (default=True)) – if True, no 012-1234567 -> no kosong satu dua, satu dua tiga empat lima enam tujuh

  • normalize_date (bool, optional (default=True)) – if True, 01/12/2001 -> satu disember dua ribu satu. if True, Jun 2017 -> satu Jun dua ribu tujuh belas. if True, 2017 Jun -> satu Jun dua ribu tujuh belas. if False, 2017 Jun -> 01/06/2017. if False, Jun 2017 -> 01/06/2017.

  • normalize_time (bool, optional (default=True)) – if True, pukul 2.30 -> pukul dua tiga puluh minit. if False, pukul 2.30 -> ‘02:00:00’

  • normalize_emoji (bool, (default=True)) – if True, 🔥 -> emoji api. Loaded from malaya.preprocessing.demoji.

  • normalize_elongated (bool, optional (default=True)) – if True, betuii -> betui.

  • normalize_hingga (bool, optional (default=True)) – if True, 2011 - 2019 -> dua ribu sebelas hingga dua ribu sembilan belas

  • normalize_pada_hari_bulan (bool, optional (default=True)) – if True, pada 10/4 -> pada sepuluh hari bulan empat

  • normalize_fraction (bool, optional (default=True)) – if True, 10 /4 -> sepuluh per empat

  • normalize_money (bool, optional (default=True)) – if True, rm10.4m -> sepuluh juta empat ratus ribu ringgit

  • normalize_units (bool, optional (default=True)) – if True, 61.2 kg -> enam puluh satu perpuluhan dua kilogram

  • normalize_percent (bool, optional (default=True)) – if True, 0.8% -> kosong perpuluhan lapan peratus

  • normalize_ic (bool, optional (default=True)) – if True, 911111-01-1111 -> sembilan satu satu satu satu satu sempang kosong satu sempang satu satu satu satu

  • normalize_number (bool, optional (default=True)) – if True 0123 -> kosong satu dua tiga

  • normalize_x_kali (bool, optional (default=True)) – if True 10x -> ‘sepuluh kali’

  • normalize_cardinal (bool, optional (default=True)) – if True, 123 -> seratus dua puluh tiga

  • normalize_ordinal (bool, optional (default=True)) – if True, ke-123 -> keseratus dua puluh tiga

  • normalize_entity (bool, optional (default=True)) – normalize entities; only affects date, datetime, time and money pattern strings.

  • expand_contractions (bool, optional (default=True)) – expand english contractions.

  • check_english_func (Callable, optional (default=malaya.text.function.is_english)) – function to check whether a word is in the English dictionary; default is malaya.text.function.is_english. This parameter will also be used for Malay text normalization.

  • check_malay_func (Callable, optional (default=malaya.text.function.is_malay)) – function to check whether a word is in the Malay dictionary; default is malaya.text.function.is_malay.

  • translator (Callable, optional (default=None)) – function to translate EN word to MS word.

  • language_detection_word (Callable, optional (default=None)) – function to detect the language of each word to get better translation results.

  • acceptable_language_detection (List[str], optional (default=['EN', 'CAPITAL', 'NOT_LANG'])) – only translate substrings if the results from language_detection_word is in acceptable_language_detection.

  • segmenter (Callable, optional (default=None)) – function to segment words. If provided, it will expand a word, e.g. apaitu -> apa itu.

  • text_scorer (Callable, optional (default=None)) – function to validate uppercase words. If the lowercase score is higher than or equal to the uppercase score, the lowercase form is chosen.

  • text_scorer_window (int, optional (default=2)) – size of lookback and lookforward used to validate uppercase words.

  • not_a_word_threshold (float, optional (default=1e-4)) – assume a word is not a real word if its score is lower than not_a_word_threshold. Only used if the text_scorer parameter is passed.

  • dateparser_settings (Dict, optional (default={'TIMEZONE': 'GMT+8'})) – default dateparser setting, check support settings at https://dateparser.readthedocs.io/en/latest/

Returns

result

Return type

{‘normalize’, ‘date’, ‘money’}
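
A minimal sketch of the normalizer, using the default loader without a speller or stemmer; the returned dict contains the 'normalize', 'date' and 'money' keys described above:

>>> import malaya
>>> normalizer = malaya.normalizer.rules.load()
>>> result = normalizer.normalize('Husein beli rm10.4m barang pada 10/4')
>>> result['normalize']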

malaya.qa.extractive#

malaya.qa.extractive.huggingface(model='mesolitica/finetune-qa-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]#

Load HuggingFace model for extractive question answering.

Parameters
  • model (str, optional (default='mesolitica/finetune-qa-t5-small-standard-bahasa-cased')) – Check available models at malaya.qa.extractive.available_huggingface.

  • force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.

Returns

result

Return type

malaya.torch_model.huggingface.ExtractiveQA

malaya.similarity.doc2vec#

malaya.similarity.doc2vec.vectorizer(v)[source]#

Doc2vec interface for text similarity using Encoder model.

Parameters

v (object) – encoder interface object, e.g. BERT, XLNET; should have a vectorize method.

Returns

result

Return type

malaya.similarity.doc2vec.VectorizerSimilarity

class malaya.similarity.doc2vec.VectorizerSimilarity[source]#
predict_proba(left_strings, right_strings, similarity='cosine')[source]#

Calculate similarity between two different batches of texts.

Parameters
  • left_strings (list of str) –

  • right_strings (list of str) –

  • similarity (str, optional (default='cosine')) –

    similarity supported. Allowed values:

    • 'cosine' - cosine similarity.

    • 'euclidean' - euclidean similarity.

    • 'manhattan' - manhattan similarity.

Returns

result

Return type

List[float]

heatmap(strings, similarity='cosine', visualize=True, annotate=True, figsize=(7, 7))[source]#

Plot a heatmap based on the output from the similarity model.

Parameters
  • strings (list of str) – list of strings.

  • similarity (str, optional (default='cosine')) –

    similarity supported. Allowed values:

    • 'cosine' - cosine similarity.

    • 'euclidean' - euclidean similarity.

    • 'manhattan' - manhattan similarity.

  • visualize (bool) – if True, it will render plt.show, else return data.

  • figsize (tuple, (default=(7, 7))) – figure size for plot.

Returns

result – list of results

Return type

list
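
A minimal sketch of the doc2vec similarity interface. It assumes the transformer loader returns an object with a vectorize method (that method is not documented in this section); the strings are illustrative:

>>> import malaya
>>> encoder = malaya.transformer.huggingface()  # assumed to expose `vectorize`
>>> similarity_model = malaya.similarity.doc2vec.vectorizer(encoder)
>>> similarity_model.predict_proba(['saya suka ayam'], ['saya tak suka ayam'], similarity='cosine')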

malaya.similarity.semantic#

malaya.spelling_correction.jamspell#

malaya.spelling_correction.jamspell.load(model='wiki', **kwargs)[source]#

Load a jamspell Spell Corrector for Malay.

Parameters

model (str, optional (default='wiki')) –

Supported models. Allowed values:

  • 'wiki+news' - Wikipedia + News, 337MB.

  • 'wiki' - Wikipedia, 148MB.

  • 'news' - local news, 215MB.

Returns

result

Return type

malaya.spelling_correction.jamspell.JamSpell class

class malaya.spelling_correction.jamspell.JamSpell[source]#
correct(word, string, index=- 1)[source]#

Correct a word within a text, returning the corrected word.

Parameters
  • word (str) –

  • string (List[str]) – Tokenized string; word must be a word inside string.

  • index (int, optional(default=-1)) – index of word in the string, if -1, will try to use string.index(word).

Returns

result

Return type

str

correct_word(word, string, index=- 1)[source]#

Spell-correct word, and preserve proper upper, lower and title case.

correct_match(match, string, index=- 1)[source]#

Spell-correct word in re.match, and preserve proper upper, lower, title case.

correct_text(text)[source]#

Correct all the words within a text, returning the corrected text.

Parameters

text (str) –

Returns

result

Return type

str

edit_candidates(word, string, index=- 1)[source]#

Generate candidates given a word.

Parameters
  • word (str) –

  • string (str) – Entire string; word must be a word inside string.

  • index (int, optional(default=-1)) – index of word in the string, if -1, will try to use string.index(word).

Returns

result

Return type

List[str]
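
A minimal sketch of the JamSpell corrector; correct expects the tokenized sentence as context, while correct_text takes the raw text:

>>> import malaya
>>> model = malaya.spelling_correction.jamspell.load(model='wiki')
>>> model.correct('suke', ['saya', 'suke', 'makan', 'ayam'])
>>> model.correct_text('saya suke makan ayam')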

malaya.spelling_correction.probability#

class malaya.spelling_correction.probability.Probability(corpus, sp_tokenizer=None, stemmer=None, **kwargs)[source]#

The SpellCorrector extends the functionality of Peter Norvig’s spell-corrector (http://norvig.com/spell-correct.html) and improves it using some algorithms from Normalization of noisy texts in Malaysian online reviews, https://www.researchgate.net/publication/287050449_Normalization_of_noisy_texts_in_Malaysian_online_reviews. Custom vowel augmentation has been added.

P(word)[source]#

Probability of word.

correct(word, score_func=None, **kwargs)[source]#

Most probable spelling correction for word.

Parameters

word (str) –

Returns

result

Return type

str

class malaya.spelling_correction.probability.ProbabilityLM(language_model, corpus, sp_tokenizer=None, stemmer=None, **kwargs)[source]#

The SpellCorrector extends the functionality of Peter Norvig’s spell-corrector (http://norvig.com/spell-correct.html) with a language model, and improves it using some algorithms from Normalization of noisy texts in Malaysian online reviews, https://www.researchgate.net/publication/287050449_Normalization_of_noisy_texts_in_Malaysian_online_reviews. Custom vowel augmentation has been added.

correct(word, string, index=- 1, lookback=3, lookforward=3, **kwargs)[source]#

Correct a word within a text, returning the corrected word.

Parameters
  • word (str) –

  • string (List[str]) – Entire string; word must be a word inside string.

  • index (int, optional (default=-1)) – index of word in the string, if -1, will try to use string.index(word).

  • lookback (int, optional (default=3)) – N words on the left-hand side. If -1, all words on the left-hand side are used. A longer left-hand side takes longer to compute.

  • lookforward (int, optional (default=3)) – N words on the right-hand side. If -1, all words on the right-hand side are used. A longer right-hand side takes longer to compute.

Returns

result

Return type

str

correct_text(text, lookback=3, lookforward=3)[source]#

Correct all the words within a text, returning the corrected text.

Parameters
  • text (str) –

  • lookback (int, optional (default=3)) – N words on the left-hand side. If -1, all words on the left-hand side are used. A longer left-hand side takes longer to compute.

  • lookforward (int, optional (default=3)) – N words on the right-hand side. If -1, all words on the right-hand side are used. A longer right-hand side takes longer to compute.

Returns

result

Return type

str

correct_word(word, string, index=- 1, lookback=3, lookforward=3)[source]#

Spell-correct word, and preserve proper upper, lower and title case.

Parameters
  • word (str) –

  • string (List[str]) – Tokenized string; word must be a word inside string.

  • index (int, optional(default=-1)) – index of word in the string, if -1, will try to use string.index(word).

  • lookback (int, optional (default=3)) – N words on the left-hand side. If -1, all words on the left-hand side are used. A longer left-hand side takes longer to compute.

  • lookforward (int, optional (default=3)) – N words on the right-hand side. If -1, all words on the right-hand side are used. A longer right-hand side takes longer to compute.

Returns

result

Return type

str

correct_match(match, string, index=- 1, lookback=3, lookforward=3)[source]#

Spell-correct word in re.match, and preserve proper upper, lower, title case.

malaya.spelling_correction.probability.load(language_model=None, sentence_piece=False, stemmer=None, additional_words={'la': 100000, 'ni': 100000, 'pun': 100000}, **kwargs)[source]#

Load a Probability Spell Corrector.

Parameters
  • language_model (Callable, optional (default=None)) – If not None, must be an object with a score method.

  • sentence_piece (bool, optional (default=False)) – if True, reduce possible augmentation states using sentence piece.

  • stemmer (Callable, optional (default=None)) – a Callable object, must have stem_word method.

  • additional_words (Dict[str, int], (default={'ni': 100000, 'pun': 100000, 'la': 100000})) – additional bias vocab.

Returns

result – List of model classes:

  • if language_model is passed, will return malaya.spelling_correction.probability.ProbabilityLM.

  • else will return malaya.spelling_correction.probability.Probability.

Return type

model
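
A minimal sketch of the probability-based corrector, using the default loader without a language model; the misspellings are illustrative:

>>> import malaya
>>> corrector = malaya.spelling_correction.probability.load()
>>> corrector.correct('suke')
>>> corrector.correct_text('saya suke makan ayem')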

class malaya.spelling_correction.probability.Spell[source]#
edit_step(word)[source]#

Generate possible combination of an input.

edits2(word)[source]#

All edits that are two edits away from word.

known(words)[source]#

The subset of words that appear in the dictionary of WORDS.

edit_candidates(word)[source]#

Generate candidates given a word.

Parameters

word (str) –

Returns

result

Return type

List[str]

correct_text(text)[source]#

Correct all the words within a text, returning the corrected text.

Parameters

text (str) –

Returns

result

Return type

str

correct_match(match)[source]#

Spell-correct word in re.match, and preserve proper upper, lower, title case.

correct_word(word)[source]#

Spell-correct word, and preserve proper upper, lower and title case.

Parameters

word (str) –

Returns

result

Return type

str

malaya.spelling_correction.spylls#

malaya.spelling_correction.spylls.load(model='libreoffice-pejam', **kwargs)[source]#

Load a spylls Spell Corrector for Malay.

Parameters

model (str, optional (default='libreoffice-pejam')) –

Model spelling correction supported. Allowed values:

Returns

result

Return type

malaya.spelling_correction.spylls.Spylls class

class malaya.spelling_correction.spylls.Spylls[source]#
correct(word)[source]#

Correct a word within a text, returning the corrected word.

Parameters

word (str) –

Returns

result

Return type

str

edit_candidates(word)[source]#

Generate candidates given a word.

Parameters

word (str) –

Returns

result

Return type

List[str]

malaya.spelling_correction.symspell#

class malaya.spelling_correction.symspell.Symspell(model, verbosity, corpus, k=10)[source]#

The SymspellCorrector extends the functionality of symspeller, https://github.com/mammothb/symspellpy, and improves it using some algorithms from Normalization of noisy texts in Malaysian online reviews, https://www.researchgate.net/publication/287050449_Normalization_of_noisy_texts_in_Malaysian_online_reviews. Custom vowel augmentation has been added.

edit_step(word)[source]#

Generate candidates given a word.

Parameters

word (str) –

Returns

result

Return type

{candidate1, candidate2}

edit_candidates(word, get_score=False)[source]#

Generate candidates given a word.

Parameters

word (str) –

Returns

result

Return type

List[str]

correct(word, **kwargs)[source]#

Most probable spelling correction for word.

Parameters

word (str) –

Returns

result

Return type

str

malaya.spelling_correction.symspell.load(max_edit_distance_dictionary=2, prefix_length=7, term_index=0, count_index=1, top_k=10, **kwargs)[source]#

Load a symspell Spell Corrector for Malay.

Returns

result

Return type

malaya.spelling_correction.symspell.Symspell class
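
A minimal sketch of the Symspell corrector using the default loader; the misspelling is illustrative:

>>> import malaya
>>> corrector = malaya.spelling_correction.symspell.load()
>>> corrector.edit_candidates('suke')
>>> corrector.correct('suke')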

malaya.summarization.abstractive#

malaya.summarization.abstractive.huggingface(model='mesolitica/finetune-summarization-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]#

Load HuggingFace model for abstractive summarization.

Parameters
  • model (str, optional (default='mesolitica/finetune-summarization-t5-small-standard-bahasa-cased')) – Check available models at malaya.summarization.abstractive.available_huggingface.

  • force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.

Returns

result

Return type

malaya.torch_model.huggingface.Summarization

malaya.summarization.extractive#

malaya.summarization.extractive.encoder(vectorizer)[source]#

Encoder interface for summarization.

Parameters

vectorizer (object) – encoder interface object, e.g. BERT, XLNET, ALBERT, ALXLNET; should have a vectorize method.

Returns

result

Return type

malaya.model.extractive_summarization.Encoder

malaya.summarization.extractive.sklearn(model, vectorizer)[source]#

sklearn interface for summarization.

Parameters
  • model (object) –

    Should have fit_transform method. Commonly:

    • sklearn.decomposition.TruncatedSVD - LSA algorithm.

    • sklearn.decomposition.LatentDirichletAllocation - LDA algorithm.

  • vectorizer (object) –

    Should have fit_transform method. Commonly:

    • sklearn.feature_extraction.text.TfidfVectorizer - TFIDF algorithm.

    • sklearn.feature_extraction.text.CountVectorizer - Bag-of-Word algorithm.

    • malaya.text.vectorizer.SkipGramCountVectorizer - Skip Gram Bag-of-Word algorithm.

    • malaya.text.vectorizer.SkipGramTfidfVectorizer - Skip Gram TFIDF algorithm.

Returns

result

Return type

malaya.model.extractive_summarization.SKLearn
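
A minimal sketch of the sklearn extractive summarizer. The docstring only requires objects with a fit_transform method; this sketch passes instantiated sklearn objects as an assumption, and sentence_level is documented under malaya.model.extractive_summarization.SKLearn below:

>>> import malaya
>>> from sklearn.decomposition import TruncatedSVD
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = ['Kerajaan mengumumkan bajet baharu.', 'Bajet itu memberi tumpuan kepada rakyat.', 'Pembangkang mempersoalkan perbelanjaan itu.']
>>> model = malaya.summarization.extractive.sklearn(TruncatedSVD(n_components=2), TfidfVectorizer())
>>> model.sentence_level(corpus, top_k=2)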

malaya.topic_model.decomposition#

malaya.topic_model.decomposition.fit(corpus, model, vectorizer, n_topics, cleaning=<function simple_textcleaning>, stopwords=<function get_stopwords>, **kwargs)[source]#

Train a sklearn model to do topic modelling based on the given corpus.

Parameters
  • corpus (list) –

  • model (object) –

    Should have fit_transform method. Commonly:

    • sklearn.decomposition.TruncatedSVD - LSA algorithm.

    • sklearn.decomposition.LatentDirichletAllocation - LDA algorithm.

    • sklearn.decomposition.NMF - NMF algorithm.

  • vectorizer (object) –

    Should have fit_transform method. Commonly:

    • sklearn.feature_extraction.text.TfidfVectorizer - TFIDF algorithm.

    • sklearn.feature_extraction.text.CountVectorizer - Bag-of-Word algorithm.

    • malaya.text.vectorizer.SkipGramCountVectorizer - Skip Gram Bag-of-Word algorithm.

    • malaya.text.vectorizer.SkipGramTfidfVectorizer - Skip Gram TFIDF algorithm.

  • n_topics (int, (default=10)) – size of decomposition column.

  • cleaning (function, (default=malaya.text.function.simple_textcleaning)) – function to clean the corpus.

  • stopwords (List[str], (default=malaya.text.function.get_stopwords)) – A callable that returns a List[str], or a List[str], or a Tuple[str].

Returns

result

Return type

malaya.topic_model.decomposition.Topic class
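
A minimal sketch of decomposition-based topic modelling. The docstring only states that model and vectorizer need a fit_transform method; passing the LatentDirichletAllocation class together with an instantiated CountVectorizer is an assumption here, and the corpus is illustrative:

>>> import malaya
>>> from sklearn.decomposition import LatentDirichletAllocation
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = ['kerajaan fokus kepada ekonomi', 'rakyat mahu harga barang turun', 'ekonomi dijangka pulih tahun depan']
>>> topic = malaya.topic_model.decomposition.fit(corpus, LatentDirichletAllocation, CountVectorizer(), n_topics=3)
>>> topic.top_topics(3, top_n=5, return_df=True)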

class malaya.topic_model.decomposition.Topic[source]#
visualize_topics(notebook_mode=False, mds='pcoa')[source]#

Visualize important topics based on decomposition.

Parameters

mds (str, optional (default='pcoa')) –

2D Decomposition. Allowed values:

  • 'pcoa' - Dimension reduction via Jensen-Shannon Divergence & Principal Coordinate Analysis (aka Classical Multidimensional Scaling)

  • 'mmds' - Dimension reduction via Multidimensional scaling

  • 'tsne' - Dimension reduction via t-distributed stochastic neighbor embedding

top_topics(len_topic, top_n=10, return_df=True)[source]#

Print important topics based on decomposition.

Parameters
  • len_topic (int) – size of topics.

  • top_n (int, optional (default=10)) – top n of each topic.

  • return_df (bool, optional (default=True)) – return as pandas.DataFrame, else JSON.

get_topics(len_topic)[source]#

Return important topics based on decomposition.

Parameters

len_topic (int) –

Returns

result

Return type

List[str]

get_sentences(len_sentence, k=0)[source]#

Return important sentences related to selected column based on decomposition.

Parameters
  • len_sentence (int) –

  • k (int, (default=0)) – index of decomposition matrix.

Returns

result

Return type

List[str]

malaya.topic_model.transformer#

class malaya.topic_model.transformer.AttentionTopic[source]#
top_topics(len_topic, top_n=10, return_df=True)[source]#

Print important topics based on decomposition.

Parameters
  • len_topic (int) – size of topics.

  • top_n (int, optional (default=10)) – top n of each topic.

  • return_df (bool, optional (default=True)) – return as pandas.DataFrame, else JSON.

get_topics(len_topic)[source]#

Return important topics based on decomposition.

Parameters

len_topic (int) – size of topics.

Returns

result

Return type

List[str]

malaya.zero_shot.classification#

malaya.zero_shot.classification.huggingface(model='mesolitica/finetune-mnli-nanot5-small', force_check=True, **kwargs)[source]#

Load HuggingFace model for zero-shot text classification.

Parameters
  • model (str, optional (default='mesolitica/finetune-mnli-nanot5-small')) – Check available models at malaya.zero_shot.classification.available_huggingface.

  • force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.

Returns

result

Return type

malaya.torch_model.huggingface.ZeroShotClassification

malaya.cluster#

malaya.cluster.cluster_words(list_words, lowercase=False)[source]#

Cluster similar words based on structure, e.g. ['mahathir mohamad', 'mahathir'] -> ['mahathir mohamad']. Complexity is O(n^2).

Parameters
  • list_words (List[str]) –

  • lowercase (bool, optional (default=False)) – if True, will group using lowercase but maintain the original form.

Returns

string

Return type

List[str]
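
A short sketch of word clustering, extending the example from the docstring:

>>> import malaya
>>> malaya.cluster.cluster_words(['mahathir mohamad', 'mahathir', 'najib razak', 'najib'])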

malaya.cluster.cluster_pos(result)[source]#

cluster similar POS.

Parameters

result (List[Tuple[str, str]]) –

Returns

result

Return type

Dict[str, List[str]]

malaya.cluster.cluster_entities(result)[source]#

cluster similar Entities.

Parameters

result (List[Tuple[str, str]]) –

Returns

result

Return type

Dict[str, List[str]]

malaya.cluster.cluster_tagging(result)[source]#

Cluster any tagging results, as long as the data passed is [(string, label), (string, label)].

Parameters

result (List[Tuple[str, str]]) –

Returns

result

Return type

Dict[str, List[str]]

malaya.constituency#

malaya.constituency.huggingface(model='mesolitica/constituency-parsing-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]#

Load HuggingFace model for constituency parsing.

Parameters
  • model (str, optional (default='mesolitica/constituency-parsing-t5-small-standard-bahasa-cased')) – Check available models at malaya.constituency.available_huggingface.

  • force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.

Returns

result

Return type

malaya.torch_model.huggingface.Constituency

malaya.dependency#

malaya.dependency.dependency_graph(tagging, indexing)[source]#

Return a helper object for dependency parser results. Only accepts tagging and indexing outputs from dependency models.

malaya.dependency.huggingface(model='mesolitica/finetune-dependency-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]#

Load HuggingFace model for dependency parsing.

Parameters
  • model (str, optional (default='mesolitica/finetune-dependency-t5-small-standard-bahasa-cased')) – Check available models at malaya.dependency.available_huggingface().

  • force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.

Returns

result

Return type

malaya.torch_model.huggingface.Dependency

malaya.embedding#

malaya.embedding.huggingface(model='mesolitica/embedding-malaysian-mistral-64M-32k', force_check=True, **kwargs)[source]#

Load HuggingFace model for embedding task.

Parameters
  • model (str, optional (default='mesolitica/embedding-malaysian-mistral-64M-32k')) – Check available models at malaya.embedding.available_huggingface.

  • force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.

Returns

result

Return type

malaya.torch_model.huggingface.Embedding

malaya.emotion#

malaya.emotion.multinomial(**kwargs)[source]#

Load multinomial emotion model.

Returns

result

Return type

malaya.model.ml.MulticlassBayes class

malaya.emotion.huggingface(model='mesolitica/emotion-analysis-nanot5-small-malaysian-cased', force_check=True, **kwargs)[source]#

Load HuggingFace model to classify emotion.

Parameters
  • model (str, optional (default='mesolitica/emotion-analysis-nanot5-small-malaysian-cased')) – Check available models at malaya.emotion.available_huggingface.

  • force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.

Returns

result

Return type

malaya.torch_model.huggingface.Classification

malaya.entity#

malaya.entity.huggingface(model='mesolitica/ner-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]#

Load HuggingFace model for Entity Recognition.

Parameters
  • model (str, optional (default='mesolitica/ner-t5-small-standard-bahasa-cased')) – Check available models at malaya.entity.available_huggingface.

  • force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.

Returns

result

Return type

malaya.torch_model.huggingface.Tagging

malaya.entity.general_entity(model=None)[source]#

Load regex-based general entity tagging along with another supervised entity tagging model.

Parameters

model (object) – model must have a predict method. Make sure the predict method returns [(string, label), (string, label)].

Returns

result

Return type

malaya.text.entity.EntityRegex class

malaya.jawi#

malaya.jawi.huggingface(model='mesolitica/jawi-nanot5-small-malaysian-cased', force_check=True, **kwargs)[source]#

Load HuggingFace model to translate.

Parameters
  • model (str, optional (default='mesolitica/jawi-nanot5-small-malaysian-cased')) – Check available models at malaya.jawi.available_huggingface.

  • force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.

Returns

result

Return type

malaya.torch_model.huggingface.Translation

malaya.knowledge_graph#

malaya.knowledge_graph.huggingface(model='mesolitica/finetune-ttkg-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]#

Load HuggingFace model to convert text to triplet format knowledge graph.

Parameters
  • model (str, optional (default='mesolitica/finetune-ttkg-t5-small-standard-bahasa-cased')) – Check available models at malaya.knowledge_graph.available_huggingface.

  • force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.

Returns

result

Return type

malaya.torch_model.huggingface.TexttoKG

malaya.language_detection#

malaya.language_detection.fasttext(model='mesolitica/fasttext-language-detection-v2', quantized=True, **kwargs)[source]#

Load Fasttext language detection model.

Parameters
  • model (str, optional (default='mesolitica/fasttext-language-detection-v2')) –

  • quantized (bool, optional (default=True)) – if True, load quantized fasttext model. Else, load original fasttext model.

Returns

result

Return type

malaya.model.ml.LanguageDetection class

malaya.language_detection.substring_rules(model, **kwargs)[source]#

Detect EN, MS, MANDARIN and OTHER languages in a string.

EN word detection uses pyenchant from https://pyenchant.github.io/pyenchant/ and the user language detection model.

MS word detection uses malaya.text.function.is_malay and the user language detection model.

OTHER word detection uses any language detection classification model, such as malaya.language_detection.fasttext.

Parameters

model (Callable) – Callable model, must have predict method.

Returns

result

Return type

malaya.model.rules.LanguageDict class
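
A minimal sketch combining the two loaders above; it assumes the FastText wrapper exposes a predict method, which is not documented in this section:

>>> import malaya
>>> fasttext_model = malaya.language_detection.fasttext()
>>> fasttext_model.predict(['suka makan ayam dan daging'])  # assumed interface
>>> lang_detector = malaya.language_detection.substring_rules(model=fasttext_model)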

malaya.language_model#

malaya.language_model.kenlm(model='dump-combined', **kwargs)[source]#

Load KenLM language model.

Parameters

model (str, optional (default='dump-combined')) – Check available models at malaya.language_model.available_kenlm.

Returns

result

Return type

kenlm.Model class
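
A short sketch of loading KenLM and scoring a sentence; score is the standard kenlm.Model method and is also the interface expected by malaya.spelling_correction.probability.load(language_model=...):

>>> import malaya
>>> lm = malaya.language_model.kenlm(model='dump-combined')
>>> lm.score('saya suka makan ayam')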

malaya.language_model.gpt2(model='mesolitica/gpt2-117m-bahasa-cased', force_check=True, **kwargs)[source]#

Load GPT2 language model.

Parameters
  • model (str, optional (default='mesolitica/gpt2-117m-bahasa-cased')) – Check available models at malaya.language_model.available_gpt2.

  • force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.

Returns

result

Return type

malaya.torch_model.gpt2_lm.LM class

malaya.language_model.mlm(model='mesolitica/malaysian-debertav2-base', force_check=True, **kwargs)[source]#

Load Masked language model.

Parameters
  • model (str, optional (default='mesolitica/malaysian-debertav2-base')) – Check available models at malaya.language_model.available_mlm.

  • force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.

Returns

result

Return type

malaya.torch_model.mask_lm.MLMScorer class

malaya.llm#

malaya.nsfw#

malaya.nsfw.lexicon(**kwargs)[source]#

Load Lexicon NSFW model.

Returns

result

Return type

malaya.text.lexicon.nsfw.Lexicon class

malaya.nsfw.multinomial(**kwargs)[source]#

Load multinomial NSFW model.

Returns

result

Return type

malaya.model.ml.BAYES class

malaya.num2word#

malaya.num2word.to_cardinal(number)[source]#

Translate from number input to cardinal text representation

Parameters

number (real number) –

Returns

result – cardinal representation

Return type

str

malaya.num2word.to_ordinal(number)[source]#

Translate from number input to ordinal text representation

Parameters

number (real number) –

Returns

result – ordinal representation

Return type

str

malaya.num2word.to_ordinal_num(number)[source]#

Translate from number input to ordinal numbering text representation

Parameters

number (int) –

Returns

result – ordinal numbering representation

Return type

str

malaya.num2word.to_currency(value)[source]#

Translate from number input to cardinal currency text representation

Parameters

value (int) –

Returns

result – cardinal currency representation

Return type

str

malaya.num2word.to_year(value)[source]#

Translate from number input to cardinal year text representation

Parameters

value (int) –

Returns

result – cardinal year representation

Return type

str
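
Short usage sketches for num2word; the expected outputs follow the cardinal and ordinal examples used elsewhere in this reference:

>>> import malaya
>>> malaya.num2word.to_cardinal(123)
'seratus dua puluh tiga'
>>> malaya.num2word.to_ordinal(10)
'kesepuluh'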

malaya.paraphrase#

malaya.paraphrase.huggingface(model='mesolitica/finetune-paraphrase-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]#

Load HuggingFace model to paraphrase.

Parameters
  • model (str, optional (default='mesolitica/finetune-paraphrase-t5-small-standard-bahasa-cased')) – Check available models at malaya.paraphrase.available_huggingface.

  • force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.

Returns

result

Return type

malaya.torch_model.huggingface.Paraphrase

malaya.pos#

malaya.pos.huggingface(model='mesolitica/pos-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]#

Load HuggingFace model for Part-of-Speech Recognition.

Parameters
  • model (str, optional (default='mesolitica/pos-t5-small-standard-bahasa-cased')) – Check available models at malaya.pos.available_huggingface.

  • force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.

Returns

result

Return type

malaya.torch_model.huggingface.Tagging

malaya.preprocessing#

malaya.preprocessing.preprocessing(normalize=['url', 'email', 'percent', 'money', 'phone', 'user', 'time', 'date', 'number'], annotate=['allcaps', 'elongated', 'repeated', 'emphasis', 'censored', 'hashtag'], lowercase=True, fix_unidecode=True, expand_english_contractions=True, segmenter=None, demoji=None, **kwargs)[source]#

Load Preprocessing class.

Parameters
  • normalize (List[str], optional (default=['url', 'email', 'percent', 'money', 'phone', 'user', 'time', 'date', 'number'])) – normalizing tokens, can check all supported normalizing at malaya.preprocessing.get_normalize().

  • annotate (List[str], optional (default=['hashtag', 'allcaps', 'elongated', 'repeated', 'emphasis', 'censored'])) – annotate tokens with <open></open> tags; only accepts ['hashtag', 'allcaps', 'elongated', 'repeated', 'emphasis', 'censored'].

  • lowercase (bool, optional (default=True)) –

  • fix_unidecode (bool, optional (default=True)) – fix unidecode using ftfy.fix_text.

  • expand_english_contractions (bool, optional (default=True)) – expand english contractions.

  • segmenter (Callable, optional (default=None)) – function to segment words. If provided, it will expand hashtags, e.g. #mondayblues -> monday blues.

  • demoji (object) – demoji object; must have a demoji method.

Returns

result

Return type

malaya.preprocessing.Preprocessing class

malaya.preprocessing.demoji()[source]#

Download the latest Malay emoji descriptions from https://github.com/huseinzol05/malay-dataset/tree/master/dictionary/emoji

Returns

result

Return type

malaya.preprocessing.Demoji class

class malaya.preprocessing.Preprocessing[source]#
class malaya.preprocessing.Demoji[source]#
demoji(string)[source]#

Find emojis with string representation. 🔥 -> emoji api.

Parameters

string (str) –

Returns

result

Return type

Dict[str]
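
A short sketch of the emoji helper: download the descriptions, then map emojis found in a string:

>>> import malaya
>>> demoji_model = malaya.preprocessing.demoji()
>>> demoji_model.demoji('saya sangat suka 🔥')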

malaya.segmentation#

malaya.segmentation.huggingface(model='mesolitica/finetune-segmentation-t5-tiny-standard-bahasa-cased', force_check=True, **kwargs)[source]#

Load HuggingFace model for segmentation.

Parameters
  • model (str, optional (default='mesolitica/finetune-segmentation-t5-tiny-standard-bahasa-cased')) – Check available models at malaya.segmentation.available_huggingface.

  • force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.

Returns

result

Return type

malaya.torch_model.huggingface.Generator

malaya.sentiment#

malaya.sentiment.multinomial(**kwargs)[source]#

Load multinomial sentiment model.

Returns

result

Return type

malaya.model.ml.Bayes class

malaya.sentiment.huggingface(model='mesolitica/sentiment-analysis-nanot5-small-malaysian-cased', force_check=True, **kwargs)[source]#

Load HuggingFace model to classify sentiment.

Parameters
  • model (str, optional (default='mesolitica/sentiment-analysis-nanot5-small-malaysian-cased')) – Check available models at malaya.sentiment.available_huggingface.

  • force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.

Returns

result

Return type

malaya.torch_model.huggingface.Classification
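
A minimal sketch of sentiment classification; it assumes the returned Classification wrapper exposes a predict method, like the other classification models in this reference:

>>> import malaya
>>> model = malaya.sentiment.huggingface()
>>> model.predict(['kerajaan sebenarnya sangat prihatin dengan rakyat'])  # assumed interface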

malaya.stack#

malaya.stack.voting_stack(models, text)[source]#

Stacking for POS, Entities and Dependency models.

Parameters
  • models (list) – list of models.

  • text (str) – string to predict.

Returns

result

Return type

list

malaya.stack.predict_stack(models, strings, aggregate=<function gmean>, **kwargs)[source]#

Stacking for predictive models.

Parameters
  • models (List[Callable]) – list of models.

  • strings (List[str]) –

  • aggregate (Callable, optional (default=scipy.stats.mstats.gmean)) – Aggregate function.

Returns

result

Return type

dict
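
A minimal sketch of stacking two sentiment models; it assumes both models expose the probability interface that predict_stack aggregates:

>>> import malaya
>>> multinomial = malaya.sentiment.multinomial()
>>> hf = malaya.sentiment.huggingface()
>>> malaya.stack.predict_stack([multinomial, hf], ['kerajaan sebenarnya sangat prihatin dengan rakyat'])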

malaya.stem#

malaya.stem.naive()[source]#

Load a naive stemming model that uses startswith and endswith regex patterns.

Returns

result

Return type

malaya.stem.Naive class

malaya.stem.sastrawi()[source]#

Load stemming model using Sastrawi; this also includes lemmatization.

Returns

result

Return type

malaya.stem.Sastrawi class

malaya.stem.huggingface(model='mesolitica/stem-lstm-512', force_check=True, **kwargs)[source]#

Load HuggingFace model for stemming and lemmatization.

Parameters
  • model (str, optional (default='mesolitica/stem-lstm-512')) – Check available models at malaya.stem.available_huggingface.

  • force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.

Returns

result

Return type

malaya.torch_model.rnn.Stem

malaya.syllable#

malaya.syllable.rules(**kwargs)[source]#

Load rule-based syllable tokenizer, originally from https://github.com/fahadh4ilyas/syllable_splitter/blob/master/SyllableSplitter.py.

  • improved cuaca double vowel ua based on https://en.wikipedia.org/wiki/Comparison_of_Indonesian_and_Standard_Malay#Syllabification

  • improved rans double consonant ns based on https://www.semanticscholar.org/paper/Syllabification-algorithm-based-on-syllable-rules-Musa-Kadir/a819f255f066ae0fd7a30b3534de41da37d04ea1

  • improved au and ai double vowels.

Returns

result

Return type

malaya.syllable.Tokenizer class

malaya.syllable.huggingface(model='mesolitica/syllable-lstm', force_check=True, **kwargs)[source]#

Load HuggingFace model for syllable tokenization.

Parameters
  • model (str, optional (default='mesolitica/syllable-lstm')) – Check available models at malaya.syllable.available_huggingface.

  • force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.

Returns

result

Return type

malaya.torch_model.rnn.Syllable

malaya.tatabahasa#

malaya.tatabahasa.huggingface(model='mesolitica/finetune-tatabahasa-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]#

Load HuggingFace model to fix kesalahan tatabahasa (grammatical errors).

Parameters
  • model (str, optional (default='mesolitica/finetune-tatabahasa-t5-small-standard-bahasa-cased')) – Check available models at malaya.tatabahasa.available_huggingface.

  • force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.

Returns

result

Return type

malaya.torch_model.huggingface.Tatabahasa

malaya.tokenizer#

class malaya.tokenizer.Tokenizer[source]#
tokenize(string, lowercase=False)[source]#

Tokenize string into words.

Parameters
  • string (str) –

  • lowercase (bool, optional (default=False)) –

Returns

result

Return type

List[str]

class malaya.tokenizer.SentenceTokenizer[source]#
tokenize(string, minimum_length=5)[source]#

Tokenize string into multiple strings.

Parameters
  • string (str) –

  • minimum_length (int, optional (default=5)) – minimum length to assume a string is a sentence, default 5 characters.

Returns

result

Return type

List[str]
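
Short sketches of both tokenizers, assuming the default constructors (constructor options are not documented in this section):

>>> import malaya
>>> tokenizer = malaya.tokenizer.Tokenizer()
>>> tokenizer.tokenize('Husein tak suka makan ayam pada 1/1/2023')
>>> s_tokenizer = malaya.tokenizer.SentenceTokenizer()
>>> s_tokenizer.tokenize('Husein suka ayam. Dia juga suka ikan.')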

malaya.transformer#

malaya.transformer.huggingface(model='mesolitica/electra-base-generator-bahasa-cased', **kwargs)[source]#

Load transformer model.

Parameters
  • model (str, optional (default='mesolitica/electra-base-generator-bahasa-cased')) – Check available models at malaya.transformer.available_huggingface.

  • force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.

malaya.translation#

malaya.translation.word(model='mesolitica/word-en-ms', **kwargs)[source]#

Load word dictionary, based on google translate.

Parameters
  • model (str, optional (default='mesolitica/word-en-ms')) – Check available models at malaya.translation.available_word.

Returns

result

Return type

Dict[str, str]
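
A short sketch of the word-level dictionary; since it is a plain Dict[str, str], standard dict lookups apply (the key shown is illustrative):

>>> import malaya
>>> en_ms = malaya.translation.word()
>>> en_ms.get('chicken')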

malaya.translation.huggingface(model='mesolitica/translation-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]#

Load HuggingFace model to translate.

Parameters
  • model (str, optional (default='mesolitica/translation-t5-small-standard-bahasa-cased')) – Check available models at malaya.translation.available_huggingface.

  • force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.

Returns

result

Return type

malaya.torch_model.huggingface.Translation

malaya.true_case#

malaya.true_case.huggingface(model='mesolitica/finetune-true-case-t5-tiny-standard-bahasa-cased', force_check=True, **kwargs)[source]#

Load HuggingFace model for true casing.

Parameters
  • model (str, optional (default='mesolitica/finetune-true-case-t5-tiny-standard-bahasa-cased')) – Check available models at malaya.true_case.available_huggingface.

  • force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.

Returns

result

Return type

malaya.torch_model.huggingface.Generator

malaya.word2num#

malaya.word2num.word2num(string)[source]#

Translate from string to number, eg ‘kesepuluh’ -> 10.

Parameters

string (str) –

Returns

result

Return type

int / float
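
A short sketch of word2num using the example from the docstring:

>>> import malaya
>>> malaya.word2num.word2num('kesepuluh')
10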

malaya.wordvector#

malaya.wordvector.load(model='wikipedia', **kwargs)[source]#

Load pretrained word vectors.

Parameters

model (str, optional (default='wikipedia')) – Check available models at malaya.wordvector.available_wordvector.

Returns

  • vocabulary (indices dictionary for vector.)

  • vector (np.array, 2D.)

malaya.model.extractive_summarization#

class malaya.model.extractive_summarization.SKLearn[source]#
word_level(corpus, isi_penting=None, window_size=10, important_words=10, **kwargs)[source]#

Summarize list of strings / string on word level.

Parameters
  • corpus (str / List[str]) –

  • isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.

  • window_size (int, (default=10)) – window size for each word.

  • important_words (int, (default=10)) – number of important words.

Returns

dict

Return type

{‘top-words’, ‘cluster-top-words’, ‘score’}

sentence_level(corpus, isi_penting=None, top_k=3, important_words=10, **kwargs)[source]#

Summarize list of strings / string on sentence level.

Parameters
  • corpus (str / List[str]) –

  • isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.

  • top_k (int, (default=3)) – number of summarized strings.

  • important_words (int, (default=10)) – number of important words.

Returns

dict

Return type

{‘summary’, ‘top-words’, ‘cluster-top-words’, ‘score’}

class malaya.model.extractive_summarization.Doc2Vec[source]#
word_level(corpus, isi_penting=None, window_size=10, aggregation=<function mean>, soft=False, **kwargs)[source]#

Summarize list of strings / string on word level.

Parameters
  • corpus (str / List[str]) –

  • isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.

  • window_size (int, (default=10)) – window size for each word.

  • aggregation (Callable, optional (default=numpy.mean)) – Aggregation method for Doc2Vec.

  • soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the nearest word by JaroWinkler ratio; if False, it will return an embedding filled with zeros.

Returns

dict

Return type

{‘score’}

sentence_level(corpus, isi_penting=None, top_k=3, aggregation=<function mean>, soft=False, **kwargs)[source]#

Summarize list of strings / string on sentence level.

Parameters
  • corpus (str / List[str]) –

  • isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.

  • top_k (int, (default=3)) – number of summarized strings.

  • aggregation (Callable, optional (default=numpy.mean)) – Aggregation method for Doc2Vec.

  • soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the nearest word by JaroWinkler ratio; if False, it will return an embedding filled with zeros.

Returns

dict

Return type

{‘summary’, ‘score’}

class malaya.model.extractive_summarization.Encoder[source]#
word_level(corpus, isi_penting=None, window_size=10, important_words=10, batch_size=16, **kwargs)[source]#

Summarize list of strings / string on word level.

Parameters
  • corpus (str / List[str]) –

  • isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.

  • window_size (int, (default=10)) – window size for each word.

  • important_words (int, (default=10)) – number of important words.

  • batch_size (int, (default=16)) – for each feed-forward pass, only this many texts are fed per batch, to prevent OOM.

Returns

dict

Return type

{‘summary’, ‘top-words’, ‘cluster-top-words’, ‘score’}

sentence_level(corpus, isi_penting=None, top_k=3, important_words=10, batch_size=16, **kwargs)[source]#

Summarize list of strings / string on sentence level.

Parameters
  • corpus (str / List[str]) –

  • isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.

  • top_k (int, (default=3)) – number of summarized strings.

  • important_words (int, (default=10)) – number of important words.

  • batch_size (int, (default=16)) – for each feed-forward pass, only this many texts are fed per batch, to prevent OOM.

Returns

dict

Return type

{‘summary’, ‘top-words’, ‘cluster-top-words’, ‘score’}
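
A minimal sketch of calling sentence_level; the loader for Encoder is not part of this section, so the instance is passed in as an assumption:

    def extractive_summary(model, corpus, top_k=3):
        # `model` is assumed to be a malaya.model.extractive_summarization.Encoder
        # instance obtained from the usual loader (not shown in this section)
        r = model.sentence_level(corpus, top_k=top_k, important_words=10, batch_size=16)
        return r['summary'], r['score']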

malaya.model.ml#

class malaya.model.ml.MulticlassBayes[source]#
predict(strings)[source]#

classify list of strings.

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

predict_proba(strings)[source]#

classify list of strings and return probability.

Parameters

strings (List[str]) –

Returns

result

Return type

List[dict[str, float]]

class malaya.model.ml.BinaryBayes[source]#
predict(strings, add_neutral=True)[source]#

classify list of strings.

Parameters
  • strings (List[str]) –

  • add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.

Returns

result

Return type

List[str]

predict_proba(strings, add_neutral=True)[source]#

classify list of strings and return probability.

Parameters
  • strings (List[str]) –

  • add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.

Returns

result

Return type

List[dict[str, float]]

class malaya.model.ml.MultilabelBayes[source]#
predict(strings)[source]#

classify list of strings.

Parameters

strings (List[str]) –

Returns

result

Return type

List[List[str]]

predict_proba(strings)[source]#

classify list of strings and return probability.

Parameters

strings (list) –

Returns

result

Return type

List[dict[str, float]]

malaya.model.rules#

class malaya.model.rules.LanguageDict[source]#
predict(words, acceptable_ms_label=['malay', 'ind'], acceptable_en_label=['eng', 'manglish'], ignore_capital=False, use_is_malay=True, predict_mandarin=False)[source]#

Predict [EN, MS, OTHERS, CAPITAL, NOT_LANG] on word level. This method assumes the input has already been tokenized.

Parameters
  • words (List[str]) –

  • acceptable_ms_label (List[str], optional (default = ['malay', 'ind'])) – accept labels from language detection model to assume a word is MS.

  • acceptable_en_label (List[str], optional (default = ['eng', 'manglish'])) – accept labels from language detection model to assume a word is EN.

  • ignore_capital (bool, optional (default=False)) – if True, will predict the language for capitalized words.

  • use_is_malay (bool, optional (default=True)) – if True, will predict MS words using malaya.dictionary.is_malay, else use the language detection model.

  • predict_mandarin (bool, optional (default=False)) – if True, will slide through the string to match against the pinyin dictionary.

Returns

result

Return type

List[str]
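
A minimal sketch; the loader for LanguageDict is not part of this section, so the instance is passed in as an assumption, and the words must already be tokenized:

    def label_languages(model, words):
        # `model` is assumed to be a malaya.model.rules.LanguageDict instance
        # obtained from the usual loader (not shown in this section)
        labels = model.predict(words, acceptable_ms_label=['malay', 'ind'])
        return list(zip(words, labels))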

malaya.torch_model.gpt2_lm#

malaya.torch_model.huggingface#

class malaya.torch_model.huggingface.Generator[source]#
generate(strings, return_generate=False, prefix=None, **kwargs)[source]#

Generate texts from the input.

Parameters
Returns

result

Return type

List[str]

alignment(source, target)[source]#

align texts using cross attention and dtw-python.

Parameters
  • source (List[str]) –

  • target (List[str]) –

Returns

result

Return type

Dict

class malaya.torch_model.huggingface.Prefix[source]#
generate(string, **kwargs)[source]#

Generate texts from the input.

Parameters
Returns

result

Return type

List[str]

class malaya.torch_model.huggingface.Paraphrase[source]#
generate(strings, postprocess=True, **kwargs)[source]#

Generate texts from the input.

Parameters
Returns

result

Return type

List[str]

class malaya.torch_model.huggingface.Summarization[source]#
generate(strings, postprocess=True, n=2, threshold=0.1, reject_similarity=0.85, **kwargs)[source]#

Generate texts from the input.

Parameters
  • strings (List[str]) –

  • postprocess (bool, optional (default=True)) – If True, will filter generated sentences using ROUGE score and remove biased generated international news publisher names.

  • n (int, optional (default=2)) – N-gram size for the ROUGE score used to filter.

  • threshold (float, optional (default=0.1)) – minimum threshold for N rouge score to select a sentence.

  • reject_similarity (float, optional (default=0.85)) – reject similar sentences while maintaining position.

  • **kwargs (keyword arguments passed to the HuggingFace generate method.) – Read more at https://huggingface.co/docs/transformers/main_classes/text_generation

Returns

result

Return type

List[str]
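
A minimal sketch of generate with the documented defaults made explicit; the loader for Summarization is not part of this section, so the instance is passed in as an assumption:

    def abstractive_summary(model, strings):
        # `model` is assumed to be a malaya.torch_model.huggingface.Summarization instance
        return model.generate(
            strings,
            postprocess=True,        # filter generated sentences using ROUGE score
            n=2,                     # bigram ROUGE
            threshold=0.1,
            reject_similarity=0.85,
            max_length=256,          # forwarded to the HuggingFace generate method
        )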

class malaya.torch_model.huggingface.Similarity[source]#
predict_proba(strings_left, strings_right)[source]#

calculate similarity between two different batches of texts.

Parameters
  • strings_left (List[str]) –

  • strings_right (List[str]) –

Returns

list

Return type

List[float]

class malaya.torch_model.huggingface.ZeroShotClassification[source]#
predict_proba(strings, labels, prefix='ayat ini berkaitan tentang ', multilabel=True)[source]#

classify list of strings and return probability.

Parameters
  • strings (List[str]) –

  • labels (List[str]) –

  • prefix (str, optional (default='ayat ini berkaitan tentang ')) – prefix applied to each label for zero-shot classification. Experimenting with the prefix can yield better results.

  • multilabel (bool, optional (default=True)) – if True, labels are scored independently, so probabilities can sum to more than 1.0.

Returns

list

Return type

List[Dict[str, float]]
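
A minimal sketch; the loader for ZeroShotClassification is not part of this section, so the instance is passed in as an assumption and the labels are illustrative:

    def classify_topics(model, strings):
        # `model` is assumed to be a malaya.torch_model.huggingface.ZeroShotClassification instance
        return model.predict_proba(
            strings,
            labels=['sukan', 'politik', 'ekonomi'],   # illustrative labels
            prefix='ayat ini berkaitan tentang ',     # documented default prefix
            multilabel=True,
        )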

class malaya.torch_model.huggingface.ExtractiveQA[source]#
predict(paragraph_text, question_texts, validate_answers=True, validate_questions=False, minimum_threshold_question=0.05, **kwargs)[source]#

Predict extractive answers from questions given a paragraph.

Parameters
  • paragraph_text (str) –

  • question_texts (List[str]) – List of questions; results depend heavily on the casing of the questions.

  • validate_answers (bool, optional (default=True)) – if True, will check the answer is inside the paragraph.

  • validate_questions (bool, optional (default=False)) – if True, validate that each question is a subset of the paragraph using sklearn.feature_extraction.text.CountVectorizer; only useful if paragraph_text and question_texts are in the same language.

  • minimum_threshold_question (float, optional (default=0.05)) – minimum score from cosine_similarity, only useful if validate_questions = True.

  • **kwargs (keyword arguments passed to the HuggingFace generate method.) – Read more at https://huggingface.co/docs/transformers/main_classes/text_generation

Returns

result

Return type

List[str]
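
A minimal sketch; the loader for ExtractiveQA is not part of this section, so the instance is passed in as an assumption:

    def answer_questions(model, paragraph_text, question_texts):
        # `model` is assumed to be a malaya.torch_model.huggingface.ExtractiveQA instance
        return model.predict(
            paragraph_text,
            question_texts,
            validate_answers=True,            # keep only answers found inside the paragraph
            validate_questions=False,
            minimum_threshold_question=0.05,
            max_length=64,                     # forwarded to the HuggingFace generate method
        )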

class malaya.torch_model.huggingface.Transformer[source]#
vectorize(strings, method='last', method_token='first', t5_head_logits=True, **kwargs)[source]#

Vectorize string inputs.

Parameters
  • strings (List[str]) –

  • method (str, optional (default='last')) –

    hidden layers supported. Allowed values:

    • 'last' - last layer.

    • 'first' - first layer.

    • 'mean' - average all layers.

    This is only applicable for non-T5 models.

  • method_token (str, optional (default='first')) –

    token layers supported. Allowed values:

    • 'last' - last token.

    • 'first' - first token.

    • 'mean' - average all tokens.

    Pretrained models are usually trained on the first token for classification tasks. This is only applicable for non-T5 models.

  • t5_head_logits (bool, optional (default=True)) – if True, will take the head logits, else the last token. This is only applicable for T5 models.

Returns

result

Return type

np.array
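
A minimal sketch combining the malaya.transformer.huggingface loader documented above with vectorize, assuming that loader returns a malaya.torch_model.huggingface.Transformer instance; the input string is illustrative:

    import malaya

    model = malaya.transformer.huggingface(
        model='mesolitica/electra-base-generator-bahasa-cased',
    )
    vectors = model.vectorize(
        ['Saya suka makan ayam'],   # illustrative input
        method='last',              # last hidden layer (non-T5 models)
        method_token='first',       # first token, the usual choice for classification-style pooling
    )
    # `vectors` is a np.array, per the Return type above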

attention(strings, method='last', method_head='mean', t5_attention='cross_attentions', **kwargs)[source]#

Get attention weights for string inputs.

Parameters
  • strings (List[str]) –

  • method (str, optional (default='last')) –

    Attention layer supported. Allowed values:

    • 'last' - attention from last layer.

    • 'first' - attention from first layer.

    • 'mean' - average attentions from all layers.

  • method_head (str, optional (default='mean')) –

    attention heads supported. Allowed values:

    • 'last' - attention from the last head.

    • 'first' - attention from the first head.

    • 'mean' - average attentions across all heads.

  • t5_attention (str, optional (default='cross_attentions')) –

    attention type for T5 models. Allowed values:

    • 'cross_attentions' - cross attention.

    • 'encoder_attentions' - encoder attention.

    • 'decoder_attentions' - decoder attention.

    This is only applicable for T5 models.

Returns

result

Return type

List[List[Tuple[str, float]]]

class malaya.torch_model.huggingface.IsiPentingGenerator[source]#
generate(strings, mode='surat-khabar', remove_html_tags=True, **kwargs)[source]#

generate a long text given isi penting (key points).

Parameters
  • strings (List[str]) –

  • mode (str, optional (default='surat-khabar')) –

    Mode supported. Allowed values:

    • 'surat-khabar' - news style writing.

    • 'tajuk-surat-khabar' - headline news style writing.

    • 'artikel' - article style writing.

    • 'penerangan-produk' - product description style writing.

    • 'karangan' - karangan sekolah style writing.

  • remove_html_tags (bool, optional (default=True)) – Will remove html tags using malaya.text.function.remove_html_tags.

  • **kwargs (keyword arguments passed to the HuggingFace generate method.) – Read more at https://huggingface.co/docs/transformers/main_classes/text_generation

Returns

result

Return type

List[str]
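
A minimal sketch; the loader for IsiPentingGenerator is not part of this section, so the instance is passed in as an assumption and the key points are illustrative:

    def generate_article(model, isi_penting):
        # `model` is assumed to be a malaya.torch_model.huggingface.IsiPentingGenerator instance
        return model.generate(
            isi_penting,              # List[str] of key points
            mode='artikel',           # article style writing
            remove_html_tags=True,
            do_sample=True,           # forwarded to the HuggingFace generate method
            max_length=512,
        )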

class malaya.torch_model.huggingface.Tatabahasa[source]#
generate(strings, **kwargs)[source]#

Fix kesalahan tatabahasa (grammatical errors).

Parameters
Returns

result

Return type

List[Tuple[str, int]]

class malaya.torch_model.huggingface.Keyword[source]#
generate(strings, top_keywords=5, **kwargs)[source]#

Generate texts from the input.

Parameters
Returns

result

Return type

List[str]

class malaya.torch_model.huggingface.Translation[source]#
generate(strings, to_lang='ms', **kwargs)[source]#

Generate texts from the input.

Parameters
Returns

result

Return type

List[str]

class malaya.torch_model.huggingface.Classification[source]#
predict(strings)[source]#

classify list of strings.

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

predict_proba(strings)[source]#

classify list of strings and return probability.

Parameters

strings (List[str]) –

Returns

result

Return type

List[dict[str, float]]

class malaya.torch_model.huggingface.Tagging[source]#
predict(string)[source]#

Tag a string.

Parameters

string (str) –

Returns

result

Return type

Tuple[str, str]

analyze(string)[source]#

Analyze a string.

Parameters

string (str) –

Returns

result

Return type

{‘words’: List[str], ‘tags’: [{‘text’: ‘text’, ‘type’: ‘location’, ‘score’: 1.0, ‘beginOffset’: 0, ‘endOffset’: 1}]}

class malaya.torch_model.huggingface.Embedding[source]#
encode(strings)[source]#

Encode strings into embedding.

Parameters

strings (List[str]) –

Returns

result

Return type

np.array

class malaya.torch_model.huggingface.Reranker[source]#
sort(left_string, right_strings)[source]#

Sort the strings.

Parameters
  • left_string (str) – reference string.

  • right_strings (List[str]) – query strings, list of strings need to sort based on reference string.

Returns

result

Return type

np.array

malaya.torch_model.mask_lm#

class malaya.torch_model.mask_lm.MLMScorer[source]#
score(string)[source]#

score a string.

Parameters

string (str) –

Returns

result

Return type

float
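
A minimal sketch; the loader for MLMScorer is not part of this section, so the instance is passed in as an assumption:

    def rank_candidates(scorer, candidates):
        # `scorer` is assumed to be a malaya.torch_model.mask_lm.MLMScorer instance;
        # each candidate string gets a float score, and (score, string) pairs
        # are returned sorted by score
        return sorted((scorer.score(s), s) for s in candidates)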