API#
malaya#
malaya.augmentation.abstractive#
-
malaya.augmentation.abstractive.
huggingface
(model='mesolitica/translation-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]# Load a HuggingFace model for abstractive text augmentation.
- Parameters
model (str, optional (default='mesolitica/translation-t5-small-standard-bahasa-cased')) – Check available models at malaya.augmentation.abstractive.available_huggingface.
force_check (bool, optional (default=True)) – Check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.
- Returns
result
- Return type
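The force_check flag, which recurs on every huggingface loader on this page, can be pictured as a guard against model names outside the published list. The following is a minimal sketch of that idea only, not Malaya's implementation; `AVAILABLE` and `validate_model` are hypothetical names introduced for illustration.

```python
# Hypothetical registry, for illustration only. In Malaya the list of
# supported names is exposed by the module's available_huggingface.
AVAILABLE = {'mesolitica/translation-t5-small-standard-bahasa-cased'}


def validate_model(model: str, force_check: bool = True) -> str:
    """Return the model name, rejecting unknown names when force_check is True."""
    if force_check and model not in AVAILABLE:
        raise ValueError(
            f'{model!r} is not a known model; pass force_check=False to load it anyway'
        )
    return model
```

With force_check=False the name is passed through unchanged, which matches the documented use case of loading your own HuggingFace model.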
malaya.augmentation.rules#
-
malaya.augmentation.rules.
synonym
(string, threshold=0.5, top_n=5, **kwargs)[source]# Augment a string using synonyms from https://github.com/huseinzol05/Malaya-Dataset#90k-synonym
- Parameters
string (str) – the input string is assumed to be properly tokenized and cleaned.
threshold (float, optional (default=0.5)) – threshold for randomly selecting a word to replace.
top_n (int, (default=5)) – number of nearest neighbors returned. The returned list should have length top_n.
- Returns
result
- Return type
List[str]
-
malaya.augmentation.rules.
replace_similar_consonants
(word, threshold=0.5, replace_consonants={'b': ['n'], 'd': ['s', 'f'], 'f': ['p'], 'g': ['f', 'h'], 'j': ['k'], 'k': ['l'], 'n': ['m'], 'r': ['t', 'q']})[source]# Naively replace a consonant with another consonant to simulate a typo or slang, if the consonant is followed by a vowel.
- Parameters
word (str) –
threshold (float, optional (default=0.5)) –
- Returns
result
- Return type
List[str]
-
malaya.augmentation.rules.
replace_similar_vowels
(word, threshold=0.5, replace_vowels={'a': ['o'], 'i': ['o'], 'o': ['u'], 'u': ['o']})[source]# Naively replace a vowel with another vowel to simulate a typo or slang, if the vowel is followed by a consonant.
- Parameters
word (str) –
threshold (float, optional (default=0.5)) –
- Returns
result
- Return type
str
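The rule described for replace_similar_vowels can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the library's code: the function name `replace_vowels_sketch` is hypothetical, and it assumes a replacement happens when a random draw falls below threshold (the documentation does not say which direction the comparison goes).

```python
import random

VOWELS = set('aeiou')
# Same default mapping as shown in the documented signature.
REPLACE_VOWELS = {'a': ['o'], 'i': ['o'], 'o': ['u'], 'u': ['o']}


def replace_vowels_sketch(word, threshold=0.5, replace_vowels=REPLACE_VOWELS, rng=random):
    """Replace a vowel only when the next character is a consonant."""
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ''
        followed_by_consonant = nxt.isalpha() and nxt not in VOWELS
        # Assumption: replace when the random draw is below threshold.
        if ch in replace_vowels and followed_by_consonant and rng.random() < threshold:
            out.append(rng.choice(replace_vowels[ch]))
        else:
            out.append(ch)
    return ''.join(out)
```

Under this assumption, threshold=1.0 replaces every eligible vowel and threshold=0.0 leaves the word unchanged, which makes the two extremes easy to check by hand.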
augmenting a word into socialmedia form.
- Parameters
word (str) –
- Returns
result
- Return type
List[str]
-
malaya.augmentation.rules.
vowel_alternate
(word, threshold=0.5)[source]# Augment a word into its vowel-alternate form.
vowel_alternate('singapore') -> sngpore
vowel_alternate('kampung') -> kmpng
vowel_alternate('ayam') -> aym
- Parameters
word (str) –
threshold (float, optional (default=0.5)) –
- Returns
result
- Return type
str
malaya.dictionary#
-
malaya.dictionary.
keyword_wiktionary
(word, acceptable_lang=['brunei malay', 'malay'])[source]# Crawl https://en.wiktionary.org/wiki/ to check whether a word is a Malay word.
- Parameters
word (str) –
acceptable_lang (List[str], optional (default=['brunei malay', 'malay'])) – acceptable languages in wiktionary section.
- Returns
result
- Return type
Dict
-
malaya.dictionary.
keyword_dbp
(word, parse=False)[source]# Crawl https://prpm.dbp.gov.my/cari1?keyword= to check whether a word is a Malay word.
- Parameters
word (str) –
parse (bool, optional (default=False)) – if True, will parse using BeautifulSoup.
- Returns
result
- Return type
Dict
-
malaya.dictionary.
corpus_dbp
(word)[source]# Crawl http://sbmb.dbp.gov.my/korpusdbp/Search2.aspx to search the corpus for a word.
- Parameters
word (str) –
- Returns
result
- Return type
pandas.core.frame.DataFrame
-
malaya.dictionary.
is_english
(word)[source]# Check whether a word is an English word.
- Parameters
word (str) –
- Returns
result
- Return type
bool
-
malaya.dictionary.
is_malay
(word, stemmer=None)[source]# Check whether a word is a Malay word.
- Parameters
word (str) –
stemmer (Callable, optional (default=None)) – a Callable object, must have stem_word method.
- Returns
result
- Return type
bool
-
malaya.dictionary.
convert_pinyin
(string)[source]# Convert Mandarin characters to pinyin form. Original vocab from https://github.com/lxyu/pinyin. Example: 你好 -> ni hao
- Parameters
string (str) –
- Returns
result
- Return type
str
malaya.generator.isi_penting#
-
malaya.generator.isi_penting.
huggingface
(model='mesolitica/finetune-isi-penting-generator-t5-base-standard-bahasa-cased', force_check=True, **kwargs)[source]# Load a HuggingFace model to generate text based on isi penting (key points).
- Parameters
model (str, optional (default='mesolitica/finetune-isi-penting-generator-t5-base-standard-bahasa-cased')) – Check available models at malaya.generator.isi_penting.available_huggingface.
force_check (bool, optional (default=True)) – Check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.
- Returns
result
- Return type
malaya.keyword.abstractive#
-
malaya.keyword.abstractive.
huggingface
(model='mesolitica/finetune-keyword-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]# Load a HuggingFace model for abstractive keyword generation.
- Parameters
model (str, optional (default='mesolitica/finetune-keyword-t5-small-standard-bahasa-cased')) – Check available models at malaya.keyword.abstractive.available_huggingface().
force_check (bool, optional (default=True)) – Check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.
- Returns
result
- Return type
malaya.keyword.extractive#
-
malaya.keyword.extractive.
rake
(string, vocab=None, model=None, vectorizer=None, top_k=5, atleast=1, stopwords=<function get_stopwords>, **kwargs)[source]# Extract keywords using Rake algorithm.
- Parameters
string (str) –
vocab (List[str], optional (default=None)) – List of important substrings. This will override vectorizer parameter.
model (Object, optional (default=None)) – model must have an attention method.
vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or, malaya.text.vectorizer.SkipGramCountVectorizer. If None, will generate ngram automatically based on stopwords.
top_k (int, optional (default=5)) – return top-k results.
atleast (int, optional (default=1)) – at least count appeared in the string to accept as candidate.
stopwords (List[str], (default=malaya.text.function.get_stopwords)) – a callable that returns a List[str], or a List[str] or Tuple[str] passed directly. Used by the automatic n-gram generator.
- Returns
result
- Return type
Tuple[float, str]
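For readers unfamiliar with RAKE, the scoring idea behind it can be sketched without any dependencies. This is a minimal illustration of the general algorithm, assuming whitespace tokenization and a caller-supplied stopword set; `rake_sketch` is a hypothetical name, and Malaya's own function additionally supports vocab, vectorizer, model and atleast, none of which are reproduced here.

```python
from collections import defaultdict


def rake_sketch(string, stopwords, top_k=5):
    """Score candidate phrases RAKE-style and return (score, phrase) pairs."""
    # Candidate phrases are runs of consecutive non-stopword tokens.
    phrases, current = [], []
    for token in string.lower().split():
        if token in stopwords:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(token)
    if current:
        phrases.append(current)

    # freq: how often a word occurs; degree: total length of phrases it occurs in.
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)

    # A phrase scores the sum of degree / frequency over its words.
    scores = {}
    for phrase in phrases:
        scores[' '.join(phrase)] = sum(degree[w] / freq[w] for w in phrase)

    ranked = sorted(((s, p) for p, s in scores.items()), reverse=True)
    return ranked[:top_k]
```

Longer phrases built from words that rarely appear alone tend to rank highest, which is why multi-word candidates usually dominate the top-k results.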
-
malaya.keyword.extractive.
textrank
(string, vocab=None, model=None, vectorizer=None, top_k=5, atleast=1, stopwords=<function get_stopwords>, **kwargs)[source]# Extract keywords using Textrank algorithm.
- Parameters
string (str) –
vocab (List[str], optional (default=None)) – List of important substrings. This will override vectorizer parameter.
model (Object, optional (default=None)) – model must have a fit_transform or vectorize method.
vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or, malaya.text.vectorizer.SkipGramCountVectorizer. If None, will generate ngram automatically based on stopwords.
top_k (int, optional (default=5)) – return top-k results.
atleast (int, optional (default=1)) – at least count appeared in the string to accept as candidate.
stopwords (List[str], (default=malaya.text.function.get_stopwords)) – a callable that returns a List[str], or a List[str] or Tuple[str] passed directly.
- Returns
result
- Return type
Tuple[float, str]
-
malaya.keyword.extractive.
attention
(string, model, vocab=None, vectorizer=None, top_k=5, atleast=1, stopwords=<function get_stopwords>, **kwargs)[source]# Extract keywords using Attention mechanism.
- Parameters
string (str) –
model (Object) – model must have an attention method.
vocab (List[str], optional (default=None)) – List of important substrings. This will override vectorizer parameter.
vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or, malaya.text.vectorizer.SkipGramCountVectorizer. If None, will generate ngram automatically based on stopwords.
top_k (int, optional (default=5)) – return top-k results.
atleast (int, optional (default=1)) – at least count appeared in the string to accept as candidate.
stopwords (List[str], (default=malaya.text.function.get_stopwords)) – a callable that returns a List[str], or a List[str] or Tuple[str] passed directly.
- Returns
result
- Return type
Tuple[float, str]
-
malaya.keyword.extractive.
similarity
(string, model, vocab=None, vectorizer=None, top_k=5, atleast=1, stopwords=<function get_stopwords>, **kwargs)[source]# Extract keywords by comparing the sentence embedding with each keyword embedding.
- Parameters
string (str) –
model (Object) – Transformer model or any model that has a vectorize method.
vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or, malaya.text.vectorizer.SkipGramCountVectorizer. If None, will generate ngram automatically based on stopwords.
top_k (int, optional (default=5)) – return top-k results.
atleast (int, optional (default=1)) – at least count appeared in the string to accept as candidate.
stopwords (List[str], (default=malaya.text.function.get_stopwords)) – a callable that returns a List[str], or a List[str] or Tuple[str] passed directly.
- Returns
result
- Return type
Tuple[float, str]
malaya.normalizer.rules#
-
malaya.normalizer.rules.
load
(speller=None, stemmer=None, **kwargs)[source]# Load a Normalizer using any spelling correction model.
- Parameters
speller (Callable, optional (default=None)) – function to correct spelling, must have correct or normalize_elongated method.
stemmer (Callable, optional (default=None)) – function to stem, must have stem_word method. If a stemmer is provided, kata imbuhan akhir (suffixes) are stemmed more accurately.
- Returns
result
- Return type
malaya.normalizer.rules.Normalizer class
-
class
malaya.normalizer.rules.
Normalizer
[source]# -
normalize
(string, normalize_text=True, normalize_url=False, normalize_email=False, normalize_year=True, normalize_telephone=True, normalize_date=True, normalize_time=True, normalize_emoji=True, normalize_elongated=True, normalize_hingga=True, normalize_pada_hari_bulan=True, normalize_fraction=True, normalize_money=True, normalize_units=True, normalize_percent=True, normalize_ic=True, normalize_number=True, normalize_x_kali=True, normalize_cardinal=True, normalize_ordinal=True, normalize_entity=True, expand_contractions=True, check_english_func=<function is_english>, check_malay_func=<function is_malay>, translator=None, language_detection_word=None, acceptable_language_detection=['EN', 'CAPITAL', 'NOT_LANG'], segmenter=None, text_scorer=None, text_scorer_window=2, not_a_word_threshold=0.0001, dateparser_settings={'TIMEZONE': 'GMT+8'}, **kwargs)[source]# Normalize a string.
- Parameters
string (str) –
normalize_text (bool, optional (default=True)) – if True, will try to replace shortforms with internal corpus.
normalize_url (bool, optional (default=False)) – if True, replace :// with empty and . with dot. https://huseinhouse.com -> https huseinhouse dot com.
normalize_email (bool, optional (default=False)) – if True, replace @ with di, . with dot. husein.zol05@gmail.com -> husein dot zol kosong lima di gmail dot com.
normalize_year (bool, optional (default=True)) – if True, tahun 1987 -> tahun sembilan belas lapan puluh tujuh. if True, 1970-an -> sembilan belas tujuh puluh an. if False, tahun 1987 -> tahun seribu sembilan ratus lapan puluh tujuh.
normalize_telephone (bool, optional (default=True)) – if True, no 012-1234567 -> no kosong satu dua, satu dua tiga empat lima enam tujuh
normalize_date (bool, optional (default=True)) – if True, 01/12/2001 -> satu disember dua ribu satu. if True, Jun 2017 -> satu Jun dua ribu tujuh belas. if True, 2017 Jun -> satu Jun dua ribu tujuh belas. if False, 2017 Jun -> 01/06/2017. if False, Jun 2017 -> 01/06/2017.
normalize_time (bool, optional (default=True)) – if True, pukul 2.30 -> pukul dua tiga puluh minit. if False, pukul 2.30 -> ‘02:00:00’
normalize_emoji (bool, (default=True)) – if True, 🔥 -> emoji api. Emoji descriptions are loaded from malaya.preprocessing.demoji.
normalize_elongated (bool, optional (default=True)) – if True, betuii -> betui.
normalize_hingga (bool, optional (default=True)) – if True, 2011 - 2019 -> dua ribu sebelas hingga dua ribu sembilan belas
normalize_pada_hari_bulan (bool, optional (default=True)) – if True, pada 10/4 -> pada sepuluh hari bulan empat
normalize_fraction (bool, optional (default=True)) – if True, 10 /4 -> sepuluh per empat
normalize_money (bool, optional (default=True)) – if True, rm10.4m -> sepuluh juta empat ratus ribu ringgit
normalize_units (bool, optional (default=True)) – if True, 61.2 kg -> enam puluh satu perpuluhan dua kilogram
normalize_percent (bool, optional (default=True)) – if True, 0.8% -> kosong perpuluhan lapan peratus
normalize_ic (bool, optional (default=True)) – if True, 911111-01-1111 -> sembilan satu satu satu satu satu sempang kosong satu sempang satu satu satu satu
normalize_number (bool, optional (default=True)) – if True, 0123 -> kosong satu dua tiga
normalize_x_kali (bool, optional (default=True)) – if True, 10x -> sepuluh kali
normalize_cardinal (bool, optional (default=True)) – if True, 123 -> seratus dua puluh tiga
normalize_ordinal (bool, optional (default=True)) – if True, ke-123 -> keseratus dua puluh tiga
normalize_entity (bool, optional (default=True)) – normalize entities; only affects date, datetime, time and money patterns.
expand_contractions (bool, optional (default=True)) – expand english contractions.
check_english_func (Callable, optional (default=malaya.text.function.is_english)) – function to check whether a word is in the English dictionary, default is malaya.text.function.is_english. This parameter is also used for Malay text normalization.
check_malay_func (Callable, optional (default=malaya.text.function.is_malay)) – function to check whether a word is in the Malay dictionary, default is malaya.text.function.is_malay.
translator (Callable, optional (default=None)) – function to translate EN word to MS word.
language_detection_word (Callable, optional (default=None)) – function to detect the language of each word to get better translation results.
acceptable_language_detection (List[str], optional (default=['EN', 'CAPITAL', 'NOT_LANG'])) – only translate substrings if the results from language_detection_word is in acceptable_language_detection.
segmenter (Callable, optional (default=None)) – function to segment a word. If provided, it will expand a word, apaitu -> apa itu
text_scorer (Callable, optional (default=None)) – function to validate an uppercase word. If the lowercase score is higher than or equal to the uppercase score, the lowercase form is chosen.
text_scorer_window (int, optional (default=2)) – size of lookback and lookforward used to validate an uppercase word.
not_a_word_threshold (float, optional (default=1e-4)) – assume a word is not a human word if its score is lower than not_a_word_threshold. Only used if text_scorer is passed.
dateparser_settings (Dict, optional (default={'TIMEZONE': 'GMT+8'})) – default dateparser settings, check supported settings at https://dateparser.readthedocs.io/en/latest/
- Returns
result
- Return type
{‘normalize’, ‘date’, ‘money’}
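Three of the simpler rules above (normalize_number, normalize_ic and normalize_url) can be reproduced with plain string operations, which may help when reasoning about what the flags do. This is a sketch that only mirrors the documented input and output pairs; the helper names `spell_digits`, `spell_ic` and `spell_url` are hypothetical and are not part of Malaya.

```python
# Malay digit words, taken from the examples in the parameter list above.
DIGITS = ['kosong', 'satu', 'dua', 'tiga', 'empat',
          'lima', 'enam', 'tujuh', 'lapan', 'sembilan']


def spell_digits(number: str) -> str:
    """Spell a digit string one digit at a time, e.g. 0123."""
    return ' '.join(DIGITS[int(ch)] for ch in number)


def spell_ic(ic: str) -> str:
    """Spell an IC number, reading each hyphen as 'sempang'."""
    return ' sempang '.join(spell_digits(part) for part in ic.split('-'))


def spell_url(url: str) -> str:
    """Separate the scheme and read each '.' as 'dot'."""
    # A space is used for '://' so the output matches the documented example.
    return url.replace('://', ' ').replace('.', ' dot ')
```

The real Normalizer applies many more rules and detects these patterns inside running text; the sketch only covers already-isolated tokens.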
-
malaya.qa.extractive#
-
malaya.qa.extractive.
huggingface
(model='mesolitica/finetune-qa-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]# Load a HuggingFace model for extractive question answering.
- Parameters
model (str, optional (default='mesolitica/finetune-qa-t5-small-standard-bahasa-cased')) – Check available models at malaya.qa.extractive.available_huggingface.
force_check (bool, optional (default=True)) – Check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.
- Returns
result
- Return type
malaya.similarity.doc2vec#
-
malaya.similarity.doc2vec.
vectorizer
(v)[source]# Doc2vec interface for text similarity using Encoder model.
- Parameters
v (object) – encoder interface object, BERT, XLNET. should have vectorize method.
- Returns
result
- Return type
-
class
malaya.similarity.doc2vec.
VectorizerSimilarity
[source]# -
predict_proba
(left_strings, right_strings, similarity='cosine')[source]# Calculate similarity between two batches of texts.
- Parameters
left_strings (list of str) –
right_strings (list of str) –
similarity (str, optional (default='cosine')) –
similarity supported. Allowed values:
'cosine' - cosine similarity.
'euclidean' - euclidean similarity.
'manhattan' - manhattan similarity.
- Returns
result
- Return type
List[float]
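The three similarity options can be illustrated on plain Python lists. This is a sketch, not the class's implementation: the function names are hypothetical, and turning a distance into a similarity via 1 / (1 + distance) is an assumption made here so that identical vectors score 1.0.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def euclidean_similarity(u, v):
    """Assumed mapping: 1 / (1 + Euclidean distance)."""
    return 1.0 / (1.0 + math.dist(u, v))


def manhattan_similarity(u, v):
    """Assumed mapping: 1 / (1 + Manhattan distance)."""
    return 1.0 / (1.0 + sum(abs(a - b) for a, b in zip(u, v)))
```

In practice the vectors would come from the encoder's vectorize method, one per input string, and the two batches would be compared pairwise.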
-
heatmap
(strings, similarity='cosine', visualize=True, annotate=True, figsize=(7, 7))[source]# plot a heatmap based on output from bert similarity.
- Parameters
strings (list of str) – list of strings.
similarity (str, optional (default='cosine')) –
similarity supported. Allowed values:
'cosine' - cosine similarity.
'euclidean' - euclidean similarity.
'manhattan' - manhattan similarity.
visualize (bool) – if True, it will render plt.show, else return data.
figsize (tuple, (default=(7, 7))) – figure size for plot.
- Returns
result – list of results
- Return type
list
-
malaya.similarity.semantic#
malaya.spelling_correction.jamspell#
-
malaya.spelling_correction.jamspell.
load
(model='wiki', **kwargs)[source]# Load a jamspell Spell Corrector for Malay.
- Parameters
model (str, optional (default='wiki')) –
Supported models. Allowed values:
'wiki+news' - Wikipedia + News, 337MB.
'wiki' - Wikipedia, 148MB.
'news' - local news, 215MB.
- Returns
result
- Return type
malaya.spelling_correction.jamspell.JamSpell class
-
class
malaya.spelling_correction.jamspell.
JamSpell
[source]# -
correct
(word, string, index=- 1)[source]# Correct a word within a text, returning the corrected word.
- Parameters
word (str) –
string (List[str]) – Tokenized string, word must be a word inside string.
index (int, optional(default=-1)) – index of word in the string, if -1, will try to use string.index(word).
- Returns
result
- Return type
str
-
correct_word
(word, string, index=- 1)[source]# Spell-correct word, and preserve proper upper, lower, title case.
-
correct_match
(match, string, index=- 1)[source]# Spell-correct word in re.match, and preserve proper upper, lower, title case.
-
correct_text
(text)[source]# Correct all the words within a text, returning the corrected text.
- Parameters
text (str) –
- Returns
result
- Return type
str
-
edit_candidates
(word, string, index=- 1)[source]# Generate candidates given a word.
- Parameters
word (str) –
string (str) – Entire string, word must be a word inside string.
index (int, optional(default=-1)) – index of word in the string, if -1, will try to use string.index(word).
- Returns
result
- Return type
List[str]
-
malaya.spelling_correction.probability#
-
class
malaya.spelling_correction.probability.
Probability
(corpus, sp_tokenizer=None, stemmer=None, **kwargs)[source]# The SpellCorrector extends the functionality of Peter Norvig's spell corrector, http://norvig.com/spell-correct.html, and improves it using algorithms from Normalization of noisy texts in Malaysian online reviews, https://www.researchgate.net/publication/287050449_Normalization_of_noisy_texts_in_Malaysian_online_reviews, with added custom vowel augmentation.
-
class
malaya.spelling_correction.probability.
ProbabilityLM
(language_model, corpus, sp_tokenizer=None, stemmer=None, **kwargs)[source]# The SpellCorrector extends the functionality of Peter Norvig's spell corrector, http://norvig.com/spell-correct.html, with a language model, and improves it using algorithms from Normalization of noisy texts in Malaysian online reviews, https://www.researchgate.net/publication/287050449_Normalization_of_noisy_texts_in_Malaysian_online_reviews, with added custom vowel augmentation.
-
correct
(word, string, index=- 1, lookback=3, lookforward=3, **kwargs)[source]# Correct a word within a text, returning the corrected word.
- Parameters
word (str) –
string (List[str]) – Entire string, word must be a word inside string.
index (int, optional (default=-1)) – index of word in the string, if -1, will try to use string.index(word).
lookback (int, optional (default=3)) – N words on the left hand side. If -1, takes all words on the left hand side. A longer left hand side takes longer to compute.
lookforward (int, optional (default=3)) – N words on the right hand side. If -1, takes all words on the right hand side. A longer right hand side takes longer to compute.
- Returns
result
- Return type
str
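The interaction between index, lookback and lookforward is easiest to see as list slicing. The helper below is a hypothetical sketch of how the context around a word could be selected under the documented rules (-1 for index falls back to string.index(word), -1 for a window takes everything on that side); it does not call the language model and is not Malaya's code.

```python
def context_window(word, string, index=-1, lookback=3, lookforward=3):
    """Return (left_context, right_context) around word in a tokenized string."""
    if index == -1:
        # Documented fallback: locate the first occurrence of the word.
        index = string.index(word)
    if lookback == -1:
        left = string[:index]
    else:
        left = string[max(0, index - lookback):index]
    if lookforward == -1:
        right = string[index + 1:]
    else:
        right = string[index + 1:index + 1 + lookforward]
    return left, right
```

Because the language model scores every candidate together with its context, wider windows mean longer sequences to score, which is the cost the parameter descriptions warn about.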
-
correct_text
(text, lookback=3, lookforward=3)[source]# Correct all the words within a text, returning the corrected text.
- Parameters
text (str) –
lookback (int, optional (default=3)) – N words on the left hand side. If -1, takes all words on the left hand side. A longer left hand side takes longer to compute.
lookforward (int, optional (default=3)) – N words on the right hand side. If -1, takes all words on the right hand side. A longer right hand side takes longer to compute.
- Returns
result
- Return type
str
-
correct_word
(word, string, index=- 1, lookback=3, lookforward=3)[source]# Spell-correct word, and preserve proper upper, lower and title case.
- Parameters
word (str) –
string (List[str]) – Tokenized string, word must be a word inside string.
index (int, optional(default=-1)) – index of word in the string, if -1, will try to use string.index(word).
lookback (int, optional (default=3)) – N words on the left hand side. If -1, takes all words on the left hand side. A longer left hand side takes longer to compute.
lookforward (int, optional (default=3)) – N words on the right hand side. If -1, takes all words on the right hand side. A longer right hand side takes longer to compute.
- Returns
result
- Return type
str
-
-
malaya.spelling_correction.probability.
load
(language_model=None, sentence_piece=False, stemmer=None, additional_words={'la': 100000, 'ni': 100000, 'pun': 100000}, **kwargs)[source]# Load a Probability Spell Corrector.
- Parameters
language_model (Callable, optional (default=None)) – If not None, must be an object with a score method.
sentence_piece (bool, optional (default=False)) – if True, reduce possible augmentation states using sentence piece.
stemmer (Callable, optional (default=None)) – a Callable object, must have stem_word method.
additional_words (Dict[str, int], (default={'ni': 100000, 'pun': 100000, 'la': 100000})) – additional bias vocab.
- Returns
result – One of the following model classes:
if language_model is passed, returns malaya.spelling_correction.probability.ProbabilityLM.
otherwise, returns malaya.spelling_correction.probability.Probability.
- Return type
model
-
class
malaya.spelling_correction.probability.
Spell
[source]# -
-
edit_candidates
(word)[source]# Generate candidates given a word.
- Parameters
word (str) –
- Returns
result
- Return type
List[str]
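Candidate generation in this family of correctors builds on the edit-distance-1 step from Peter Norvig's essay cited in the class descriptions. The sketch below reproduces only that published step for lowercase ASCII letters; the name `edits1` follows the essay, and the additional rules Malaya layers on top (such as vowel augmentation) are not shown.

```python
import string


def edits1(word, letters=string.ascii_lowercase):
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)
```

A corrector then keeps only the candidates that appear in its vocabulary and ranks them, for example by corpus frequency or by a language model score.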
-
correct_text
(text)[source]# Correct all the words within a text, returning the corrected text.
- Parameters
text (str) –
- Returns
result
- Return type
str
-
-
malaya.spelling_correction.spylls#
-
malaya.spelling_correction.spylls.
load
(model='libreoffice-pejam', **kwargs)[source]# Load a spylls Spell Corrector for Malay.
- Parameters
model (str, optional (default='libreoffice-pejam')) –
Model spelling correction supported. Allowed values:
'libreoffice-pejam'
- from LibreOffice pEJAm, https://extensions.libreoffice.org/en/extensions/show/3868
- Returns
result
- Return type
malaya.spelling_correction.spylls.Spylls class
malaya.spelling_correction.symspell#
-
class
malaya.spelling_correction.symspell.
Symspell
(model, verbosity, corpus, k=10)[source]# The SymspellCorrector extends the functionality of symspellpy, https://github.com/mammothb/symspellpy, and improves it using algorithms from Normalization of noisy texts in Malaysian online reviews, https://www.researchgate.net/publication/287050449_Normalization_of_noisy_texts_in_Malaysian_online_reviews, with added custom vowel augmentation.
-
edit_step
(word)[source]# Generate candidates given a word.
- Parameters
word (str) –
- Returns
result
- Return type
{candidate1, candidate2}
-
-
malaya.spelling_correction.symspell.
load
(max_edit_distance_dictionary=2, prefix_length=7, term_index=0, count_index=1, top_k=10, **kwargs)[source]# Load a symspell Spell Corrector for Malay.
- Returns
result
- Return type
malaya.spelling_correction.symspell.Symspell class
-
malaya.summarization.abstractive#
-
malaya.summarization.abstractive.
huggingface
(model='mesolitica/finetune-summarization-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]# Load a HuggingFace model for abstractive summarization.
- Parameters
model (str, optional (default='mesolitica/finetune-summarization-t5-small-standard-bahasa-cased')) – Check available models at malaya.summarization.abstractive.available_huggingface.
force_check (bool, optional (default=True)) – Check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.
- Returns
result
- Return type
malaya.summarization.extractive#
-
malaya.summarization.extractive.
encoder
(vectorizer)[source]# Encoder interface for summarization.
- Parameters
vectorizer (object) – encoder interface object, eg, BERT, XLNET, ALBERT, ALXLNET. should have vectorize method.
- Returns
result
- Return type
-
malaya.summarization.extractive.
sklearn
(model, vectorizer)[source]# sklearn interface for summarization.
- Parameters
model (object) –
Should have fit_transform method. Commonly:
sklearn.decomposition.TruncatedSVD - LSA algorithm.
sklearn.decomposition.LatentDirichletAllocation - LDA algorithm.
vectorizer (object) –
Should have fit_transform method. Commonly:
sklearn.feature_extraction.text.TfidfVectorizer - TFIDF algorithm.
sklearn.feature_extraction.text.CountVectorizer - Bag-of-Word algorithm.
malaya.text.vectorizer.SkipGramCountVectorizer - Skip Gram Bag-of-Word algorithm.
malaya.text.vectorizer.SkipGramTfidfVectorizer - Skip Gram TFIDF algorithm.
- Returns
result
- Return type
malaya.topic_model.decomposition#
-
malaya.topic_model.decomposition.
fit
(corpus, model, vectorizer, n_topics, cleaning=<function simple_textcleaning>, stopwords=<function get_stopwords>, **kwargs)[source]# Train a SKlearn model to do topic modelling based on corpus given.
- Parameters
corpus (list) –
model (object) –
Should have fit_transform method. Commonly:
sklearn.decomposition.TruncatedSVD - LSA algorithm.
sklearn.decomposition.LatentDirichletAllocation - LDA algorithm.
sklearn.decomposition.NMF - NMF algorithm.
vectorizer (object) –
Should have fit_transform method. Commonly:
sklearn.feature_extraction.text.TfidfVectorizer - TFIDF algorithm.
sklearn.feature_extraction.text.CountVectorizer - Bag-of-Word algorithm.
malaya.text.vectorizer.SkipGramCountVectorizer - Skip Gram Bag-of-Word algorithm.
malaya.text.vectorizer.SkipGramTfidfVectorizer - Skip Gram TFIDF algorithm.
n_topics (int, (default=10)) – size of decomposition column.
cleaning (function, (default=malaya.text.function.simple_textcleaning)) – function to clean the corpus.
stopwords (List[str], (default=malaya.text.function.get_stopwords)) – a callable that returns a List[str], or a List[str] or Tuple[str] passed directly.
- Returns
result
- Return type
malaya.topic_model.decomposition.Topic class
-
class
malaya.topic_model.decomposition.
Topic
[source]# -
visualize_topics
(notebook_mode=False, mds='pcoa')[source]# Print important topics based on decomposition.
- Parameters
mds (str, optional (default='pcoa')) –
2D Decomposition. Allowed values:
'pcoa' - Dimension reduction via Jensen-Shannon Divergence & Principal Coordinate Analysis (aka Classical Multidimensional Scaling)
'mmds' - Dimension reduction via Multidimensional scaling
'tsne' - Dimension reduction via t-distributed stochastic neighbor embedding
-
top_topics
(len_topic, top_n=10, return_df=True)[source]# Print important topics based on decomposition.
- Parameters
len_topic (int) – size of topics.
top_n (int, optional (default=10)) – top n of each topic.
return_df (bool, optional (default=True)) – return as pandas.DataFrame, else JSON.
-
malaya.topic_model.transformer#
-
class
malaya.topic_model.transformer.
AttentionTopic
[source]#
malaya.zero_shot.classification#
-
malaya.zero_shot.classification.
huggingface
(model='mesolitica/finetune-mnli-nanot5-small', force_check=True, **kwargs)[source]# Load a HuggingFace model for zero-shot text classification.
- Parameters
model (str, optional (default='mesolitica/finetune-mnli-nanot5-small')) – Check available models at malaya.zero_shot.classification.available_huggingface.
force_check (bool, optional (default=True)) – Check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.
- Returns
result
- Return type
malaya.cluster#
-
malaya.cluster.
cluster_words
(list_words, lowercase=False)[source]# Cluster similar words based on structure, e.g. ['mahathir mohamad', 'mahathir'] -> ['mahathir mohamad']. Time complexity is O(n^2).
- Parameters
list_words (List[str]) –
lowercase (bool, optional (default=False)) – if True, will group using lowercase but maintain the original form.
- Returns
string
- Return type
List[str]
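A minimal sketch of the idea, assuming "similar based on structure" means that a shorter phrase is absorbed by a longer phrase containing all of its tokens. The helper name cluster_words_sketch is hypothetical and this is not necessarily how malaya implements it; the nested loop only illustrates where the n^2 cost comes from.

```python
from typing import List


def cluster_words_sketch(list_words: List[str], lowercase: bool = False) -> List[str]:
    """Keep only phrases whose tokens are not all contained in a longer phrase."""
    kept: List[str] = []
    for i, word in enumerate(list_words):
        tokens = (word.lower() if lowercase else word).split()
        absorbed = False
        # inner loop over every other phrase: this is the O(n^2) part
        for j, other in enumerate(list_words):
            if i == j:
                continue
            other_tokens = (other.lower() if lowercase else other).split()
            # absorbed when a strictly longer phrase contains every token
            if len(other_tokens) > len(tokens) and all(t in other_tokens for t in tokens):
                absorbed = True
                break
        if not absorbed and word not in kept:
            kept.append(word)  # the original form is preserved
    return kept


print(cluster_words_sketch(['mahathir mohamad', 'mahathir']))
```

With lowercase=True the comparison ignores case, but the surviving phrase is returned in its original form.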
-
malaya.cluster.
cluster_pos
(result)[source]# cluster similar POS.
- Parameters
result (List[Tuple[str, str]]) –
- Returns
result
- Return type
Dict[str, List[str]]
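As an illustration of the input and output shapes, the sketch below groups tagged tokens by tag. Merging adjacent tokens that share a tag into one phrase is an assumption, and cluster_pos_sketch is a hypothetical helper rather than the library function.

```python
from itertools import groupby
from typing import Dict, List, Tuple


def cluster_pos_sketch(result: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    clustered: Dict[str, List[str]] = {}
    # groupby yields runs of adjacent (word, tag) pairs that share the same tag
    for tag, run in groupby(result, key=lambda pair: pair[1]):
        phrase = ' '.join(word for word, _ in run)
        clustered.setdefault(tag, []).append(phrase)
    return clustered


tagged = [('Kuala', 'PROPN'), ('Lumpur', 'PROPN'), ('sangat', 'ADV'), ('sesak', 'ADJ')]
print(cluster_pos_sketch(tagged))
```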
malaya.constituency#
-
malaya.constituency.
huggingface
(model='mesolitica/constituency-parsing-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]# Load HuggingFace model to Constituency parsing.
- Parameters
model (str, optional (default='mesolitica/constituency-parsing-t5-small-standard-bahasa-cased')) – Check available models at malaya.constituency.available_huggingface.
force_check (bool, optional (default=True)) – Force check that the model is one of the malaya models. Set to False if you have your own huggingface model.
- Returns
result
- Return type
malaya.torch_model.huggingface.Constituency
malaya.dependency#
-
malaya.dependency.
dependency_graph
(tagging, indexing)[source]# Return helper object for dependency parser results. Only accept tagging and indexing outputs from dependency models.
-
malaya.dependency.
huggingface
(model='mesolitica/finetune-dependency-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]# Load HuggingFace model to dependency parsing.
- Parameters
model (str, optional (default='mesolitica/finetune-dependency-t5-small-standard-bahasa-cased')) – Check available models at malaya.dependency.available_huggingface().
force_check (bool, optional (default=True)) – Force check that the model is one of the malaya models. Set to False if you have your own huggingface model.
- Returns
result
- Return type
malaya.torch_model.huggingface.Dependency
malaya.embedding#
-
malaya.embedding.
huggingface
(model='mesolitica/embedding-malaysian-mistral-64M-32k', force_check=True, **kwargs)[source]# Load HuggingFace model for embedding task.
- Parameters
model (str, optional (default='mesolitica/embedding-malaysian-mistral-64M-32k')) – Check available models at malaya.embedding.available_huggingface.
force_check (bool, optional (default=True)) – Force check that the model is one of the malaya models. Set to False if you have your own huggingface model.
- Returns
result
- Return type
malaya.emotion#
-
malaya.emotion.
multinomial
(**kwargs)[source]# Load multinomial emotion model.
- Returns
result
- Return type
malaya.model.ml.MulticlassBayes class
-
malaya.emotion.
huggingface
(model='mesolitica/emotion-analysis-nanot5-small-malaysian-cased', force_check=True, **kwargs)[source]# Load HuggingFace model to classify emotion.
- Parameters
model (str, optional (default='mesolitica/emotion-analysis-nanot5-small-malaysian-cased')) – Check available models at malaya.emotion.available_huggingface.
force_check (bool, optional (default=True)) – Force check that the model is one of the malaya models. Set to False if you have your own huggingface model.
- Returns
result
- Return type
malaya.entity#
-
malaya.entity.
huggingface
(model='mesolitica/ner-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]# Load HuggingFace model to Entity Recognition.
- Parameters
model (str, optional (default='mesolitica/ner-t5-small-standard-bahasa-cased')) – Check available models at malaya.entity.available_huggingface.
force_check (bool, optional (default=True)) – Force check that the model is one of the malaya models. Set to False if you have your own huggingface model.
- Returns
result
- Return type
-
malaya.entity.
general_entity
(model=None)[source]# Load Regex based general entities tagging along with another supervised entity tagging model.
- Parameters
model (object) – model must have a predict method. Make sure the predict method returns [(string, label), (string, label)].
- Returns
result
- Return type
malaya.text.entity.EntityRegex class
malaya.jawi#
-
malaya.jawi.
huggingface
(model='mesolitica/jawi-nanot5-small-malaysian-cased', force_check=True, **kwargs)[source]# Load HuggingFace model to translate.
- Parameters
model (str, optional (default='mesolitica/jawi-nanot5-small-malaysian-cased')) – Check available models at malaya.jawi.available_huggingface.
force_check (bool, optional (default=True)) – Force check that the model is one of the malaya models. Set to False if you have your own huggingface model.
- Returns
result
- Return type
malaya.knowledge_graph#
-
malaya.knowledge_graph.
huggingface
(model='mesolitica/finetune-ttkg-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]# Load HuggingFace model to convert text to triplet format knowledge graph.
- Parameters
model (str, optional (default='mesolitica/finetune-ttkg-t5-small-standard-bahasa-cased')) – Check available models at malaya.knowledge_graph.available_huggingface.
force_check (bool, optional (default=True)) – Force check that the model is one of the malaya models. Set to False if you have your own huggingface model.
- Returns
result
- Return type
malaya.torch_model.huggingface.TexttoKG
malaya.language_detection#
-
malaya.language_detection.
fasttext
(model='mesolitica/fasttext-language-detection-v2', quantized=True, **kwargs)[source]# Load Fasttext language detection model.
- Parameters
model (str, optional (default='mesolitica/fasttext-language-detection-v2')) –
quantized (bool, optional (default=True)) – if True, load quantized fasttext model. Else, load original fasttext model.
- Returns
result
- Return type
malaya.model.ml.LanguageDetection class
-
malaya.language_detection.
substring_rules
(model, **kwargs)[source]# Detect EN, MS, MANDARIN and OTHER languages in a string.
EN word detection uses pyenchant from https://pyenchant.github.io/pyenchant/ and the user-supplied language detection model.
MS word detection uses malaya.text.function.is_malay and the user-supplied language detection model.
OTHER word detection uses any language detection classification model, such as malaya.language_detection.fasttext.
- Parameters
model (Callable) – Callable model, must have predict method.
- Returns
result
- Return type
malaya.model.rules.LanguageDict class
malaya.language_model#
-
malaya.language_model.
kenlm
(model='dump-combined', **kwargs)[source]# Load KenLM language model.
- Parameters
model (str, optional (default='dump-combined')) – Check available models at malaya.language_model.available_kenlm.
- Returns
result
- Return type
kenlm.Model class
-
malaya.language_model.
gpt2
(model='mesolitica/gpt2-117m-bahasa-cased', force_check=True, **kwargs)[source]# Load GPT2 language model.
- Parameters
model (str, optional (default='mesolitica/gpt2-117m-bahasa-cased')) – Check available models at malaya.language_model.available_gpt2.
force_check (bool, optional (default=True)) – Force check that the model is one of the malaya models. Set to False if you have your own huggingface model.
- Returns
result
- Return type
malaya.torch_model.gpt2_lm.LM class
-
malaya.language_model.
mlm
(model='mesolitica/malaysian-debertav2-base', force_check=True, **kwargs)[source]# Load Masked language model.
- Parameters
model (str, optional (default='mesolitica/malaysian-debertav2-base')) – Check available models at malaya.language_model.available_mlm.
force_check (bool, optional (default=True)) – Force check that the model is one of the malaya models. Set to False if you have your own huggingface model.
- Returns
result
- Return type
malaya.torch_model.mask_lm.MLMScorer class
malaya.llm#
malaya.nsfw#
malaya.num2word#
-
malaya.num2word.
to_cardinal
(number)[source]# Translate from number input to cardinal text representation
- Parameters
number (real number) –
- Returns
result – cardinal representation
- Return type
str
-
malaya.num2word.
to_ordinal
(number)[source]# Translate from number input to ordinal text representation
- Parameters
number (real number) –
- Returns
result – ordinal representation
- Return type
str
-
malaya.num2word.
to_ordinal_num
(number)[source]# Translate from number input to ordinal numbering text representation
- Parameters
number (int) –
- Returns
result – ordinal numbering representation
- Return type
str
malaya.paraphrase#
-
malaya.paraphrase.
huggingface
(model='mesolitica/finetune-paraphrase-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]# Load HuggingFace model to paraphrase.
- Parameters
model (str, optional (default='mesolitica/finetune-paraphrase-t5-small-standard-bahasa-cased')) – Check available models at malaya.paraphrase.available_huggingface.
force_check (bool, optional (default=True)) – Force check that the model is one of the malaya models. Set to False if you have your own huggingface model.
- Returns
result
- Return type
malaya.pos#
-
malaya.pos.
huggingface
(model='mesolitica/pos-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]# Load HuggingFace model to Part-of-Speech Recognition.
- Parameters
model (str, optional (default='mesolitica/pos-t5-small-standard-bahasa-cased')) – Check available models at malaya.pos.available_huggingface.
force_check (bool, optional (default=True)) – Force check that the model is one of the malaya models. Set to False if you have your own huggingface model.
- Returns
result
- Return type
malaya.preprocessing#
-
malaya.preprocessing.
preprocessing
(normalize=['url', 'email', 'percent', 'money', 'phone', 'user', 'time', 'date', 'number'], annotate=['allcaps', 'elongated', 'repeated', 'emphasis', 'censored', 'hashtag'], lowercase=True, fix_unidecode=True, expand_english_contractions=True, segmenter=None, demoji=None, **kwargs)[source]# Load Preprocessing class.
- Parameters
normalize (List[str], optional (default=['url', 'email', 'percent', 'money', 'phone', 'user', 'time', 'date', 'number'])) – normalizing tokens, can check all supported normalizing at malaya.preprocessing.get_normalize().
annotate (List[str], optional (default=['hashtag', 'allcaps', 'elongated', 'repeated', 'emphasis', 'censored'])) – annotate tokens <open></open>, only accept [‘hashtag’, ‘allcaps’, ‘elongated’, ‘repeated’, ‘emphasis’, ‘censored’].
lowercase (bool, optional (default=True)) –
fix_unidecode (bool, optional (default=True)) – fix unidecode using ftfy.fix_text.
expand_english_contractions (bool, optional (default=True)) – expand english contractions.
segmenter (Callable, optional (default=None)) – function to segment words. If provided, it will expand hashtags, e.g. #mondayblues becomes monday blues.
demoji (object) – demoji object, need to have a method demoji.
- Returns
result
- Return type
malaya.preprocessing.Preprocessing class
-
malaya.preprocessing.
demoji
()[source]# Download latest emoji malay description from https://github.com/huseinzol05/malay-dataset/tree/master/dictionary/emoji
- Returns
result
- Return type
malaya.preprocessing.Demoji class
malaya.segmentation#
-
malaya.segmentation.
huggingface
(model='mesolitica/finetune-segmentation-t5-tiny-standard-bahasa-cased', force_check=True, **kwargs)[source]# Load HuggingFace model to segmentation.
- Parameters
model (str, optional (default='mesolitica/finetune-segmentation-t5-tiny-standard-bahasa-cased')) – Check available models at malaya.segmentation.available_huggingface.
force_check (bool, optional (default=True)) – Force check that the model is one of the malaya models. Set to False if you have your own huggingface model.
- Returns
result
- Return type
malaya.sentiment#
-
malaya.sentiment.
multinomial
(**kwargs)[source]# Load multinomial sentiment model.
- Returns
result
- Return type
malaya.model.ml.Bayes class
-
malaya.sentiment.
huggingface
(model='mesolitica/sentiment-analysis-nanot5-small-malaysian-cased', force_check=True, **kwargs)[source]# Load HuggingFace model to classify sentiment.
- Parameters
model (str, optional (default='mesolitica/sentiment-analysis-nanot5-small-malaysian-cased')) – Check available models at malaya.sentiment.available_huggingface.
force_check (bool, optional (default=True)) – Force check that the model is one of the malaya models. Set to False if you have your own huggingface model.
- Returns
result
- Return type
malaya.stack#
-
malaya.stack.
voting_stack
(models, text)[source]# Stacking for POS, Entities and Dependency models.
- Parameters
models (list) – list of models.
text (str) – string to predict.
- Returns
result
- Return type
list
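A rough sketch of voting over tagging outputs, assuming every model returns one (word, label) pair per token and that the winning label is the plain majority. The function name vote_tags and the sample labels are hypothetical; the library may resolve ties or align tokens differently.

```python
from collections import Counter
from typing import List, Tuple

Tagged = List[Tuple[str, str]]


def vote_tags(predictions: List[Tagged]) -> Tagged:
    voted: Tagged = []
    # zip(*predictions) walks all model outputs token by token
    for token_predictions in zip(*predictions):
        word = token_predictions[0][0]
        labels = [label for _, label in token_predictions]
        # most_common(1) returns the label with the highest count
        voted.append((word, Counter(labels).most_common(1)[0][0]))
    return voted


model_a = [('Husein', 'PERSON'), ('makan', 'O')]
model_b = [('Husein', 'PERSON'), ('makan', 'O')]
model_c = [('Husein', 'O'), ('makan', 'O')]
print(vote_tags([model_a, model_b, model_c]))
```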
-
malaya.stack.
predict_stack
(models, strings, aggregate=<function gmean>, **kwargs)[source]# Stacking for predictive models.
- Parameters
models (List[Callable]) – list of models.
strings (List[str]) –
aggregate (Callable, optional (default=scipy.stats.mstats.gmean)) – Aggregate function.
- Returns
result
- Return type
dict
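The default aggregate is a geometric mean. The sketch below shows, for a single string, how per-model probabilities could be combined that way. It uses a pure-Python geometric mean as a stand-in for scipy.stats.mstats.gmean, and stack_probabilities is a hypothetical helper, not the library function.

```python
import math
from typing import Dict, List


def geometric_mean(values: List[float]) -> float:
    # exp of the mean of logs; every value must be strictly positive
    return math.exp(sum(math.log(v) for v in values) / len(values))


def stack_probabilities(per_model: List[Dict[str, float]]) -> Dict[str, float]:
    labels = per_model[0].keys()
    return {
        label: geometric_mean([probs[label] for probs in per_model])
        for label in labels
    }


outputs = [
    {'positive': 0.9, 'negative': 0.1},  # hypothetical output of model 1
    {'positive': 0.4, 'negative': 0.6},  # hypothetical output of model 2
]
print(stack_probabilities(outputs))
```

A geometric mean penalises disagreement more than an arithmetic mean does: one model giving a low probability pulls the combined score down sharply.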
malaya.stem#
-
malaya.stem.
naive
()[source]# Load stemming model using startswith and endswith naively using regex patterns.
- Returns
result
- Return type
malaya.stem.Naive class
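To make "startswith and endswith using regex patterns" concrete, here is a small sketch of naive affix stripping. The prefix and suffix lists and the min_length guard are assumptions chosen for illustration; they are not the patterns used by malaya.stem.Naive.

```python
import re

# hypothetical affix lists, not the ones shipped with the library
PREFIX = re.compile(r'^(?:meng|mem|men|me|ber|ter|di|pe)')
SUFFIX = re.compile(r'(?:kan|nya|lah|an|i)$')


def naive_stem(word: str, min_length: int = 4) -> str:
    stemmed = word.lower()
    for pattern in (PREFIX, SUFFIX):
        candidate = pattern.sub('', stemmed, count=1)
        # keep the stripped form only if enough of the word is left
        if len(candidate) >= min_length:
            stemmed = candidate
    return stemmed


for w in ['makanan', 'dimakan', 'membaca', 'berjalan']:
    print(w, '->', naive_stem(w))
```

Without the length guard, a word such as jalan would lose its final "an", which is the kind of over-stemming a naive approach is prone to.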
-
malaya.stem.
sastrawi
()[source]# Load stemming model using Sastrawi, which also includes lemmatization.
- Returns
result
- Return type
malaya.stem.Sastrawi class
-
malaya.stem.
huggingface
(model='mesolitica/stem-lstm-512', force_check=True, **kwargs)[source]# Load HuggingFace model for stemming and lemmatization.
- Parameters
model (str, optional (default='mesolitica/stem-lstm-512')) – Check available models at malaya.stem.available_huggingface.
force_check (bool, optional (default=True)) – Force check that the model is one of the malaya models. Set to False if you have your own huggingface model.
- Returns
result
- Return type
malaya.torch_model.rnn.Stem
malaya.syllable#
-
malaya.syllable.
rules
(**kwargs)[source]# Load rules based syllable tokenizer. Originally from https://github.com/fahadh4ilyas/syllable_splitter/blob/master/SyllableSplitter.py
- improved cuaca double vowel ua based on https://en.wikipedia.org/wiki/Comparison_of_Indonesian_and_Standard_Malay#Syllabification
- improved rans double consonant ns based on https://www.semanticscholar.org/paper/Syllabification-algorithm-based-on-syllable-rules-Musa-Kadir/a819f255f066ae0fd7a30b3534de41da37d04ea1
- improved au and ai double vowel.
- Returns
result
- Return type
malaya.syllable.Tokenizer class
-
malaya.syllable.
huggingface
(model='mesolitica/syllable-lstm', force_check=True, **kwargs)[source]# Load HuggingFace model for syllable tokenizer.
- Parameters
model (str, optional (default='mesolitica/syllable-lstm')) – Check available models at malaya.syllable.available_huggingface.
force_check (bool, optional (default=True)) – Force check that the model is one of the malaya models. Set to False if you have your own huggingface model.
- Returns
result
- Return type
malaya.torch_model.rnn.Syllable
malaya.tatabahasa#
-
malaya.tatabahasa.
huggingface
(model='mesolitica/finetune-tatabahasa-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]# Load HuggingFace model to fix kesalahan tatabahasa.
- Parameters
model (str, optional (default='mesolitica/finetune-tatabahasa-t5-small-standard-bahasa-cased')) – Check available models at malaya.tatabahasa.available_huggingface.
force_check (bool, optional (default=True)) – Force check that the model is one of the malaya models. Set to False if you have your own huggingface model.
- Returns
result
- Return type
malaya.tokenizer#
malaya.transformer#
-
malaya.transformer.
huggingface
(model='mesolitica/electra-base-generator-bahasa-cased', **kwargs)[source]# Load transformer model.
- Parameters
model (str, optional (default='mesolitica/electra-base-generator-bahasa-cased')) – Check available models at malaya.transformer.available_huggingface.
force_check (bool, optional (default=True)) – Force check that the model is one of the malaya models. Set to False if you have your own huggingface model.
malaya.translation#
-
malaya.translation.
word
(model='mesolitica/word-en-ms', **kwargs)[source]# Load word dictionary, based on google translate.
- Parameters
model (str, optional (default='mesolitica/word-en-ms')) – Check available models at malaya.translation.available_word.
- Returns
result
- Return type
Dict[str, str]
-
malaya.translation.
huggingface
(model='mesolitica/translation-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]# Load HuggingFace model to translate.
- Parameters
model (str, optional (default='mesolitica/translation-t5-small-standard-bahasa-cased')) – Check available models at malaya.translation.available_huggingface.
force_check (bool, optional (default=True)) – Force check that the model is one of the malaya models. Set to False if you have your own huggingface model.
- Returns
result
- Return type
malaya.true_case#
-
malaya.true_case.
huggingface
(model='mesolitica/finetune-true-case-t5-tiny-standard-bahasa-cased', force_check=True, **kwargs)[source]# Load HuggingFace model to true case.
- Parameters
model (str, optional (default='mesolitica/finetune-true-case-t5-tiny-standard-bahasa-cased')) – Check available models at malaya.true_case.available_huggingface.
force_check (bool, optional (default=True)) – Force check that the model is one of the malaya models. Set to False if you have your own huggingface model.
- Returns
result
- Return type
malaya.word2num#
malaya.wordvector#
malaya.model.extractive_summarization#
-
class
malaya.model.extractive_summarization.
SKLearn
[source]# -
word_level
(corpus, isi_penting=None, window_size=10, important_words=10, **kwargs)[source]# Summarize list of strings / string on word level.
- Parameters
corpus (str / List[str]) –
isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.
window_size (int, (default=10)) – window size for each word.
important_words (int, (default=10)) – number of important words.
- Returns
dict
- Return type
{‘top-words’, ‘cluster-top-words’, ‘score’}
-
sentence_level
(corpus, isi_penting=None, top_k=3, important_words=10, **kwargs)[source]# Summarize list of strings / string on sentence level.
- Parameters
corpus (str / List[str]) –
isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.
top_k (int, (default=3)) – number of summarized strings.
important_words (int, (default=10)) – number of important words.
- Returns
dict
- Return type
{‘summary’, ‘top-words’, ‘cluster-top-words’, ‘score’}
-
-
class
malaya.model.extractive_summarization.
Doc2Vec
[source]# -
word_level
(corpus, isi_penting=None, window_size=10, aggregation=<function mean>, soft=False, **kwargs)[source]# Summarize list of strings / string on word level.
- Parameters
corpus (str / List[str]) –
isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.
window_size (int, (default=10)) – window size for each word.
aggregation (Callable, optional (default=numpy.mean)) – Aggregation method for Doc2Vec.
soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the nearest word by JaroWinkler ratio. If False, it will return an embedding full of zeros.
- Returns
dict
- Return type
{‘score’}
-
sentence_level
(corpus, isi_penting=None, top_k=3, aggregation=<function mean>, soft=False, **kwargs)[source]# Summarize list of strings / string on sentence level.
- Parameters
corpus (str / List[str]) –
isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.
top_k (int, (default=3)) – number of summarized strings.
aggregation (Callable, optional (default=numpy.mean)) – Aggregation method for Doc2Vec.
soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the nearest word by JaroWinkler ratio. If False, it will return an embedding full of zeros.
- Returns
dict
- Return type
{‘summary’, ‘score’}
-
-
class
malaya.model.extractive_summarization.
Encoder
[source]# -
word_level
(corpus, isi_penting=None, window_size=10, important_words=10, batch_size=16, **kwargs)[source]# Summarize list of strings / string on word level.
- Parameters
corpus (str / List[str]) –
isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.
window_size (int, (default=10)) – window size for each word.
important_words (int, (default=10)) – number of important words.
batch_size (int, (default=16)) – for each feed-forward, we only feed N texts per batch. This is to prevent OOM.
- Returns
dict
- Return type
{‘summary’, ‘top-words’, ‘cluster-top-words’, ‘score’}
-
sentence_level
(corpus, isi_penting=None, top_k=3, important_words=10, batch_size=16, **kwargs)[source]# Summarize list of strings / string on sentence level.
- Parameters
corpus (str / List[str]) –
isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.
top_k (int, (default=3)) – number of summarized strings.
important_words (int, (default=10)) – number of important words.
batch_size (int, (default=16)) – for each feed-forward, we only feed N texts per batch. This is to prevent OOM.
- Returns
dict
- Return type
{‘summary’, ‘top-words’, ‘cluster-top-words’, ‘score’}
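The batch_size behaviour described above amounts to feeding the texts in fixed-size chunks. A minimal sketch of such chunking, with a hypothetical helper name batched, could look like this:

```python
from typing import Iterator, List


def batched(texts: List[str], batch_size: int = 16) -> Iterator[List[str]]:
    # step through the list in strides of batch_size; the last chunk may be shorter
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


sentences = ['ayat %d' % i for i in range(5)]
for batch in batched(sentences, batch_size=2):
    print(batch)
```

A smaller batch_size lowers peak memory use at the cost of more forward passes.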
-
malaya.model.ml#
-
class
malaya.model.ml.
MulticlassBayes
[source]#
-
class
malaya.model.ml.
BinaryBayes
[source]#
malaya.model.rules#
-
class
malaya.model.rules.
LanguageDict
[source]# -
predict
(words, acceptable_ms_label=['malay', 'ind'], acceptable_en_label=['eng', 'manglish'], ignore_capital=False, use_is_malay=True, predict_mandarin=False)[source]# Predict [EN, MS, OTHERS, CAPITAL, NOT_LANG] on word level. This method assumes the string is already tokenized.
- Parameters
words (List[str]) –
acceptable_ms_label (List[str], optional (default = ['malay', 'ind'])) – accept labels from language detection model to assume a word is MS.
acceptable_en_label (List[str], optional (default = ['eng', 'manglish'])) – accept labels from language detection model to assume a word is EN.
ignore_capital (bool, optional (default=False)) – if True, will predict language for capital word.
use_is_malay (bool, optional (default=True)) – if True, will predict MS word using malaya.dictionary.is_malay, else use language detection model.
predict_mandarin (bool, optional (default=False)) – if True, will slide the string to match pinyin dict.
- Returns
result
- Return type
List[str]
-
malaya.torch_model.gpt2_lm#
malaya.torch_model.huggingface#
-
class
malaya.torch_model.huggingface.
Generator
[source]# -
generate
(strings, return_generate=False, prefix=None, **kwargs)[source]# Generate texts from the input.
- Parameters
strings (List[str]) –
**kwargs (vector arguments passed to huggingface generate method.) –
Read more at https://huggingface.co/docs/transformers/main_classes/text_generation
If you are using use_ctranslate2, vector arguments are passed to ctranslate2 translate_batch method. Read more at https://opennmt.net/CTranslate2/python/ctranslate2.Translator.html?highlight=translate_batch#ctranslate2.Translator.translate_batch
- Returns
result
- Return type
List[str]
-
-
class
malaya.torch_model.huggingface.
Prefix
[source]# -
generate
(string, **kwargs)[source]# Generate texts from the input.
- Parameters
string (str) –
**kwargs (vector arguments passed to huggingface generate method.) – Read more at https://huggingface.co/docs/transformers/main_classes/text_generation
- Returns
result
- Return type
List[str]
-
-
class
malaya.torch_model.huggingface.
Paraphrase
[source]# -
generate
(strings, postprocess=True, **kwargs)[source]# Generate texts from the input.
- Parameters
strings (List[str]) –
postprocess (bool, optional (default=True)) – If True, will remove the biased generated word Encik.
**kwargs (vector arguments passed to huggingface generate method.) – Read more at https://huggingface.co/docs/transformers/main_classes/text_generation
- Returns
result
- Return type
List[str]
-
-
class
malaya.torch_model.huggingface.
Summarization
[source]# -
generate
(strings, postprocess=True, n=2, threshold=0.1, reject_similarity=0.85, **kwargs)[source]# Generate texts from the input.
- Parameters
strings (List[str]) –
postprocess (bool, optional (default=True)) – If True, will filter generated sentences using ROUGE score and remove biased generated international news publisher.
n (int, optional (default=2)) – N size of rouge to filter
threshold (float, optional (default=0.1)) – minimum threshold for N rouge score to select a sentence.
reject_similarity (float, optional (default=0.85)) – reject similar sentences while maintain position.
**kwargs (vector arguments passed to huggingface generate method.) – Read more at https://huggingface.co/docs/transformers/main_classes/text_generation
- Returns
result
- Return type
List[str]
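The reject_similarity parameter drops sentences that are too close to ones already kept while preserving order. The sketch below illustrates that idea with difflib.SequenceMatcher as a stand-in similarity measure; the measure actually used by the library is not stated here, and reject_similar is a hypothetical helper.

```python
from difflib import SequenceMatcher
from typing import List


def reject_similar(sentences: List[str], reject_similarity: float = 0.85) -> List[str]:
    kept: List[str] = []
    for sentence in sentences:
        # keep a sentence only if it is not too similar to any sentence kept so far
        if all(
            SequenceMatcher(None, sentence, previous).ratio() < reject_similarity
            for previous in kept
        ):
            kept.append(sentence)
    return kept


generated = [
    'Harga minyak naik hari ini.',
    'Harga minyak naik hari ini!',
    'Kerajaan umum bantuan baharu.',
]
print(reject_similar(generated))
```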
-
-
class
malaya.torch_model.huggingface.
ZeroShotClassification
[source]# -
predict_proba
(strings, labels, prefix='ayat ini berkaitan tentang ', multilabel=True)[source]# classify list of strings and return probability.
- Parameters
strings (List[str]) –
labels (List[str]) –
prefix (str, optional (default='ayat ini berkaitan tentang ')) – prefix of labels to zero shot. Playing around with prefix can get better results.
multilabel (bool, optional (default=True)) – if True, the sum of label probabilities can be more than 1.0
- Returns
list
- Return type
List[Dict[str, float]]
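The sketch below illustrates two points from the parameters above: how prefix and labels could be joined into hypothesis sentences, and why multilabel scores need not sum to 1.0. The raw scores are made-up numbers, and the sigmoid versus softmax normalisation is an assumption about typical zero-shot setups, not a description of the library internals.

```python
import math
from typing import Dict, List


def build_hypotheses(labels: List[str], prefix: str = 'ayat ini berkaitan tentang ') -> List[str]:
    # each label becomes one hypothesis sentence
    return [prefix + label for label in labels]


def normalise(scores: Dict[str, float], multilabel: bool = True) -> Dict[str, float]:
    if multilabel:
        # independent sigmoid per label: values need not sum to 1.0
        return {k: 1.0 / (1.0 + math.exp(-v)) for k, v in scores.items()}
    # softmax across labels: values sum to 1.0
    total = sum(math.exp(v) for v in scores.values())
    return {k: math.exp(v) / total for k, v in scores.items()}


print(build_hypotheses(['politik', 'sukan']))
raw = {'politik': 2.0, 'sukan': 1.5}  # hypothetical raw scores
print(normalise(raw, multilabel=True))
print(normalise(raw, multilabel=False))
```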
-
-
class
malaya.torch_model.huggingface.
ExtractiveQA
[source]# -
predict
(paragraph_text, question_texts, validate_answers=True, validate_questions=False, minimum_threshold_question=0.05, **kwargs)[source]# Predict extractive answers from questions given a paragraph.
- Parameters
paragraph_text (str) –
question_texts (List[str]) – List of questions. Results depend heavily on the letter case of the questions.
validate_answers (bool, optional (default=True)) – if True, will check the answer is inside the paragraph.
validate_questions (bool, optional (default=False)) – if True, validate that the question is a subset of the paragraph using sklearn.feature_extraction.text.CountVectorizer. It is only useful if paragraph_text and question_texts are in the same language.
minimum_threshold_question (float, optional (default=0.05)) – minimum score from cosine_similarity, only useful if validate_questions = True.
**kwargs (vector arguments passed to huggingface generate method.) – Read more at https://huggingface.co/docs/transformers/main_classes/text_generation
- Returns
result
- Return type
List[str]
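The validate_questions check compares word counts of the question and the paragraph by cosine similarity. Below is a pure-Python sketch of that comparison, using collections.Counter as a stand-in for CountVectorizer. The tokenisation, the >= comparison and the helper names are assumptions.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    # lowercase word counts, a rough equivalent of a count vectorizer
    return Counter(re.findall(r'\w+', text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def is_question_related(paragraph_text: str, question_text: str,
                        minimum_threshold_question: float = 0.05) -> bool:
    score = cosine(bag_of_words(paragraph_text), bag_of_words(question_text))
    return score >= minimum_threshold_question


paragraph = 'Kuala Lumpur ialah ibu negara Malaysia.'
print(is_question_related(paragraph, 'Apakah ibu negara Malaysia?'))
print(is_question_related(paragraph, 'Who won the match?'))
```

The second question shares no words with the paragraph, which is why this check only helps when both are in the same language.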
-
-
class
malaya.torch_model.huggingface.
Transformer
[source]# -
vectorize
(strings, method='last', method_token='first', t5_head_logits=True, **kwargs)[source]# Vectorize string inputs.
- Parameters
strings (List[str]) –
method (str, optional (default='last')) –
hidden layers supported. Allowed values:
'last' - last layer.
'first' - first layer.
'mean' - average all layers.
This is only applicable for non-T5 models.
method_token (str, optional (default='first')) –
token layers supported. Allowed values:
'last' - last token.
'first' - first token.
'mean' - average all tokens.
Pretrained models are usually trained on the first token for classification tasks. This is only applicable for non-T5 models.
t5_head_logits (bool, optional (default=True)) – if True, will take head logits, else, last token. This is only applicable for T5 models.
- Returns
result
- Return type
np.array
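To show how method and method_token interact, the sketch below pools a toy stack of hidden states, indexed as layer, then token, then dimension. It uses plain lists instead of tensors, and vectorize_sketch and its helpers are hypothetical; the real method operates on model outputs.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mean_vectors(vectors: Sequence[Vector]) -> Vector:
    """Element-wise average of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def mean_layers(layers: Sequence[Sequence[Vector]]) -> List[Vector]:
    """Average every token position across all layers."""
    return [mean_vectors(token_across_layers) for token_across_layers in zip(*layers)]


def pick(items: Sequence, method: str, reduce_mean: Callable):
    if method == 'last':
        return items[-1]
    if method == 'first':
        return items[0]
    if method == 'mean':
        return reduce_mean(items)
    raise ValueError("method must be 'last', 'first' or 'mean'")


def vectorize_sketch(hidden_states, method: str = 'last', method_token: str = 'first') -> Vector:
    layer = pick(hidden_states, method, mean_layers)  # choose or average layers
    return pick(layer, method_token, mean_vectors)    # choose or average tokens


# toy input: 2 layers x 2 tokens x 2 dimensions
hidden_states = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
print(vectorize_sketch(hidden_states))
print(vectorize_sketch(hidden_states, method='mean', method_token='mean'))
```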
-
attention
(strings, method='last', method_head='mean', t5_attention='cross_attentions', **kwargs)[source]# Get attention string inputs.
- Parameters
strings (List[str]) –
method (str, optional (default='last')) –
Attention layer supported. Allowed values:
'last' - attention from last layer.
'first' - attention from first layer.
'mean' - average attentions from all layers.
method_head (str, optional (default='mean')) –
attention head layer supported. Allowed values:
'last' - attention from last layer.
'first' - attention from first layer.
'mean' - average attentions from all layers.
t5_attention (str, optional (default='cross_attentions')) –
attention type for T5 models. Allowed values:
'cross_attentions' - cross attention.
'encoder_attentions' - encoder attention.
'decoder_attentions' - decoder attention.
This is only applicable for T5 models.
- Returns
result
- Return type
List[List[Tuple[str, float]]]
-
-
class
malaya.torch_model.huggingface.
IsiPentingGenerator
[source]# -
generate
(strings, mode='surat-khabar', remove_html_tags=True, **kwargs)[source]# Generate a long text given isi penting (key points).
- Parameters
strings (List[str]) –
mode (str, optional (default='surat-khabar')) –
Mode supported. Allowed values:
'surat-khabar' - news style writing.
'tajuk-surat-khabar' - headline news style writing.
'artikel' - article style writing.
'penerangan-produk' - product description style writing.
'karangan' - karangan sekolah style writing.
remove_html_tags (bool, optional (default=True)) – Will remove html tags using malaya.text.function.remove_html_tags.
**kwargs (vector arguments passed to huggingface generate method.) – Read more at https://huggingface.co/docs/transformers/main_classes/text_generation
- Returns
result
- Return type
List[str]
-
-
class
malaya.torch_model.huggingface.
Tatabahasa
[source]# -
generate
(strings, **kwargs)[source]# Fix kesalahan tatabahasa.
- Parameters
strings (List[str]) –
**kwargs (vector arguments passed to huggingface generate method.) – Read more at https://huggingface.co/docs/transformers/main_classes/text_generation. Fixing kesalahan tatabahasa supports all decoding methods except beam.
- Returns
result
- Return type
List[Tuple[str, int]]
-
-
class
malaya.torch_model.huggingface.
Keyword
[source]# -
generate
(strings, top_keywords=5, **kwargs)[source]# Generate texts from the input.
- Parameters
strings (List[str]) –
top_keywords (int, optional (default=5)) –
**kwargs (vector arguments passed to huggingface generate method.) – Read more at https://huggingface.co/docs/transformers/main_classes/text_generation
- Returns
result
- Return type
List[str]
-
-
class
malaya.torch_model.huggingface.
Translation
[source]# -
generate
(strings, to_lang='ms', **kwargs)[source]# Generate texts from the input.
- Parameters
strings (List[str]) –
to_lang (str, optional (default='ms')) – target language to translate.
**kwargs (vector arguments passed to huggingface generate method.) –
Read more at https://huggingface.co/docs/transformers/main_classes/text_generation
If you are using use_ctranslate2, vector arguments are passed to ctranslate2 translate_batch method. Read more at https://opennmt.net/CTranslate2/python/ctranslate2.Translator.html?highlight=translate_batch#ctranslate2.Translator.translate_batch
- Returns
result
- Return type
List[str]
-
-
class
malaya.torch_model.huggingface.
Classification
[source]#
-
class
malaya.torch_model.huggingface.
Tagging
[source]#