API
API#
malaya#
malaya.alignment.en_ms#
-
malaya.alignment.en_ms.
eflomal
(preprocessing_func=None, **kwargs)[source]# Load eflomal word alignment for EN-MS. Model size is around 300MB.
- Parameters
preprocessing_func (Callable, optional (default=None)) – preprocessing function to call while loading the prior file. Using malaya.text.function.replace_punct can reduce memory usage by ~30%.
- Returns
result
- Return type
malaya.alignment.ms_en#
-
malaya.alignment.ms_en.
eflomal
(preprocessing_func=None, **kwargs)[source]# Load eflomal word alignment for MS-EN. Model size is around 300MB.
- Parameters
preprocessing_func (Callable, optional (default=None)) – preprocessing function to call while loading the prior file. Using malaya.text.function.replace_punct can reduce memory usage by ~30%.
- Returns
result
- Return type
-
malaya.alignment.ms_en.
huggingface
(model='mesolitica/finetuned-bert-base-multilingual-cased-noisy-en-ms', force_check=True, **kwargs)[source]# Load a HuggingFace BERT model for MS-EN word alignment. Requires TensorFlow >= 2.0.
- Parameters
model (str, optional (default='mesolitica/finetuned-bert-base-multilingual-cased-noisy-en-ms')) – Check available models at malaya.alignment.ms_en.available_huggingface().
force_check (bool, optional (default=True)) – Force a check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.
- Returns
result
- Return type
malaya.augmentation.abstractive#
-
malaya.augmentation.abstractive.
huggingface
(model='mesolitica/finetune-noisy-translation-t5-small-bahasa-cased-v3', lang='ms', force_check=True, **kwargs)[source]# Load a HuggingFace model for abstractive text augmentation.
- Parameters
model (str, optional (default='mesolitica/finetune-noisy-translation-t5-small-bahasa-cased-v3')) – Check available models at malaya.augmentation.abstractive.available_huggingface().
lang (str, optional (default='ms')) – Input language, accepts only ms or en.
force_check (bool, optional (default=True)) – Force a check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.
- Returns
result
- Return type
malaya.augmentation.encoder#
-
malaya.augmentation.encoder.
wordvector
(string, wordvector, threshold=0.5, top_n=5, soft=False)[source]# Augment a string using word vectors.
- Parameters
string (str) – the input string is assumed to be properly tokenized and cleaned.
wordvector (object) – wordvector interface object.
threshold (float, optional (default=0.5)) – random selection for a word.
soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the word that has the nearest Jaro-Winkler ratio. If False, an exception is thrown when a word is not in the dictionary.
top_n (int, (default=5)) – number of nearest neighbors returned. The length of the returned result should equal top_n.
- Returns
result
- Return type
List[str]
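The threshold and top_n parameters above can be illustrated with a minimal sketch. It assumes each word is replaced with probability threshold by one of its nearest neighbours; the neighbours table and the augment helper are hypothetical stand-ins, not Malaya's implementation or its wordvector interface.

```python
import random


def augment(string, neighbours, threshold=0.5, top_n=5, seed=0):
    """Return top_n augmented variants of a tokenized string.

    neighbours is a hypothetical dict mapping a word to a ranked list of
    similar words; a real word-vector object would supply these.
    """
    rng = random.Random(seed)
    tokens = string.split()
    results = []
    for i in range(top_n):
        out = []
        for token in tokens:
            candidates = neighbours.get(token, [])
            # Assumption: a word is selected for replacement when a uniform
            # draw falls below the threshold.
            if candidates and rng.random() < threshold:
                out.append(candidates[i % len(candidates)])
            else:
                out.append(token)
        results.append(' '.join(out))
    return results


neighbours = {'saya': ['aku', 'kami'], 'makan': ['minum', 'masak']}
variants = augment('saya suka makan ayam', neighbours, threshold=0.5, top_n=5)
```

With threshold=0.0 no word is ever selected, so every variant equals the input string.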
-
malaya.augmentation.encoder.
transformer
(string, model, threshold=0.5, top_p=0.9, top_k=100, temperature=1.0, top_n=5)[source]# augmenting a string using transformer + nucleus sampling / top-k sampling.
- Parameters
string (str) – the input string is assumed to be properly tokenized and cleaned.
model (object) – transformer interface object. Currently only BERT, ALBERT and ELECTRA are supported.
threshold (float, optional (default=0.5)) – random selection for a word.
top_p (float, optional (default=0.9)) – cumulative sum of probabilities to sample a word. If top_p is bigger than 0, the model will use nucleus sampling, else top-k sampling.
top_k (int, optional (default=100)) – k for top-k sampling.
temperature (float, optional (default=1.0)) – logits * temperature.
top_n (int, (default=5)) – number of nearest neighbors returned. The length of the returned result should equal top_n.
- Returns
result
- Return type
List[str]
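A minimal sketch of top-k and nucleus (top-p) filtering over a toy distribution, to show how top_k, top_p and temperature interact. The helper names are hypothetical and this is not Malaya's implementation; the logits are multiplied by the temperature because the parameter description above says "logits * temperature".

```python
import math
import random


def softmax(logits, temperature=1.0):
    # Scale by multiplication, following the parameter description above.
    scaled = [x * temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable indices and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def nucleus_filter(probs, top_p):
    """Keep the smallest prefix of ranked indices whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def sample(filtered, rng):
    indices = list(filtered)
    weights = [filtered[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


probs = softmax([2.0, 1.0, 0.5, -1.0], temperature=1.0)
top2 = top_k_filter(probs, k=2)
nucleus = nucleus_filter(probs, top_p=0.9)
choice = sample(nucleus, random.Random(0))
```

Top-k keeps a fixed number of candidates, while nucleus sampling keeps a variable number that depends on how peaked the distribution is.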
malaya.augmentation.rules#
-
malaya.augmentation.rules.
synonym
(string, threshold=0.5, top_n=5, **kwargs)[source]# augmenting a string using synonym, https://github.com/huseinzol05/Malaya-Dataset#90k-synonym
- Parameters
string (str) – the input string is assumed to be properly tokenized and cleaned.
threshold (float, optional (default=0.5)) – random selection for a word.
top_n (int, (default=5)) – number of nearest neighbors returned. The length of the returned result should equal top_n.
- Returns
result
- Return type
List[str]
-
malaya.augmentation.rules.
replace_similar_consonants
(word, threshold=0.5, replace_consonants={'b': ['n'], 'd': ['s', 'f'], 'f': ['p'], 'g': ['f', 'h'], 'j': ['k'], 'k': ['l'], 'n': ['m'], 'r': ['t', 'q']})[source]# Naively replace consonants with other consonants to simulate typos or slang, when the consonant is followed by a vowel.
- Parameters
word (str) –
threshold (float, optional (default=0.5)) –
- Returns
result
- Return type
List[str]
-
malaya.augmentation.rules.
replace_similar_vowels
(word, threshold=0.5, replace_vowels={'a': ['o'], 'i': ['o'], 'o': ['u'], 'u': ['o']})[source]# Naively replace vowels with other vowels to simulate typos or slang, when the vowel is followed by a consonant.
- Parameters
word (str) –
threshold (float, optional (default=0.5)) –
- Returns
result
- Return type
str
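A minimal sketch of the replacement rule described above, using the default mapping from the documented signature. The function name is hypothetical, and the threshold semantics (replace when a uniform draw falls below threshold) are an assumption, not confirmed behaviour of Malaya.

```python
import random

VOWELS = set('aeiou')
# Default mapping copied from the documented signature.
REPLACE_VOWELS = {'a': ['o'], 'i': ['o'], 'o': ['u'], 'u': ['o']}


def replace_vowels_sketch(word, threshold=0.5, mapping=REPLACE_VOWELS, seed=None):
    rng = random.Random(seed)
    chars = list(word)
    # The last character is never replaced because nothing follows it.
    for i in range(len(chars) - 1):
        current, following = chars[i], chars[i + 1]
        followed_by_consonant = following.isalpha() and following not in VOWELS
        # Assumption: replace when a uniform draw falls below the threshold.
        if current in mapping and followed_by_consonant and rng.random() < threshold:
            chars[i] = rng.choice(mapping[current])
    return ''.join(chars)


always = replace_vowels_sketch('makan', threshold=1.0, seed=0)
never = replace_vowels_sketch('makan', threshold=0.0, seed=0)
```

With threshold=1.0 every eligible vowel is replaced, and with threshold=0.0 the word is returned unchanged.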
Augment a word into social media form.
- Parameters
word (str) –
- Returns
result
- Return type
List[str]
-
malaya.augmentation.rules.
vowel_alternate
(word, threshold=0.5)[source]# augmenting a word into vowel alternate.
vowel_alternate('singapore') -> sngpore
vowel_alternate('kampung') -> kmpng
vowel_alternate('ayam') -> aym
- Parameters
word (str) –
threshold (float, optional (default=0.5)) –
- Returns
result
- Return type
str
malaya.dictionary#
-
malaya.dictionary.
keyword_wiktionary
(word, acceptable_lang=['brunei malay', 'malay'])[source]# Crawl https://en.wiktionary.org/wiki/ to check whether a word is a Malay word.
- Parameters
word (str) –
acceptable_lang (List[str], optional (default=['brunei malay', 'malay'])) – acceptable languages in wiktionary section.
- Returns
result
- Return type
Dict
-
malaya.dictionary.
keyword_dbp
(word, parse=False)[source]# Crawl https://prpm.dbp.gov.my/cari1?keyword= to check whether a word is a Malay word.
- Parameters
word (str) –
parse (bool, optional (default=False)) – if True, will parse using BeautifulSoup.
- Returns
result
- Return type
Dict
-
malaya.dictionary.
corpus_dbp
(word)[source]# crawl http://sbmb.dbp.gov.my/korpusdbp/Search2.aspx to search corpus based on a word.
- Parameters
word (str) –
- Returns
result
- Return type
pandas.core.frame.DataFrame
-
malaya.dictionary.
is_english
(word)[source]# Check whether a word is an English word.
- Parameters
word (str) –
- Returns
result
- Return type
bool
-
malaya.dictionary.
is_malay
(word, stemmer=None)[source]# Check whether a word is a Malay word.
- Parameters
word (str) –
stemmer (Callable, optional (default=None)) – a Callable object, must have stem_word method.
- Returns
result
- Return type
bool
-
malaya.dictionary.
convert_pinyin
(string)[source]# Convert Mandarin characters to pinyin form. Original vocab from https://github.com/lxyu/pinyin. Example: 你好 -> ni hao
- Parameters
string (str) –
- Returns
result
- Return type
str
malaya.generator.isi_penting#
-
malaya.generator.isi_penting.
transformer
(model='t5', quantized=False, **kwargs)[source]# Load a Transformer model to generate a string given isi penting.
- Parameters
model (str, optional (default='t5')) – Check available models at malaya.generator.isi_penting.available_transformer().
quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result
- Return type
malaya.model.t5.Generator class
-
malaya.generator.isi_penting.
huggingface
(model='mesolitica/finetune-isi-penting-generator-t5-base-standard-bahasa-cased', force_check=True, **kwargs)[source]# Load a HuggingFace model to generate text based on isi penting.
- Parameters
model (str, optional (default='mesolitica/finetune-isi-penting-generator-t5-base-standard-bahasa-cased')) – Check available models at malaya.generator.isi_penting.available_huggingface().
force_check (bool, optional (default=True)) – Force a check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.
- Returns
result
- Return type
malaya.generator.prefix#
-
malaya.generator.prefix.
babble_tf
(string, model, generate_length=30, leed_out_len=1, temperature=1.0, top_k=100, burnin=15, batch_size=5)[source]# Use pretrained malaya transformer models to generate a string given a prefix string. https://github.com/nyu-dl/bert-gen, https://arxiv.org/abs/1902.04094
- Parameters
string (str) –
model (object) – transformer interface object. Currently only BERT, ALBERT and ELECTRA are supported.
generate_length (int, optional (default=30)) – length of sentence to generate.
leed_out_len (int, optional (default=1)) – length of extra masks for each iteration.
temperature (float, optional (default=1.0)) – logits * temperature.
top_k (int, optional (default=100)) – k for top-k sampling.
burnin (int, optional (default=15)) – for the first burnin steps, sample from the entire next word distribution, instead of top_k.
batch_size (int, optional (default=5)) – generate sentences size of batch_size.
- Returns
result
- Return type
List[str]
-
malaya.generator.prefix.
transformer
(model='345M', quantized=False, **kwargs)[source]# Load GPT2 model to generate a string given a prefix string.
- Parameters
model (str, optional (default='345M')) – Check available models at malaya.generator.prefix.available_transformer().
quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result
- Return type
malaya.model.tf.GPT2 class
-
malaya.generator.prefix.
huggingface
(model='mesolitica/gpt2-117m-bahasa-cased-v2', force_check=True, **kwargs)[source]# Load Prefix language model.
- Parameters
model (str, optional (default='mesolitica/gpt2-117m-bahasa-cased-v2')) – Check available models at malaya.generator.prefix.available_huggingface().
force_check (bool, optional (default=True)) – Force a check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.
- Returns
result
- Return type
malaya.torch_model.huggingface.Prefix class
malaya.keyword.abstractive#
-
malaya.keyword.abstractive.
huggingface
(model='mesolitica/finetune-keyword-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]# Load a HuggingFace model to generate abstractive keywords.
- Parameters
model (str, optional (default='mesolitica/finetune-keyword-t5-small-standard-bahasa-cased')) – Check available models at malaya.keyword.abstractive.available_huggingface().
force_check (bool, optional (default=True)) – Force a check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.
- Returns
result
- Return type
malaya.keyword.extractive#
-
malaya.keyword.extractive.
rake
(string, vocab=None, model=None, vectorizer=None, top_k=5, atleast=1, stopwords=<function get_stopwords>, **kwargs)[source]# Extract keywords using Rake algorithm.
- Parameters
string (str) –
vocab (List[str], optional (default=None)) – List of important substrings. This will override vectorizer parameter.
model (Object, optional (default=None)) – model must have an attention method.
vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or, malaya.text.vectorizer.SkipGramCountVectorizer. If None, will generate ngram automatically based on stopwords.
top_k (int, optional (default=5)) – return top-k results.
atleast (int, optional (default=1)) – at least count appeared in the string to accept as candidate.
stopwords (List[str], (default=malaya.text.function.get_stopwords)) – A callable that returns a List[str], or a List[str], or a Tuple[str]. Used for the automatic n-gram generator.
- Returns
result
- Return type
Tuple[float, str]
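A minimal sketch of the classic RAKE scoring scheme (word score = degree / frequency, phrase score = sum of word scores), returning (score, phrase) pairs like the documented return type. The function name, the toy stopword set and the tokenizer are assumptions for illustration; this is not Malaya's implementation and it ignores the vocab, model and vectorizer parameters.

```python
import re
from collections import defaultdict


def rake_sketch(string, stopwords, top_k=5):
    """Score candidate phrases with the RAKE degree/frequency ratio."""
    words = re.findall(r"[a-zA-Z]+|[^\sa-zA-Z]", string.lower())
    # Candidate phrases are maximal runs of words between stopwords
    # or punctuation.
    phrases, current = [], []
    for word in words:
        if word in stopwords or not word.isalpha():
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(word)
    if current:
        phrases.append(current)

    frequency = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            # Degree counts co-occurrences inside the phrase,
            # including the word itself.
            degree[word] += len(phrase)

    scored = {}
    for phrase in phrases:
        scored[' '.join(phrase)] = sum(degree[w] / frequency[w] for w in phrase)
    ranked = sorted(((s, p) for p, s in scored.items()), reverse=True)
    return ranked[:top_k]


stopwords = {'dan', 'yang', 'di', 'untuk'}
text = 'kerajaan negeri dan kerajaan pusat membincangkan bajet negeri untuk rakyat'
keywords = rake_sketch(text, stopwords, top_k=3)
```

Longer phrases tend to score higher under this scheme because each word's degree grows with the length of the phrases it appears in.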
-
malaya.keyword.extractive.
textrank
(string, vocab=None, model=None, vectorizer=None, top_k=5, atleast=1, stopwords=<function get_stopwords>, **kwargs)[source]# Extract keywords using Textrank algorithm.
- Parameters
string (str) –
vocab (List[str], optional (default=None)) – List of important substrings. This will override vectorizer parameter.
model (Object, optional (default=None)) – model must have a fit_transform or vectorize method.
vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or, malaya.text.vectorizer.SkipGramCountVectorizer. If None, will generate ngram automatically based on stopwords.
top_k (int, optional (default=5)) – return top-k results.
atleast (int, optional (default=1)) – at least count appeared in the string to accept as candidate.
stopwords (List[str], (default=malaya.text.function.get_stopwords)) – A callable that returns a List[str], or a List[str], or a Tuple[str].
- Returns
result
- Return type
Tuple[float, str]
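A minimal sketch of the TextRank idea: candidates become graph nodes, edge weights are similarities between candidate vectors, and a PageRank-style power iteration ranks the nodes. The function name, the toy two-dimensional vectors and the damping and iteration defaults are assumptions; in practice the vectors would come from a model with a fit_transform or vectorize method, and this is not Malaya's implementation.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def textrank_sketch(candidates, vectors, top_k=5, damping=0.85, iterations=50):
    n = len(candidates)
    # Edge weight between two candidates is their cosine similarity;
    # self-loops are excluded.
    weights = [
        [cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Each node receives score from its neighbours in proportion
        # to the edge weight.
        scores = [
            (1 - damping) / n + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return sorted(zip(scores, candidates), reverse=True)[:top_k]


candidates = ['bajet negara', 'belanjawan', 'bola sepak']
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]  # toy embeddings
ranked = textrank_sketch(candidates, vectors, top_k=3)
```

Candidates that are similar to many other candidates accumulate more score, so the outlier in the toy data ends up last.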
-
malaya.keyword.extractive.
attention
(string, model, vocab=None, vectorizer=None, top_k=5, atleast=1, stopwords=<function get_stopwords>, **kwargs)[source]# Extract keywords using Attention mechanism.
- Parameters
string (str) –
model (Object) – model must have an attention method.
vocab (List[str], optional (default=None)) – List of important substrings. This will override vectorizer parameter.
vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or, malaya.text.vectorizer.SkipGramCountVectorizer. If None, will generate ngram automatically based on stopwords.
top_k (int, optional (default=5)) – return top-k results.
atleast (int, optional (default=1)) – at least count appeared in the string to accept as candidate.
stopwords (List[str], (default=malaya.text.function.get_stopwords)) – A callable that returns a List[str], or a List[str], or a Tuple[str].
- Returns
result
- Return type
Tuple[float, str]
-
malaya.keyword.extractive.
similarity
(string, model, vocab=None, vectorizer=None, top_k=5, atleast=1, stopwords=<function get_stopwords>, **kwargs)[source]# Extract keywords using Sentence embedding VS keyword embedding similarity.
- Parameters
string (str) –
model (Object) – Transformer model or any model that has a vectorize method.
vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or, malaya.text.vectorizer.SkipGramCountVectorizer. If None, will generate ngram automatically based on stopwords.
top_k (int, optional (default=5)) – return top-k results.
atleast (int, optional (default=1)) – at least count appeared in the string to accept as candidate.
stopwords (List[str], (default=malaya.text.function.get_stopwords)) – A callable that returns a List[str], or a List[str], or a Tuple[str].
- Returns
result
- Return type
Tuple[float, str]
-
malaya.keyword.extractive.
available_transformer
()[source]# List available transformer keyword similarity model.
-
malaya.keyword.extractive.
transformer
(model='bert', quantized=False, **kwargs)[source]# Load Transformer keyword similarity model.
- Parameters
model (str, optional (default='bert')) – Check available models at malaya.keyword.extractive.available_transformer().
quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result – List of model classes:
if bert in model, will return malaya.model.bert.KeyphraseBERT.
if xlnet in model, will return malaya.model.xlnet.KeyphraseXLNET.
- Return type
model
malaya.normalizer.abstractive#
-
malaya.normalizer.abstractive.
huggingface
(model='mesolitica/finetune-noisy-translation-t5-small-bahasa-cased-v4', force_check=True, use_rules_normalizer=True, kenlm_model='bahasa-wiki-news', stem_model='noisy', segmenter=None, text_scorer=None, replace_augmentation=True, minlen_speller=2, maxlen_speller=12, **kwargs)[source]# Load a HuggingFace model for abstractive text normalization. The pipeline is text -> rules-based text normalizer -> abstractive model. To skip the rules-based text normalizer, set use_rules_normalizer=False.
- Parameters
model (str, optional (default='mesolitica/finetune-noisy-translation-t5-small-bahasa-cased-v4')) – Check available models at malaya.normalizer.abstractive.available_huggingface().
force_check (bool, optional (default=True)) – Force a check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.
use_rules_normalizer (bool, optional(default=True)) –
kenlm_model (str, optional (default='bahasa-wiki-news')) – the model trained on malaya.language_model.kenlm(model = 'bahasa-wiki-news'), but you can use any kenlm model from malaya.language_model.available_kenlm. You can also pass None to skip spelling correction while still applying the rules normalizer. This parameter will be ignored if use_rules_normalizer=False.
stem_model (str, optional (default='noisy')) – the model trained on malaya.stem.deep_model(model = 'noisy'), but you can use any stemmer model from malaya.stem.available_model. You can also pass None to skip stemming while still applying the rules normalizer. This parameter will be ignored if use_rules_normalizer=False.
segmenter (Callable, optional (default=None)) – segmenter function to segment text, read more at https://malaya.readthedocs.io/en/stable/load-normalizer.html#Use-segmenter. During the training session, we used malaya.segmentation.huggingface(). It is safe to set as None. This parameter will be ignored if use_rules_normalizer=False.
text_scorer (Callable, optional (default=None)) – text scorer to validate upper case. during training session, we use malaya.language_model.kenlm(model = ‘bahasa-wiki-news’). This parameter will be ignored if use_rules_normalizer=False.
replace_augmentation (bool, optional (default=True)) – Use replace norvig augmentation method. Enabling this might generate bigger candidates, hence slower. This parameter will be ignored if use_rules_normalizer=False.
minlen_speller (int, optional (default=2)) – minimum length of word to check spelling correction. This parameter will be ignored if use_rules_normalizer=False.
maxlen_speller (int, optional (default=12)) – max length of word to check spelling correction. This parameter will be ignored if use_rules_normalizer=False.
- Returns
result
- Return type
malaya.normalizer.rules#
-
malaya.normalizer.rules.
load
(speller=None, stemmer=None, **kwargs)[source]# Load a Normalizer using any spelling correction model.
- Parameters
speller (Callable, optional (default=None)) – object to correct spelling, must have a correct or normalize_elongated method.
stemmer (Callable, optional (default=None)) – object to stem, must have a stem_word method. If a stemmer is provided, kata imbuhan akhir will be stemmed more accurately.
- Returns
result
- Return type
malaya.normalizer.rules.Normalizer class
-
class
malaya.normalizer.rules.
Normalizer
[source]# -
normalize
(string, normalize_text=True, normalize_url=False, normalize_email=False, normalize_year=True, normalize_telephone=True, normalize_date=True, normalize_time=True, normalize_emoji=True, normalize_elongated=True, normalize_hingga=True, normalize_pada_hari_bulan=True, normalize_fraction=True, normalize_money=True, normalize_units=True, normalize_percent=True, normalize_ic=True, normalize_number=True, normalize_x_kali=True, normalize_cardinal=True, normalize_ordinal=True, normalize_entity=True, expand_contractions=True, check_english_func=<function is_english>, check_malay_func=<function is_malay>, translator=None, language_detection_word=None, acceptable_language_detection=['EN', 'CAPITAL', 'NOT_LANG'], segmenter=None, text_scorer=None, text_scorer_window=2, not_a_word_threshold=0.0001, dateparser_settings={'TIMEZONE': 'GMT+8'}, **kwargs)[source]# Normalize a string.
- Parameters
string (str) –
normalize_text (bool, optional (default=True)) – if True, will try to replace shortforms with internal corpus.
normalize_url (bool, optional (default=False)) – if True, replace :// with empty and . with dot. https://huseinhouse.com -> https huseinhouse dot com.
normalize_email (bool, optional (default=False)) – if True, replace @ with di, . with dot. husein.zol05@gmail.com -> husein dot zol kosong lima di gmail dot com.
normalize_year (bool, optional (default=True)) – if True, tahun 1987 -> tahun sembilan belas lapan puluh tujuh. if True, 1970-an -> sembilan belas tujuh puluh an. if False, tahun 1987 -> tahun seribu sembilan ratus lapan puluh tujuh.
normalize_telephone (bool, optional (default=True)) – if True, no 012-1234567 -> no kosong satu dua, satu dua tiga empat lima enam tujuh
normalize_date (bool, optional (default=True)) – if True, 01/12/2001 -> satu disember dua ribu satu. if True, Jun 2017 -> satu Jun dua ribu tujuh belas. if True, 2017 Jun -> satu Jun dua ribu tujuh belas. if False, 2017 Jun -> 01/06/2017. if False, Jun 2017 -> 01/06/2017.
normalize_time (bool, optional (default=True)) – if True, pukul 2.30 -> pukul dua tiga puluh minit. if False, pukul 2.30 -> ‘02:00:00’
normalize_emoji (bool, (default=True)) – if True, 🔥 -> emoji api. Loaded from malaya.preprocessing.demoji.
normalize_elongated (bool, optional (default=True)) – if True, betuii -> betui.
normalize_hingga (bool, optional (default=True)) – if True, 2011 - 2019 -> dua ribu sebelas hingga dua ribu sembilan belas
normalize_pada_hari_bulan (bool, optional (default=True)) – if True, pada 10/4 -> pada sepuluh hari bulan empat
normalize_fraction (bool, optional (default=True)) – if True, 10 /4 -> sepuluh per empat
normalize_money (bool, optional (default=True)) – if True, rm10.4m -> sepuluh juta empat ratus ribu ringgit
normalize_units (bool, optional (default=True)) – if True, 61.2 kg -> enam puluh satu perpuluhan dua kilogram
normalize_percent (bool, optional (default=True)) – if True, 0.8% -> kosong perpuluhan lapan peratus
normalize_ic (bool, optional (default=True)) – if True, 911111-01-1111 -> sembilan satu satu satu satu satu sempang kosong satu sempang satu satu satu satu
normalize_number (bool, optional (default=True)) – if True, 0123 -> kosong satu dua tiga
normalize_x_kali (bool, optional (default=True)) – if True, 10x -> sepuluh kali
normalize_cardinal (bool, optional (default=True)) – if True, 123 -> seratus dua puluh tiga
normalize_ordinal (bool, optional (default=True)) – if True, ke-123 -> keseratus dua puluh tiga
normalize_entity (bool, optional (default=True)) – normalize entities; only affects date, datetime, time and money pattern strings.
expand_contractions (bool, optional (default=True)) – expand English contractions.
check_english_func (Callable, optional (default=malaya.text.function.is_english)) – function to check whether a word is in the English dictionary, default is malaya.text.function.is_english. This parameter is also used for Malay text normalization.
check_malay_func (Callable, optional (default=malaya.text.function.is_malay)) – function to check a word in malay dictionary, default is malaya.text.function.is_malay.
translator (Callable, optional (default=None)) – function to translate EN word to MS word.
language_detection_word (Callable, optional (default=None)) – function to detect language for each words to get better translation results.
acceptable_language_detection (List[str], optional (default=['EN', 'CAPITAL', 'NOT_LANG'])) – only translate substrings if the results from language_detection_word is in acceptable_language_detection.
segmenter (Callable, optional (default=None)) – function to segment words. If provided, it will expand a word, apaitu -> apa itu
text_scorer (Callable, optional (default=None)) – function to validate uppercase words. If the lowercase score is higher than or equal to the uppercase score, the lowercase form is chosen.
text_scorer_window (int, optional (default=2)) – size of lookback and lookforward to validate upper word.
not_a_word_threshold (float, optional (default=1e-4)) – assume a word is not a human word if score lower than not_a_word_threshold. only usable if passed text_scorer parameter.
dateparser_settings (Dict, optional (default={'TIMEZONE': 'GMT+8'})) – default dateparser setting, check support settings at https://dateparser.readthedocs.io/en/latest/
- Returns
result
- Return type
{‘normalize’, ‘date’, ‘money’}
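A few of the rules above are simple enough to sketch directly. The helpers below reproduce the documented examples for normalize_number, normalize_ic and normalize_url; the function names are hypothetical and the logic is a simplified illustration, not the Normalizer's implementation.

```python
# Malay words for the digits 0-9.
DIGITS = ['kosong', 'satu', 'dua', 'tiga', 'empat',
          'lima', 'enam', 'tujuh', 'lapan', 'sembilan']


def spell_digits(text):
    """Spell each ASCII digit on its own, ignoring other characters."""
    return ' '.join(DIGITS[int(ch)] for ch in text if ch in '0123456789')


def spell_ic(ic):
    """Spell an identity card number, reading each hyphen as 'sempang'."""
    return ' sempang '.join(spell_digits(part) for part in ic.split('-'))


def spell_url(url):
    """Replace '://' with a space and '.' with ' dot '."""
    return url.replace('://', ' ').replace('.', ' dot ')


number = spell_digits('0123')
ic = spell_ic('911111-01-1111')
url = spell_url('https://huseinhouse.com')
```

Rules such as normalize_cardinal or normalize_date need number-to-words and date parsing logic and are not covered by this sketch.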
-
malaya.qa.extractive#
-
malaya.qa.extractive.
transformer
(model='xlnet', quantized=False, **kwargs)[source]# Load Transformer Span model trained on SQUAD V2 dataset.
- Parameters
model (str, optional (default='xlnet')) – Check available models at malaya.qa.extractive.available_transformer().
quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result
- Return type
malaya.model.tf.SQUAD class
-
malaya.qa.extractive.
huggingface
(model='mesolitica/finetune-qa-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]# Load a HuggingFace model for extractive question answering.
- Parameters
model (str, optional (default='mesolitica/finetune-qa-t5-small-standard-bahasa-cased')) – Check available models at malaya.qa.extractive.available_huggingface().
force_check (bool, optional (default=True)) – Force a check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.
- Returns
result
- Return type
malaya.similarity.doc2vec#
-
malaya.similarity.doc2vec.
wordvector
(wv)[source]# Doc2vec interface for text similarity using Word Vector.
- Parameters
wv (object) – malaya.wordvector.WordVector object. should have get_vector_by_name method.
- Returns
result
- Return type
-
malaya.similarity.doc2vec.
vectorizer
(v)[source]# Doc2vec interface for text similarity using Encoder model.
- Parameters
v (object) – encoder interface object, BERT, XLNET. should have vectorize method.
- Returns
result
- Return type
-
class
malaya.similarity.doc2vec.
VectorizerSimilarity
[source]# -
predict_proba
(left_strings, right_strings, similarity='cosine')[source]# calculate similarity for two different batch of texts.
- Parameters
left_strings (list of str) –
right_strings (list of str) –
similarity (str, optional (default='cosine')) – similarity metric to use. Allowed values:
'cosine' - cosine similarity.
'euclidean' - euclidean similarity.
'manhattan' - manhattan similarity.
- Returns
result
- Return type
List[float]
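A minimal sketch of the three similarity options applied pairwise to two batches of vectors. Mapping a distance d to a similarity with 1 / (1 + d) is an assumption made for illustration; the source does not state how Malaya converts euclidean or manhattan distances, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def euclidean_similarity(u, v):
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    # Assumption: map distance to (0, 1] so that identical vectors score 1.
    return 1.0 / (1.0 + distance)


def manhattan_similarity(u, v):
    distance = sum(abs(a - b) for a, b in zip(u, v))
    # Same assumed distance-to-similarity mapping as above.
    return 1.0 / (1.0 + distance)


METRICS = {
    'cosine': cosine_similarity,
    'euclidean': euclidean_similarity,
    'manhattan': manhattan_similarity,
}


def predict_proba_sketch(left_vectors, right_vectors, similarity='cosine'):
    """Score the i-th left vector against the i-th right vector."""
    metric = METRICS[similarity]
    return [metric(left, right) for left, right in zip(left_vectors, right_vectors)]


same = predict_proba_sketch([[1.0, 0.0]], [[1.0, 0.0]])
orthogonal = predict_proba_sketch([[1.0, 0.0]], [[0.0, 1.0]])
euclid = predict_proba_sketch([[0.0, 0.0]], [[3.0, 4.0]], similarity='euclidean')
manhattan = predict_proba_sketch([[0.0, 0.0]], [[3.0, 4.0]], similarity='manhattan')
```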
-
heatmap
(strings, similarity='cosine', visualize=True, annotate=True, figsize=(7, 7))[source]# plot a heatmap based on output from bert similarity.
- Parameters
strings (list of str) – list of strings.
similarity (str, optional (default='cosine')) – similarity metric to use. Allowed values:
'cosine' - cosine similarity.
'euclidean' - euclidean similarity.
'manhattan' - manhattan similarity.
visualize (bool) – if True, it will render plt.show, else return data.
figsize (tuple, (default=(7, 7))) – figure size for plot.
- Returns
result – list of results
- Return type
list
-
-
class
malaya.similarity.doc2vec.
Doc2VecSimilarity
[source]# -
predict_proba
(left_strings, right_strings, aggregation=<function mean>, similarity='cosine', soft=False)[source]# calculate similarity for two different batch of texts.
- Parameters
left_strings (list of str) –
right_strings (list of str) –
aggregation (Callable, optional (default=numpy.mean)) –
similarity (str, optional (default='cosine')) – similarity metric to use. Allowed values:
'cosine' - cosine similarity.
'euclidean' - euclidean similarity.
'manhattan' - manhattan similarity.
soft (bool, optional (default=False)) – if True, a word not inside the word vector will be replaced with the nearest word; otherwise it is skipped.
- Returns
result
- Return type
List[float]
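A minimal sketch of how aggregation and soft might combine word vectors into one document vector. The function name and the toy word_vectors table are hypothetical, and difflib.get_close_matches is only a stand-in for whatever nearest-word lookup the library uses.

```python
import difflib
from statistics import fmean


def document_vector(tokens, word_vectors, aggregation=fmean, soft=False):
    """Aggregate per-word vectors into a single vector, column by column."""
    vectors = []
    for token in tokens:
        if token in word_vectors:
            vectors.append(word_vectors[token])
        elif soft:
            # Stand-in for a nearest-word lookup over the vocabulary.
            nearest = difflib.get_close_matches(
                token, list(word_vectors), n=1, cutoff=0.0
            )
            if nearest:
                vectors.append(word_vectors[nearest[0]])
        # soft=False: out-of-vocabulary tokens are skipped.
    if not vectors:
        return []
    return [aggregation(column) for column in zip(*vectors)]


word_vectors = {'makan': [1.0, 0.0], 'nasi': [0.0, 1.0]}
strict = document_vector(['makan', 'nasi', 'lemak'], word_vectors)
soft = document_vector(['makan', 'nasii'], word_vectors, soft=True)
```

The resulting document vectors can then be compared with any of the similarity metrics listed in the parameters.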
-
heatmap
(strings, aggregation=<function mean>, similarity='cosine', soft=False, visualize=True, annotate=True, figsize=(7, 7))[source]# plot a heatmap based on output from bert similarity.
- Parameters
strings (list of str) – list of strings
aggregation (Callable, optional (default=numpy.mean)) –
similarity (str, optional (default='cosine')) – similarity metric to use. Allowed values:
'cosine' - cosine similarity.
'euclidean' - euclidean similarity.
'manhattan' - manhattan similarity.
soft (bool, optional (default=False)) – if True, a word not inside the word vector will be replaced with the nearest word; otherwise it is skipped.
visualize (bool) – if True, it will render plt.show, else return data.
figsize (tuple, (default=(7, 7))) – figure size for plot.
- Returns
result – list of results.
- Return type
list
-
malaya.similarity.semantic#
malaya.spelling_correction.jamspell#
-
class
malaya.spelling_correction.jamspell.
JamSpell
[source]# -
correct
(word, string, index=-1)[source]# Correct a word within a text, returning the corrected word.
- Parameters
word (str) –
string (List[str]) – Tokenized string, word must be a word inside string.
index (int, optional(default=-1)) – index of word in the string, if -1, will try to use string.index(word).
- Returns
result
- Return type
str
-
correct_word
(word, string, index=-1)[source]# Spell-correct a word and preserve proper upper, lower and title case.
-
correct_match
(match, string, index=-1)[source]# Spell-correct the word in a re.match object and preserve proper upper, lower and title case.
-
correct_text
(text)[source]# Correct all the words within a text, returning the corrected text.
- Parameters
text (str) –
- Returns
result
- Return type
str
-
edit_candidates
(word, string, index=-1)[source]# Generate candidates given a word.
- Parameters
word (str) –
string (str) – Entire string, word must be a word inside string.
index (int, optional(default=-1)) – index of word in the string, if -1, will try to use string.index(word).
- Returns
result
- Return type
List[str]
-
malaya.spelling_correction.probability#
-
class
malaya.spelling_correction.probability.
Spell
[source]# -
-
edit_candidates
(word)[source]# Generate candidates given a word.
- Parameters
word (str) –
- Returns
result
- Return type
List[str]
-
correct_text
(text)[source]# Correct all the words within a text, returning the corrected text.
- Parameters
text (str) –
- Returns
result
- Return type
str
-
-
class
malaya.spelling_correction.probability.
Probability
[source]# The SpellCorrector extends the functionality of Peter Norvig's spell-corrector, http://norvig.com/spell-correct.html, and improves it using some algorithms from Normalization of noisy texts in Malaysian online reviews, https://www.researchgate.net/publication/287050449_Normalization_of_noisy_texts_in_Malaysian_online_reviews. Custom vowel augmentation is also added.
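For reference, the candidate generation step in Norvig's spell-corrector can be sketched as below. This follows the publicly described edit-distance-1 scheme (deletes, transposes, replaces, inserts); the candidates helper and the toy vocabulary are hypothetical, and the sketch does not include the Malaysian-specific rules or vowel augmentation mentioned above.

```python
import string


def edits1(word, letters=string.ascii_lowercase):
    """All strings that are one edit away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits if len(right) > 1
    ]
    replaces = [
        left + c + right[1:] for left, right in splits if right for c in letters
    ]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def candidates(word, vocabulary):
    """Known words one edit away, falling back to the word itself."""
    known = {w for w in edits1(word) if w in vocabulary}
    return known or {word}


vocabulary = {'makan', 'minum'}
from_insert = candidates('mkan', vocabulary)
from_transpose = candidates('makna', vocabulary)
unknown = candidates('xyz', vocabulary)
```

A probability model over the vocabulary would then pick the most likely candidate from the returned set.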
-
class
malaya.spelling_correction.probability.
ProbabilityLM
[source]# The SpellCorrector extends the functionality of Peter Norvig's spell-corrector, http://norvig.com/spell-correct.html, with a Language Model, and improves it using some algorithms from Normalization of noisy texts in Malaysian online reviews, https://www.researchgate.net/publication/287050449_Normalization_of_noisy_texts_in_Malaysian_online_reviews. Custom vowel augmentation is also added.
-
correct
(word, string, index=-1, lookback=3, lookforward=3, **kwargs)[source]# Correct a word within a text, returning the corrected word.
- Parameters
word (str) –
string (List[str]) – Entire string, word must be a word inside string.
index (int, optional (default=-1)) – index of word in the string, if -1, will try to use string.index(word).
lookback (int, optional (default=3)) – N words on the left hand side. if put -1, will take all words on the left hand side. longer left hand side will take longer to compute.
lookforward (int, optional (default=3)) – N words on the right hand side. if put -1, will take all words on the right hand side. longer right hand side will take longer to compute.
- Returns
result
- Return type
str
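The index, lookback and lookforward parameters can be pictured with a small sketch of how a context window might be cut around the target word. The helper names are hypothetical; only the -1 conventions come from the parameter descriptions above.

```python
def resolve_index(word, tokens, index=-1):
    # index=-1 falls back to the first occurrence, like string.index(word).
    return tokens.index(word) if index == -1 else index


def context_window(tokens, index, lookback=3, lookforward=3):
    """Return the left and right context around tokens[index]."""
    # -1 means "take every word on that side".
    left_start = 0 if lookback == -1 else max(0, index - lookback)
    right_end = len(tokens) if lookforward == -1 else index + 1 + lookforward
    return tokens[left_start:index], tokens[index + 1:right_end]


tokens = 'saya suka mkan nasi lemak setiap pagi'.split()
position = resolve_index('mkan', tokens)
narrow = context_window(tokens, position, lookback=1, lookforward=2)
full = context_window(tokens, position, lookback=-1, lookforward=-1)
```

A wider window gives the language model more context to score candidates, at the cost of longer computation, as the parameter descriptions note.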
-
correct_text
(text, lookback=3, lookforward=3)[source]# Correct all the words within a text, returning the corrected text.
- Parameters
text (str) –
lookback (int, optional (default=3)) – N words on the left hand side. if put -1, will take all words on the left hand side. longer left hand side will take longer to compute.
lookforward (int, optional (default=3)) – N words on the right hand side. if put -1, will take all words on the right hand side. longer right hand side will take longer to compute.
- Returns
result
- Return type
str
-
correct_word
(word, string, index=- 1, lookback=3, lookforward=3)[source]# Spell-correct word, and preserve proper upper, lower and title case.
- Parameters
word (str) –
string (List[str]) – Tokenized string; word must be a word inside string.
index (int, optional (default=-1)) – Index of word in the string. If -1, will try to use string.index(word).
lookback (int, optional (default=3)) – Number of words to use on the left-hand side. If -1, takes all words on the left-hand side. A longer left-hand side takes longer to compute.
lookforward (int, optional (default=3)) – Number of words to use on the right-hand side. If -1, takes all words on the right-hand side. A longer right-hand side takes longer to compute.
- Returns
result
- Return type
str
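Preserving upper, lower and title case can be sketched as re-applying the casing of the original word to the corrected one. `match_case` below is a hypothetical helper illustrating the idea; it is an assumption about the behaviour, not Malaya's implementation:

```python
def match_case(original: str, corrected: str) -> str:
    """Re-apply the casing pattern of `original` to `corrected` (hypothetical helper)."""
    if original.isupper():
        return corrected.upper()
    if original.istitle():
        return corrected.capitalize()
    if original.islower():
        return corrected.lower()
    # Mixed casing has no obvious pattern to copy, so leave the correction as-is.
    return corrected


print(match_case('Mkan', 'makan'))
```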
-
malaya.spelling_correction.spylls#
malaya.spelling_correction.symspell#
-
class
malaya.spelling_correction.symspell.
Symspell
[source]# The SymspellCorrector extends the functionality of symspellpy, https://github.com/mammothb/symspellpy, and improves it using some algorithms from Normalization of noisy texts in Malaysian online reviews, https://www.researchgate.net/publication/287050449_Normalization_of_noisy_texts_in_Malaysian_online_reviews. Custom vowel augmentation is also added.
-
edit_step
(word)[source]# Generate candidates given a word.
- Parameters
word (str) –
- Returns
result
- Return type
{candidate1, candidate2}
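To give a feel for what a candidate set looks like, here is a sketch of the classic edit-distance-1 step from Peter Norvig's spell-corrector. This is an assumption for illustration: `edit_step_sketch` is a hypothetical name, and the library's own candidate generation (including its vowel augmentation) may differ.

```python
import string
from typing import Set


def edit_step_sketch(word: str, alphabet: str = string.ascii_lowercase) -> Set[str]:
    """All strings one edit away from `word` (Norvig-style, hypothetical helper)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # Remove one character.
    deletes = {left + right[1:] for left, right in splits if right}
    # Swap two adjacent characters.
    transposes = {
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    }
    # Replace one character with any letter of the alphabet.
    replaces = {left + c + right[1:] for left, right in splits if right for c in alphabet}
    # Insert one letter at any position.
    inserts = {left + c + right for left, right in splits for c in alphabet}
    return deletes | transposes | replaces | inserts


candidates = edit_step_sketch('mkan')
print('makan' in candidates)
```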
-
malaya.spelling_correction.transformer#
-
class
malaya.spelling_correction.transformer.
Transformer
[source]# -
correct
(word, string, index=-1, lookback=5, lookforward=5, batch_size=20, **kwargs)[source]# Correct a word within a text, returning the corrected word.
- Parameters
word (str) –
string (List[str]) – Tokenized string; word must be a word inside string.
index (int, optional (default=-1)) – Index of word in the string. If -1, will try to use string.index(word).
lookback (int, optional (default=5)) – Number of words to use on the left-hand side. If -1, takes all words on the left-hand side. A longer left-hand side takes longer to compute.
lookforward (int, optional (default=5)) – Number of words to use on the right-hand side. If -1, takes all words on the right-hand side. A longer right-hand side takes longer to compute.
batch_size (int, optional (default=20)) – Batch size to feed into the model.
- Returns
result
- Return type
str
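The batch_size parameter bounds how many inputs reach the model in one call. A minimal sketch of such chunking, assuming the inputs are a flat list of candidate strings; `batched` is a hypothetical helper, and what Malaya actually batches internally is not stated in this reference:

```python
from typing import Iterator, List


def batched(items: List[str], batch_size: int = 20) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `batch_size` items (hypothetical helper)."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


for batch in batched(['makan', 'makam', 'mkan', 'akan', 'ikan'], batch_size=2):
    print(batch)
```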
-
correct_text
(text, lookback=5, lookforward=5, batch_size=20)[source]# Correct all the words within a text, returning the corrected text.
- Parameters
text (str) –
lookback (int, optional (default=5)) – Number of words to use on the left-hand side. If -1, takes all words on the left-hand side. A longer left-hand side takes longer to compute.
lookforward (int, optional (default=5)) – Number of words to use on the right-hand side. If -1, takes all words on the right-hand side. A longer right-hand side takes longer to compute.
batch_size (int, optional (default=20)) – Batch size to feed into the model.
- Returns
result
- Return type
str
-
correct_word
(word, string, index=-1, lookback=5, lookforward=5, batch_size=20)[source]# Spell-correct word, and preserve proper upper, lower and title case.
- Parameters
word (str) –
string (List[str]) – Tokenized string; word must be a word inside string.
index (int, optional (default=-1)) – Index of word in the string. If -1, will try to use string.index(word).
lookback (int, optional (default=5)) – Number of words to use on the left-hand side. If -1, takes all words on the left-hand side. A longer left-hand side takes longer to compute.
lookforward (int, optional (default=5)) – Number of words to use on the right-hand side. If -1, takes all words on the right-hand side. A longer right-hand side takes longer to compute.
batch_size (int, optional (default=20)) – Batch size to feed into the model.
- Returns
result
- Return type
str
-
malaya.summarization.abstractive#
-
malaya.summarization.abstractive.
available_transformer
()[source]# List available transformer models.
-
malaya.summarization.abstractive.
available_huggingface
()[source]# List available huggingface models.
-
malaya.summarization.abstractive.
transformer
(model='small-t5', quantized=False, **kwargs)[source]# Load Malaya transformer encoder-decoder model to generate a summary given a string.
- Parameters
model (str, optional (default='small-t5')) – Check available models at malaya.summarization.abstractive.available_transformer().
quantized (bool, optional (default=False)) – If True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends entirely on the machine.
- Returns
result – List of model classes:
if t5 in model, will return malaya.model.t5.Summarization.
if bigbird in model, will return malaya.model.bigbird.Summarization.
if pegasus in model, will return malaya.model.pegasus.Summarization.
- Return type
model
-
malaya.summarization.abstractive.
huggingface
(model='mesolitica/finetune-summarization-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]# Load HuggingFace model for abstractive summarization.
- Parameters
model (str, optional (default='mesolitica/finetune-summarization-t5-small-standard-bahasa-cased')) – Check available models at malaya.summarization.abstractive.available_huggingface().
force_check (bool, optional (default=True)) – Force a check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.
- Returns
result
- Return type
malaya.summarization.extractive#
-
malaya.summarization.extractive.
encoder
(vectorizer)[source]# Encoder interface for summarization.
- Parameters
vectorizer (object) – Encoder interface object, e.g., BERT, XLNET, ALBERT, ALXLNET. Should have a vectorize method.
- Returns
result
- Return type
-
malaya.summarization.extractive.
doc2vec
(wordvector)[source]# Doc2Vec interface for summarization.
- Parameters
wordvector (object) – malaya.wordvector.WordVector object. Should have a get_vector_by_name method.
- Returns
result
- Return type
-
malaya.summarization.extractive.
sklearn
(model, vectorizer)[source]# sklearn interface for summarization.
- Parameters
model (object) –
Should have fit_transform method. Commonly:
sklearn.decomposition.TruncatedSVD - LSA algorithm.
sklearn.decomposition.LatentDirichletAllocation - LDA algorithm.
vectorizer (object) –
Should have fit_transform method. Commonly:
sklearn.feature_extraction.text.TfidfVectorizer - TFIDF algorithm.
sklearn.feature_extraction.text.CountVectorizer - Bag-of-Word algorithm.
malaya.text.vectorizer.SkipGramCountVectorizer - Skip Gram Bag-of-Word algorithm.
malaya.text.vectorizer.SkipGramTfidfVectorizer - Skip Gram TFIDF algorithm.
- Returns
result
- Return type
malaya.text_to_kg.e2e#
-
malaya.text_to_kg.e2e.
huggingface
(model='mesolitica/finetune-ttkg-t5-small-standard-bahasa-cased', **kwargs)[source]# Load HuggingFace model for end-to-end text to knowledge graph.
- Parameters
model (str, optional (default='mesolitica/finetune-ttkg-t5-small-standard-bahasa-cased')) – Check available models at malaya.text_to_kg.e2e.available_huggingface().
- Returns
result
- Return type
malaya.torch_model.huggingface.TexttoKG
malaya.text_to_kg.parser#
-
malaya.text_to_kg.parser.
from_dependency
(tagging, indexing, subjects=[['flat', 'subj', 'nsubj', 'csubj']], relations=[['acl', 'xcomp', 'ccomp', 'obj', 'conj', 'advcl'], ['obj']], objects=[['obj', 'compound', 'flat', 'nmod', 'obl']], got_networkx=True)[source]# Generate knowledge graphs from a dependency parsing model. This function has been curated on top of malaya.dependency.transformer(version='v1').
- Parameters
tagging (List[Tuple[str, str]]) – tagging result from dependency model.
indexing (List[Tuple[str, str]]) – indexing result from dependency model.
subjects (List[List[str]], optional) – List of dependency labels for subjects.
relations (List[List[str]], optional) – List of dependency labels for relations.
objects (List[List[str]], optional) – List of dependency labels for objects.
got_networkx (bool, optional (default=True)) – If True, will generate networkx.MultiDiGraph.
- Returns
result
- Return type
Dict[result, G]
malaya.topic_model.decomposition#
-
malaya.topic_model.decomposition.
fit
(corpus, model, vectorizer, n_topics, cleaning=<function simple_textcleaning>, stopwords=<function get_stopwords>, **kwargs)[source]# Train a SKlearn model to do topic modelling based on corpus given.
- Parameters
corpus (list) –
model (object) –
Should have fit_transform method. Commonly:
sklearn.decomposition.TruncatedSVD - LSA algorithm.
sklearn.decomposition.LatentDirichletAllocation - LDA algorithm.
sklearn.decomposition.NMF - NMF algorithm.
vectorizer (object) –
Should have fit_transform method. Commonly:
sklearn.feature_extraction.text.TfidfVectorizer - TFIDF algorithm.
sklearn.feature_extraction.text.CountVectorizer - Bag-of-Word algorithm.
malaya.text.vectorizer.SkipGramCountVectorizer - Skip Gram Bag-of-Word algorithm.
malaya.text.vectorizer.SkipGramTfidfVectorizer - Skip Gram TFIDF algorithm.
n_topics (int) – size of decomposition column.
cleaning (function, (default=malaya.text.function.simple_textcleaning)) – function to clean the corpus.
stopwords (List[str], (default=malaya.text.function.get_stopwords)) – A callable that returns a List[str], or a List[str], or a Tuple[str].
- Returns
result
- Return type
malaya.topic_model.decomposition.Topic class
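The stopwords parameter accepts three shapes: a callable returning a list, a list, or a tuple. A minimal sketch of how such an argument could be normalized; `resolve_stopwords` is a hypothetical helper and an assumption about the handling, not Malaya's own code:

```python
from typing import Set


def resolve_stopwords(stopwords) -> Set[str]:
    """Normalize a callable / list / tuple of stopwords into a set (hypothetical helper)."""
    # A callable is expected to return the actual collection of stopwords.
    if callable(stopwords):
        stopwords = stopwords()
    if not isinstance(stopwords, (list, tuple)):
        raise ValueError('stopwords must be a callable, a List[str] or a Tuple[str]')
    return set(stopwords)


print(sorted(resolve_stopwords(lambda: ['yang', 'dan', 'di'])))
```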
-
class
malaya.topic_model.decomposition.
Topic
[source]# -
visualize_topics
(notebook_mode=False, mds='pcoa')[source]# Print important topics based on decomposition.
- Parameters
mds (str, optional (default='pcoa')) –
2D Decomposition. Allowed values:
'pcoa' - Dimension reduction via Jensen-Shannon Divergence & Principal Coordinate Analysis (aka Classical Multidimensional Scaling).
'mmds' - Dimension reduction via Multidimensional scaling.
'tsne' - Dimension reduction via t-distributed stochastic neighbor embedding.
-
top_topics
(len_topic, top_n=10, return_df=True)[source]# Print important topics based on decomposition.
- Parameters
len_topic (int) – size of topics.
top_n (int, optional (default=10)) – top n of each topic.
return_df (bool, optional (default=True)) – return as pandas.DataFrame, else JSON.
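The relationship between len_topic and top_n can be sketched with a plain topic-by-word weight matrix, in the shape of scikit-learn's `components_` attribute. The function name, the `'topic N'` keys and the toy weights below are all assumptions for illustration, not the library's output format:

```python
from typing import Dict, List, Sequence


def top_words_per_topic(
    components: Sequence[Sequence[float]],
    vocabulary: Sequence[str],
    len_topic: int,
    top_n: int = 10,
) -> Dict[str, List[str]]:
    """Pick the `top_n` highest-weighted words for the first `len_topic` topics."""
    topics = {}
    for topic_index, weights in enumerate(components[:len_topic]):
        # Rank vocabulary positions by descending weight within this topic.
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        topics[f'topic {topic_index}'] = [vocabulary[i] for i in ranked[:top_n]]
    return topics


components = [[0.1, 0.7, 0.2], [0.5, 0.1, 0.4]]  # toy weights, one row per topic
vocabulary = ['ayam', 'kerajaan', 'harga']
print(top_words_per_topic(components, vocabulary, len_topic=2, top_n=2))
```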
-
malaya.topic_model.lda2vec#
-
malaya.topic_model.lda2vec.
fit
(corpus, vectorizer, n_topics=10, cleaning=<function simple_textcleaning>, stopwords=<function get_stopwords>, window_size=2, embedding_size=128, epoch=10, switch_loss=1000, random_state=10, **kwargs)[source]# Train a LDA2Vec model to do topic modelling based on corpus / list of strings given.
- Parameters
corpus (list) –
vectorizer (object) –
Should have fit_transform method. Commonly:
sklearn.feature_extraction.text.TfidfVectorizer - TFIDF algorithm.
sklearn.feature_extraction.text.CountVectorizer - Bag-of-Word algorithm.
malaya.text.vectorizer.SkipGramCountVectorizer - Skip Gram Bag-of-Word algorithm.
malaya.text.vectorizer.SkipGramTfidfVectorizer - Skip Gram TFIDF algorithm.
n_topics (int, (default=10)) – size of decomposition column.
cleaning (function, (default=malaya.text.function.simple_textcleaning)) – function to clean the corpus.
stopwords (List[str], (default=malaya.text.function.get_stopwords)) – A callable that returns a List[str], or a List[str], or a Tuple[str].
embedding_size (int, (default=128)) – embedding size of lda2vec tensors.
epoch (int, (default=10)) – number of training iterations.
switch_loss (int, (default=1000)) – baseline to switch from document based loss to document + word based loss.
random_state (int, (default=10)) – random_state for sklearn.utils.shuffle parameter.
- Returns
result
- Return type
malaya.topic_model.lda2vec.DeepTopic class
-
class
malaya.topic_model.lda2vec.
DeepTopic
[source]# -
visualize_topics
(notebook_mode=False, mds='pcoa')[source]# Print important topics based on decomposition.
- Parameters
mds (str, optional (default='pcoa')) –
2D Decomposition. Allowed values:
'pcoa' - Dimension reduction via Jensen-Shannon Divergence & Principal Coordinate Analysis (aka Classical Multidimensional Scaling).
'mmds' - Dimension reduction via Multidimensional scaling.
'tsne' - Dimension reduction via t-distributed stochastic neighbor embedding.
-
top_topics
(len_topic, top_n=10, return_df=True)[source]# Print important topics based on decomposition.
- Parameters
len_topic (int) – size of topics.
top_n (int, optional (default=10)) – top n of each topic.
return_df (bool, optional (default=True)) – return as pandas.DataFrame, else JSON.
-
malaya.topic_model.transformer#
-
class
malaya.topic_model.transformer.
AttentionTopic
[source]#
malaya.translation.en_ms#
-
malaya.translation.en_ms.
dictionary
(**kwargs)[source]# Load dictionary {EN: MS} .
- Returns
result
- Return type
Dict[str, str]
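Since the return type is a plain Dict[str, str], it can be used for simple word-level lookup. A minimal sketch, assuming lowercase English keys; the three-entry dictionary and the `translate_words` helper are hypothetical stand-ins, not data or functions from Malaya:

```python
from typing import Dict


def translate_words(text: str, dictionary: Dict[str, str]) -> str:
    """Replace each whitespace-separated word found in `dictionary`; keep the rest."""
    return ' '.join(dictionary.get(word.lower(), word) for word in text.split())


# Hypothetical stand-in for the loaded {EN: MS} dictionary.
toy_dictionary = {'i': 'saya', 'eat': 'makan', 'chicken': 'ayam'}
print(translate_words('I eat chicken', toy_dictionary))
```

Word-by-word lookup ignores word order and context, so it is only a rough fallback compared with the transformer and HuggingFace translation models in this module.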
-
malaya.translation.en_ms.
transformer
(model='base', quantized=False, **kwargs)[source]# Load transformer encoder-decoder model to translate EN-to-MS.
- Parameters
model (str, optional (default='base')) – Check available models at malaya.translation.en_ms.available_transformer().
quantized (bool, optional (default=False)) – If True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends entirely on the machine.
- Returns
result – List of model classes:
if bigbird in model, return malaya.model.bigbird.Translation.
else, return malaya.model.tf.Translation.
- Return type
model
-
malaya.translation.en_ms.
huggingface
(model='mesolitica/finetune-translation-t5-small-standard-bahasa-cased-v2', force_check=True, **kwargs)[source]# Load HuggingFace model to translate EN-to-MS.
- Parameters
model (str, optional (default='mesolitica/finetune-translation-t5-small-standard-bahasa-cased-v2')) – Check available models at malaya.translation.en_ms.available_huggingface().
force_check (bool, optional (default=True)) – Force a check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.
- Returns
result
- Return type
malaya.translation.ms_en#
-
malaya.translation.ms_en.
transformer
(model='base', quantized=False, **kwargs)[source]# Load Transformer encoder-decoder model to translate MS-to-EN.
- Parameters
model (str, optional (default='base')) – Check available models at malaya.translation.ms_en.available_transformer().
quantized (bool, optional (default=False)) – If True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends entirely on the machine.
- Returns
result – List of model classes:
if bigbird in model, return malaya.model.bigbird.Translation.
else, return malaya.model.tf.Translation.
- Return type
model
-
malaya.translation.ms_en.
huggingface
(model='mesolitica/finetune-translation-t5-small-standard-bahasa-cased-v2', force_check=True, **kwargs)[source]# Load HuggingFace model to translate MS-to-EN.
- Parameters
model (str, optional (default='mesolitica/finetune-translation-t5-small-standard-bahasa-cased-v2')) – Check available models at malaya.translation.ms_en.available_huggingface().
force_check (bool, optional (default=True)) – Force a check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.
- Returns
result
- Return type
malaya.zero_shot.classification#
-
malaya.zero_shot.classification.
available_transformer
()[source]# List available transformer zero-shot models.
-
malaya.zero_shot.classification.
available_huggingface
()[source]# List available huggingface zero-shot models.
-
malaya.zero_shot.classification.
transformer
(model='bert', quantized=False, **kwargs)[source]# Load Transformer zero-shot model.
- Parameters
model (str, optional (default='bert')) – Check available models at malaya.zero_shot.classification.available_transformer().
quantized (bool, optional (default=False)) – If True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends entirely on the machine.
- Returns
result – List of model classes:
if bert in model, will return malaya.model.bert.ZeroshotBERT.
if xlnet in model, will return malaya.model.xlnet.ZeroshotXLNET.
- Return type
model
-
malaya.zero_shot.classification.
huggingface
(model='mesolitica/finetune-mnli-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]# Load HuggingFace model for zero-shot text classification.
- Parameters
model (str, optional (default='mesolitica/finetune-mnli-t5-small-standard-bahasa-cased')) – Check available models at malaya.zero_shot.classification.available_huggingface().
force_check (bool, optional (default=True)) – Force a check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.
- Returns
result
- Return type
malaya.zero_shot.entity#
-
malaya.zero_shot.entity.
huggingface
(model='mesolitica/finetune-zeroshot-ner-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]# Load HuggingFace model for zero-shot NER.
- Parameters
model (str, optional (default='mesolitica/finetune-zeroshot-ner-t5-small-standard-bahasa-cased')) – Check available models at malaya.zero_shot.entity.available_huggingface().
force_check (bool, optional (default=True)) – Force a check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.
- Returns
result
- Return type
malaya.cluster#
-
malaya.cluster.
cluster_words
(list_words, lowercase=False)[source]# Cluster similar words based on structure, e.g., ['mahathir mohamad', 'mahathir'] -> ['mahathir mohamad']. Time complexity is O(n^2).
- Parameters
list_words (List[str]) –
lowercase (bool, optional (default=False)) – if True, will group using lowercase but maintain the original form.
- Returns
string
- Return type
List[str]
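The documented example, where 'mahathir' is absorbed into 'mahathir mohamad', can be reproduced with a pairwise substring check, which is where the O(n^2) cost comes from. This is a sketch under that assumption; `cluster_words_sketch` is a hypothetical name and the library's actual matching rule may differ:

```python
from typing import List


def cluster_words_sketch(list_words: List[str], lowercase: bool = False) -> List[str]:
    """Drop every string that is contained in another, longer string."""
    # Map the comparison key to the first original form seen for it.
    originals = {}
    for word in list_words:
        key = word.lower() if lowercase else word
        originals.setdefault(key, word)
    keys = list(originals)
    # Pairwise comparison: keep a key only if no other key contains it.
    return [
        originals[key]
        for key in keys
        if not any(key != other and key in other for other in keys)
    ]


print(cluster_words_sketch(['mahathir mohamad', 'mahathir']))
```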
-
malaya.cluster.
cluster_pos
(result)[source]# cluster similar POS.
- Parameters
result (List[Tuple[str, str]]) –
- Returns
result
- Return type
Dict[str, List[str]]
-
malaya.cluster.
cluster_entities
(result)[source]# cluster similar Entities.
- Parameters
result (List[Tuple[str, str]]) –
- Returns
result
- Return type
Dict[str, List[str]]
-
malaya.cluster.
cluster_tagging
(result)[source]# Cluster any tagging results, as long as the data passed is in the form [(string, label), (string, label)].
- Parameters
result (List[Tuple[str, str]]) –
- Returns
result
- Return type
Dict[str, List[str]]
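Going from List[Tuple[str, str]] to Dict[str, List[str]] amounts to grouping tokens under their labels. The sketch below additionally joins consecutive tokens that share a label into one phrase; that merging rule, the `cluster_tagging_sketch` name and the label strings are assumptions for illustration:

```python
from itertools import groupby
from typing import Dict, List, Tuple


def cluster_tagging_sketch(result: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (string, label) pairs by label, merging runs of the same label."""
    clusters: Dict[str, List[str]] = {}
    # groupby only groups adjacent items, which is exactly a "run" of one label.
    for label, run in groupby(result, key=lambda pair: pair[1]):
        phrase = ' '.join(token for token, _ in run)
        clusters.setdefault(label, []).append(phrase)
    return clusters


tagged = [
    ('Husein', 'person'),
    ('Zolkepli', 'person'),
    ('tinggal', 'OTHER'),
    ('di', 'OTHER'),
    ('Kuala', 'location'),
    ('Lumpur', 'location'),
]
print(cluster_tagging_sketch(tagged))
```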
-
malaya.cluster.
cluster_scatter
(corpus, vectorizer, num_clusters=5, titles=None, colors=None, stopwords=<function get_stopwords>, cleaning=<function simple_textcleaning>, clustering=<class 'sklearn.cluster._kmeans.KMeans'>, decomposition=<class 'sklearn.manifold._mds.MDS'>, ngram=(1, 3), figsize=(17, 9), batch_size=20)[source]# plot scatter plot on similar text clusters.
- Parameters
corpus (List[str]) –
vectorizer (class) – vectorizer class.
num_clusters (int, (default=5)) – size of unsupervised clusters.
titles (List[str], (default=None)) – list of titles, must be the same length as corpus.
colors (List[str], (default=None)) – list of colors, length must equal num_clusters.
stopwords (List[str], (default=malaya.text.function.get_stopwords)) – A callable that returns a List[str], or a List[str], or a Tuple[str].
ngram (Tuple[int, int], (default=(1,3))) – n-grams size to train a corpus.
cleaning (function, (default=malaya.text.function.simple_textcleaning)) – function to clean the corpus.
batch_size (int, (default=20)) – number of strings for each vectorization and attention. Only useful when using a transformer vectorizer.
- Returns
dictionary
- Return type
{‘X’: X, ‘Y’: Y, ‘labels’: clusters, ‘vector’: transformed_text_clean, ‘titles’: titles}
-
malaya.cluster.
cluster_dendogram
(corpus, vectorizer, titles=None, stopwords=<function get_stopwords>, cleaning=<function simple_textcleaning>, random_samples=0.3, ngram=(1, 3), figsize=(17, 9), batch_size=20)[source]# plot hierarchical dendogram with similar texts.
- Parameters
corpus (List[str]) –
vectorizer (class) – vectorizer class.
num_clusters (int, (default=5)) – size of unsupervised clusters.
titles (List[str], (default=None)) – list of titles, must be the same length as corpus.
stopwords (List[str], (default=malaya.text.function.get_stopwords)) – A callable that returns a List[str], or a List[str], or a Tuple[str].
cleaning (function, (default=malaya.text.function.simple_textcleaning)) – function to clean the corpus.
random_samples (float, (default=0.3)) – random samples from the corpus, 0.3 means 30%.
ngram (Tuple[int, int], (default=(1,3))) – n-grams size to train a corpus.
batch_size (int, (default=20)) – number of strings for each vectorization and attention. Only useful when using a transformer vectorizer.
- Returns
dictionary
- Return type
{‘linkage_matrix’: linkage_matrix, ‘titles’: titles}
-
malaya.cluster.
cluster_graph
(corpus, vectorizer, threshold=0.9, num_clusters=5, titles=None, colors=None, stopwords=<function get_stopwords>, ngram=(1, 3), cleaning=<function simple_textcleaning>, clustering=<class 'sklearn.cluster._kmeans.KMeans'>, figsize=(17, 9), with_labels=True, batch_size=20)[source]# plot undirected graph with similar texts.
- Parameters
corpus (List[str]) –
vectorizer (class) – vectorizer class.
threshold (float, (default=0.9)) – 0.9 means an absolute Pearson correlation above 0.9.
num_clusters (int, (default=5)) – size of unsupervised clusters.
titles (List[str], (default=None)) – list of titles, must be the same length as corpus.
stopwords (List[str], (default=malaya.text.function.get_stopwords)) – A callable that returns a List[str], or a List[str], or a Tuple[str].
cleaning (function, (default=malaya.text.function.simple_textcleaning)) – function to clean the corpus.
ngram (Tuple[int, int], (default=(1,3))) – n-grams size to train a corpus.
batch_size (int, (default=20)) – number of strings for each vectorization and attention. Only useful when using a transformer vectorizer.
- Returns
dictionary
- Return type
{‘G’: G, ‘pos’: pos, ‘node_colors’: node_colors, ‘node_labels’: node_labels}
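The threshold parameter is described in terms of absolute Pearson correlation between texts. A minimal sketch of turning document vectors into graph edges on that basis; the helper names, the toy vectors and the choice of `>=` for the comparison are assumptions, not the library's implementation:

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denominator == 0:
        return 0.0
    return sum(a * b for a, b in zip(dx, dy)) / denominator


def similar_pairs(vectors: List[Sequence[float]], threshold: float = 0.9) -> List[Tuple[int, int]]:
    """Return index pairs whose absolute correlation reaches the threshold."""
    return [
        (i, j)
        for i, j in combinations(range(len(vectors)), 2)
        if abs(pearson(vectors[i], vectors[j])) >= threshold
    ]


vectors = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 1.0, 2.0]]  # toy document vectors
print(similar_pairs(vectors, threshold=0.9))
```

Each returned pair would become an undirected edge between two texts; raising the threshold yields a sparser graph.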
-
malaya.cluster.
cluster_entity_linking
(corpus, vectorizer, entity_model, topic_modeling_model, threshold=0.3, topic_decomposition=2, topic_length=10, fuzzy_ratio=70, accepted_entities=['law', 'location', 'organization', 'person', 'event'], cleaning=<function simple_textcleaning>, colors=None, stopwords=<function get_stopwords>, max_df=1.0, min_df=1, ngram=(2, 3), figsize=(17, 9), batch_size=20)[source]# plot undirected graph for Entities and topics relationship.
- Parameters
corpus (list or str) –
vectorizer (class) –
titles (list) – list of titles, must be the same length as corpus.
colors (list) – list of colors, length must equal num_clusters.
threshold (float, (default=0.3)) – 0.3 means an absolute Pearson correlation above 0.3.
topic_decomposition (int, (default=2)) – size of decomposition.
topic_length (int, (default=10)) – size of topic models.
fuzzy_ratio (int, (default=70)) – size of ratio for fuzzywuzzy.
max_df (float, (default=1.0)) – maximum of a word selected based on document frequency.
min_df (int, (default=1)) – minimum of a word selected based on document frequency.
ngram (tuple, (default=(2,3))) – n-grams size to train a corpus.
cleaning (function, (default=simple_textcleaning)) – function to clean the corpus.
stopwords (List[str], (default=malaya.text.function.get_stopwords)) – A callable that returns a List[str], or a List[str], or a Tuple[str].
- Returns
dictionary
- Return type
{‘G’: G, ‘pos’: pos, ‘node_colors’: node_colors, ‘node_labels’: node_labels}
malaya.constituency#
-
malaya.constituency.
transformer
(model='xlnet', quantized=False, **kwargs)[source]# Load Transformer Constituency Parsing model, transfer learning Transformer + self attentive parsing.
- Parameters
model (str, optional (default='xlnet')) – Check available models at malaya.constituency.available_transformer().
quantized (bool, optional (default=False)) – If True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends entirely on the machine.
- Returns
result
- Return type
malaya.model.tf.Constituency class
malaya.coref#
-
malaya.coref.
parse_from_dependency
(models, string, references=['dia', 'itu', 'ini', 'saya', 'awak', 'kamu', 'kita', 'kami', 'mereka'], rejected_references=['saya', 'awak', 'kamu', 'kita', 'kami', 'mereka', 'nya'], acceptable_subjects=['flat', 'subj', 'nsubj', 'csubj', 'obj'], acceptable_nested_subjects=['compound', 'flat'], split_nya=True, aggregate=<function mean>, top_k=20)[source]# Apply Coreference Resolution using stacks of dependency models.
Kakak mempunyai kucing. Dia menyayanginya. (Dia -> Kakak, nya -> kucing)
Husein Zolkepli suka makan ayam. Dia pun suka makan daging. (Dia -> Husein Zolkepli)
- Parameters
models (list) – list of dependency models, each must have a vectorize method.
string (str) –
references (List[str], optional (default=['dia', 'itu', 'ini', 'saya', 'awak', 'kamu', 'kita', 'kami', 'mereka'])) – list of references.
rejected_references (List[str], optional (default=['saya', 'awak', 'kamu', 'kita', 'kami', 'mereka', 'nya'])) – list of rejected references during populating subjects.
acceptable_subjects (List[str], optional) – List of dependency labels for subjects.
acceptable_nested_subjects (List[str], optional) – List of dependency labels for nested subjects, eg, syarikat (obl) facebook (compound).
split_nya (bool, optional (default=True)) – split nya, eg, disifatkannya -> disifatkan, nya.
aggregate (Callable, optional (default=numpy.mean)) – Aggregate function to aggregate list of vectors from model.vectorize.
top_k (int, optional (default=20)) – only accept near top_k to assume a coherence.
- Returns
result – {'text': ['Husein', 'Zolkepli', 'suka', 'makan', 'ayam', '.', 'Dia', 'pun', 'suka', 'makan', 'daging', '.'], 'coref': {6: {'index': [0, 1], 'text': ['Husein', 'Zolkepli']}}}
- Return type
Dict[text, coref]
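The returned dictionary maps the position of each reference token to the tokens it refers to. A short sketch of consuming that structure, using the example result shown in Returns; `apply_coref` is a hypothetical helper, not a Malaya function:

```python
from typing import Dict, List


def apply_coref(result: Dict) -> List[str]:
    """Replace each reference token with the tokens it points to."""
    resolved: List[str] = []
    for position, token in enumerate(result['text']):
        if position in result['coref']:
            # Substitute the pronoun with its referent tokens.
            resolved.extend(result['coref'][position]['text'])
        else:
            resolved.append(token)
    return resolved


result = {
    'text': ['Husein', 'Zolkepli', 'suka', 'makan', 'ayam', '.',
             'Dia', 'pun', 'suka', 'makan', 'daging', '.'],
    'coref': {6: {'index': [0, 1], 'text': ['Husein', 'Zolkepli']}},
}
print(' '.join(apply_coref(result)))
```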
malaya.dependency#
-
malaya.dependency.
dependency_graph
(tagging, indexing)[source]# Return helper object for dependency parser results. Only accept tagging and indexing outputs from dependency models.
-
malaya.dependency.
available_transformer
(version='v2')[source]# List available transformer dependency parsing models.
- Parameters
version (str, optional (default='v2')) –
Version supported. Allowed values:
'v1' - version 1, maintained for knowledge graph.
'v2' - Trained on bigger dataset, better version.
-
malaya.dependency.
transformer
(version='v2', model='xlnet', quantized=False, **kwargs)[source]# Load Transformer Dependency Parsing model, transfer learning Transformer + biaffine attention.
- Parameters
version (str, optional (default='v2')) –
Version supported. Allowed values:
'v1' - version 1, maintained for knowledge graph, malaya_graph.text_to_kg.parser.from_dependency.
'v2' - Trained on bigger dataset, better version.
model (str, optional (default='xlnet')) – Check available models at malaya.dependency.available_transformer(version='{version}').
quantized (bool, optional (default=False)) – If True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends entirely on the machine.
- Returns
result – List of model classes:
if bert in model, will return malaya.model.bert.DependencyBERT.
if xlnet in model, will return malaya.model.xlnet.DependencyXLNET.
- Return type
model
-
malaya.dependency.
huggingface
(model='mesolitica/finetune-dependency-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]# Load HuggingFace model for dependency parsing.
- Parameters
model (str, optional (default='mesolitica/finetune-dependency-t5-small-standard-bahasa-cased')) – Check available models at malaya.dependency.available_huggingface().
force_check (bool, optional (default=True)) – Force a check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.
- Returns
result
- Return type
malaya.torch_model.huggingface.Dependency
malaya.emotion#
-
malaya.emotion.
multinomial
(**kwargs)[source]# Load multinomial emotion model.
- Returns
result
- Return type
malaya.model.ml.MulticlassBayes class
-
malaya.emotion.
transformer
(model='xlnet', quantized=False, **kwargs)[source]# Load Transformer emotion model.
- Parameters
model (str, optional (default='xlnet')) – Check available models at malaya.emotion.available_transformer().
quantized (bool, optional (default=False)) – If True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends entirely on the machine.
- Returns
result – List of model classes:
if bert in model, will return malaya.model.bert.MulticlassBERT.
if xlnet in model, will return malaya.model.xlnet.MulticlassXLNET.
if fastformer in model, will return malaya.model.fastformer.MulticlassFastFormer.
- Return type
model
malaya.entity#
-
malaya.entity.
describe_ontonotes5
()[source]# Describe OntoNotes5 Entities supported. https://spacy.io/api/annotation#named-entities
-
malaya.entity.
available_transformer_ontonotes5
()[source]# List available transformer Entity Tagging models trained on Ontonotes 5 Bahasa.
-
malaya.entity.
transformer
(model='xlnet', quantized=False, **kwargs)[source]# Load Transformer Entity Tagging model trained on Malaya Entity, transfer learning Transformer + CRF.
- Parameters
model (str, optional (default='xlnet')) – Check available models at malaya.entity.available_transformer().
quantized (bool, optional (default=False)) – If True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends entirely on the machine.
- Returns
result – List of model classes:
if bert in model, will return malaya.model.bert.TaggingBERT.
if xlnet in model, will return malaya.model.xlnet.TaggingXLNET.
if fastformer in model, will return malaya.model.fastformer.TaggingFastFormer.
- Return type
model
-
malaya.entity.
transformer_ontonotes5
(model='xlnet', quantized=False, **kwargs)[source]# Load Transformer Entity Tagging model trained on Ontonotes 5 Bahasa, transfer learning Transformer + CRF.
- Parameters
model (str, optional (default='xlnet')) – Check available models at malaya.entity.available_transformer_ontonotes5().
quantized (bool, optional (default=False)) – If True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends entirely on the machine.
- Returns
result – List of model classes:
if bert in model, will return malaya.model.bert.TaggingBERT.
if xlnet in model, will return malaya.model.xlnet.TaggingXLNET.
if fastformer in model, will return malaya.model.fastformer.TaggingFastFormer.
- Return type
model
-
malaya.entity.
general_entity
(model=None)[source]# Load Regex based general entities tagging along with another supervised entity tagging model.
- Parameters
model (object) – model must have a predict method. Make sure the predict method returns [(string, label), (string, label)].
- Returns
result
- Return type
malaya.text.entity.EntityRegex class
malaya.jawi_rumi#
-
malaya.jawi_rumi.
transformer
(model='base', quantized=False, **kwargs)[source]# Load transformer encoder-decoder model to convert jawi to rumi.
- Parameters
model (str, optional (default='base')) – Check available models at malaya.jawi_rumi.available_transformer().
quantized (bool, optional (default=False)) – If True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends entirely on the machine.
- Returns
result
- Return type
malaya.model.tf.JawiRumi class
malaya.kg_to_text#
-
malaya.kg_to_text.
huggingface
(model='mesolitica/finetune-ttkg-t5-small-standard-bahasa-cased', **kwargs)[source]# Load HuggingFace model for knowledge graph to text generation.
- Parameters
model (str, optional (default='mesolitica/finetune-ttkg-t5-small-standard-bahasa-cased')) – Check available models at malaya.kg_to_text.available_huggingface().
- Returns
result
- Return type
malaya.torch_model.huggingface.KGtoText
malaya.language_detection#
-
malaya.language_detection.
fasttext
(quantized=True, **kwargs)[source]# Load Fasttext language detection model. Original size is 353MB, Quantized size 31.1MB.
- Parameters
quantized (bool, optional (default=True)) – if True, load quantized fasttext model. Else, load original fasttext model.
- Returns
result
- Return type
malaya.model.ml.LanguageDetection class
-
malaya.language_detection.
deep_model
(quantized=False, **kwargs)[source]# Load deep learning language detection model. Original size is 51.2MB, Quantized size 12.8MB.
- Parameters
quantized (bool, optional (default=False)) – If True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends entirely on the machine.
- Returns
result
- Return type
malaya.model.tf.DeepLang class
-
malaya.language_detection.
substring_rules
(model, **kwargs)[source]# Detect EN, MS, MANDARIN and OTHER languages in a string.
EN word detection uses pyenchant from https://pyenchant.github.io/pyenchant/ and the user-supplied language detection model.
MS word detection uses malaya.text.function.is_malay and the user-supplied language detection model.
OTHER word detection uses any language detection classification model, such as malaya.language_detection.fasttext or malaya.language_detection.deep_model.
- Parameters
model (Callable) – Callable model, must have predict method.
- Returns
result
- Return type
malaya.model.rules.LanguageDict class
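The only stated requirement on `model` is that it exposes a `predict` method. The sketch below shows a minimal object with that shape. `KeywordLanguageModel` is a hypothetical stand-in, not part of Malaya, and it assumes `predict` takes a list of strings and returns one label per string.

```python
from typing import Dict, List, Set


class KeywordLanguageModel:
    """Hypothetical stand-in for a language detection model; only `predict` matters."""

    def __init__(self, keywords: Dict[str, Set[str]]):
        # keywords maps a label to the set of words that signal it
        self.keywords = keywords

    def predict(self, strings: List[str]) -> List[str]:
        labels = []
        for string in strings:
            tokens = set(string.lower().split())
            # pick the first label sharing a token with the input, else 'OTHER'
            label = next(
                (name for name, words in self.keywords.items() if tokens & words),
                'OTHER',
            )
            labels.append(label)
        return labels


model = KeywordLanguageModel({'MS': {'saya', 'makan'}, 'EN': {'i', 'eat'}})
predictions = model.predict(['saya makan nasi', 'i eat rice', 'bonjour'])
```

In practice the object passed to substring_rules would more likely come from malaya.language_detection.fasttext or malaya.language_detection.deep_model, as the description above suggests.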
malaya.language_model#
-
malaya.language_model.
kenlm
(model='dump-combined', **kwargs)[source]# Load KenLM language model.
- Parameters
model (str, optional (default='dump-combined')) – Check available models at malaya.language_model.available_models().
- Returns
result
- Return type
kenlm.Model class
-
malaya.language_model.
gpt2
(model='mesolitica/gpt2-117m-bahasa-cased', force_check=True, **kwargs)[source]# Load GPT2 language model.
- Parameters
model (str, optional (default='mesolitica/gpt2-117m-bahasa-cased')) – Check available models at malaya.language_model.available_gpt2().
force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.
- Returns
result
- Return type
malaya.torch_model.gpt2_lm.LM class
-
malaya.language_model.
mlm
(model='malay-huggingface/bert-tiny-bahasa-cased', force_check=True, **kwargs)[source]# Load Masked language model.
- Parameters
model (str, optional (default='malay-huggingface/bert-tiny-bahasa-cased')) – Check available models at malaya.language_model.available_mlm().
force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.
- Returns
result
- Return type
malaya.torch_model.mask_lm.MLMScorer class
malaya.lexicon#
-
malaya.lexicon.
random_walk
(lexicon, wordvector, pool_size=10, top_n=20, similarity_power=10.0, beta=0.9, arccos=True, normalization=True, soft=False, silent=False)[source]# Induce a lexicon using the random walk technique from the paper https://arxiv.org/pdf/1606.02820.pdf
- Parameters
lexicon (Dict[str, List[str]]) – curated lexicon from expert domain, {‘label1’: [str], ‘label2’: [str]}.
wordvector (object) – wordvector interface object.
pool_size (int, optional (default=10)) – number of top words to pick from each lexicon.
top_n (int, optional (default=20)) – the top_n entries for each vector are multiplied by similarity_power.
similarity_power (float, optional (default=10.0)) – extra score for top_n. A lower value induces less bias but has a higher chance of an unbalanced outcome.
beta (float, optional (default=0.9)) – penalty score; values closer to 1.0 mean less penalty. 0 < beta < 1.
arccos (bool, optional (default=True)) – covariance distribution for embedded.dot(embedded.T). If False, covariance + 1.
normalization (bool, optional (default=True)) – normalize word vectors using the L2 norm. L2 is good for penalizing skewed vectors.
soft (bool, optional (default=False)) – if True, a word not in the dictionary is replaced with the nearest word by Jaro-Winkler ratio. If False, an exception is raised when a word is not in the dictionary.
silent (bool, optional (default=False)) – if True, will not print any logs.
- Returns
result
- Return type
tuple(labels[argmax(scores, axis=1)], scores, labels)
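The first element of the returned tuple is the highest-scoring label for each word. A minimal sketch of that selection follows, assuming `scores` has one row per word and one column per label; `pick_labels` is a hypothetical helper, not a Malaya function.

```python
def pick_labels(scores, labels):
    """Return the highest-scoring label for every row of `scores`."""
    picked = []
    for row in scores:
        # index of the largest score in this row (first one wins on ties)
        best = max(range(len(row)), key=row.__getitem__)
        picked.append(labels[best])
    return picked


labels = ['negative', 'positive']
scores = [[0.2, 0.8], [0.7, 0.3], [0.5, 0.5]]
picked = pick_labels(scores, labels)
```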
-
malaya.lexicon.
propagate_probabilistic
(lexicon, wordvector, pool_size=10, top_n=20, similarity_power=10.0, arccos=True, normalization=True, soft=False, silent=False)[source]# Learns polarity scores via standard label propagation from lexicon sets.
- Parameters
lexicon (Dict[str, List[str]]) – curated lexicon from expert domain, {‘label1’: [str], ‘label2’: [str]}.
wordvector (object) – wordvector interface object.
pool_size (int, optional (default=10)) – number of top words to pick from each lexicon.
top_n (int, optional (default=20)) – the top_n entries for each vector are multiplied by similarity_power.
similarity_power (float, optional (default=10.0)) – extra score for top_n. A lower value induces less bias but has a higher chance of an unbalanced outcome.
arccos (bool, optional (default=True)) – covariance distribution for embedded.dot(embedded.T). If False, covariance + 1.
normalization (bool, optional (default=True)) – normalize word vectors using the L2 norm. L2 is good for penalizing skewed vectors.
soft (bool, optional (default=False)) – if True, a word not in the dictionary is replaced with the nearest word by Jaro-Winkler ratio. If False, an exception is raised when a word is not in the dictionary.
silent (bool, optional (default=False)) – if True, will not print any logs.
- Returns
result
- Return type
tuple(labels[argmax(scores, axis=1)], scores, labels)
-
malaya.lexicon.
propagate_graph
(lexicon, wordvector, pool_size=10, top_n=20, similarity_power=10.0, normalization=True, soft=False, silent=False)[source]# Graph propagation method adapted from Velikovich, Leonid, et al. “The viability of web-derived polarity lexicons.” http://www.aclweb.org/anthology/N10-1119
- Parameters
lexicon (Dict[str, List[str]]) – curated lexicon from expert domain, {‘label1’: [str], ‘label2’: [str]}.
wordvector (object) – wordvector interface object.
pool_size (int, optional (default=10)) – number of top words to pick from each lexicon.
top_n (int, optional (default=20)) – the top_n entries for each vector are multiplied by similarity_power.
similarity_power (float, optional (default=10.0)) – extra score for top_n. A lower value induces less bias but has a higher chance of an unbalanced outcome.
normalization (bool, optional (default=True)) – normalize word vectors using the L2 norm. L2 is good for penalizing skewed vectors.
soft (bool, optional (default=False)) – if True, a word not in the dictionary is replaced with the nearest word by Jaro-Winkler ratio. If False, an exception is raised when a word is not in the dictionary.
silent (bool, optional (default=False)) – if True, will not print any logs.
- Returns
result
- Return type
tuple(labels[argmax(scores, axis=1)], scores, labels)
malaya.nsfw#
malaya.num2word#
-
malaya.num2word.
to_cardinal
(number)[source]# Translate from number input to cardinal text representation
- Parameters
number (real number) –
- Returns
result – cardinal representation
- Return type
str
-
malaya.num2word.
to_ordinal
(number)[source]# Translate from number input to ordinal text representation
- Parameters
number (real number) –
- Returns
result – ordinal representation
- Return type
str
-
malaya.num2word.
to_ordinal_num
(number)[source]# Translate from number input to ordinal numbering text representation
- Parameters
number (int) –
- Returns
result – ordinal numbering representation
- Return type
str
malaya.paraphrase#
-
malaya.paraphrase.
transformer
(model='small-t5', quantized=False, **kwargs)[source]# Load Malaya transformer encoder-decoder model to paraphrase.
- Parameters
model (str, optional (default='small-t5')) – Check available models at malaya.paraphrase.available_transformer().
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends entirely on the machine.
- Returns
result – List of model classes:
if t5 in model, will return malaya.model.t5.Paraphrase.
- Return type
model
-
malaya.paraphrase.
huggingface
(model='mesolitica/finetune-paraphrase-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]# Load HuggingFace model to paraphrase.
- Parameters
model (str, optional (default='mesolitica/finetune-paraphrase-t5-small-standard-bahasa-cased')) – Check available models at malaya.paraphrase.available_huggingface().
force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.
- Returns
result
- Return type
malaya.phoneme#
-
malaya.phoneme.
deep_model_dbp
(quantized=False, **kwargs)[source]# Load LSTM + Bahdanau Attention phonetic model, 256 filter size, 2 layers, character level. Original data from https://prpm.dbp.gov.my/ Glosari Dialek.
Original size 10.4MB, quantized size 2.77MB.
- Parameters
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends entirely on the machine.
- Returns
result
- Return type
malaya.model.tf.Seq2SeqLSTM class
-
malaya.phoneme.
deep_model_ipa
(quantized=False, **kwargs)[source]# Load LSTM + Bahdanau Attention phonetic model, 256 filter size, 2 layers, character level. Original data from https://github.com/open-dict-data/ipa-dict/blob/master/data/ma.txt
Original size 10.4MB, quantized size 2.77MB.
- Parameters
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends entirely on the machine.
- Returns
result
- Return type
malaya.model.tf.Seq2SeqLSTM_Split class
malaya.pos#
-
malaya.pos.
available_transformer
()[source]# List available transformer Part-Of-Speech Tagging models.
-
malaya.pos.
naive
(string)[source]# Recognize POS in a string using Regex.
- Parameters
string (str) –
- Returns
string
- Return type
List[Tuple[str, str]]
-
malaya.pos.
transformer
(model='xlnet', quantized=False, **kwargs)[source]# Load Transformer POS Tagging model, transfer learning Transformer + CRF.
- Parameters
model (str, optional (default='xlnet')) – Check available models at malaya.pos.available_transformer().
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends entirely on the machine.
- Returns
result – List of model classes:
if bert in model, will return malaya.model.bert.TaggingBERT.
if xlnet in model, will return malaya.model.xlnet.TaggingXLNET.
- Return type
model
malaya.preprocessing#
-
malaya.preprocessing.
preprocessing
(normalize=['url', 'email', 'percent', 'money', 'phone', 'user', 'time', 'date', 'number'], annotate=['allcaps', 'elongated', 'repeated', 'emphasis', 'censored', 'hashtag'], lowercase=True, fix_unidecode=True, expand_english_contractions=True, segmenter=None, demoji=None, **kwargs)[source]# Load Preprocessing class.
- Parameters
normalize (List[str], optional (default=['url', 'email', 'percent', 'money', 'phone', 'user', 'time', 'date', 'number'])) – tokens to normalize; see all supported values at malaya.preprocessing.get_normalize().
annotate (List[str], optional (default=['allcaps', 'elongated', 'repeated', 'emphasis', 'censored', 'hashtag'])) – annotate tokens <open></open>, only accepts [‘hashtag’, ‘allcaps’, ‘elongated’, ‘repeated’, ‘emphasis’, ‘censored’].
lowercase (bool, optional (default=True)) –
fix_unidecode (bool, optional (default=True)) – fix unidecode using ftfy.fix_text.
expand_english_contractions (bool, optional (default=True)) – expand English contractions.
segmenter (Callable, optional (default=None)) – function to segment words. If provided, it expands hashtags, e.g. #mondayblues becomes monday blues.
demoji (object, optional (default=None)) – demoji object; must have a demoji method.
- Returns
result
- Return type
malaya.preprocessing.Preprocessing class
-
malaya.preprocessing.
demoji
()[source]# Download the latest Malay emoji descriptions from https://github.com/huseinzol05/malay-dataset/tree/master/dictionary/emoji
- Returns
result
- Return type
malaya.preprocessing.Demoji class
malaya.relevancy#
-
malaya.relevancy.
available_transformer
()[source]# List available transformer relevancy analysis models.
-
malaya.relevancy.
transformer
(model='xlnet', quantized=False, **kwargs)[source]# Load Transformer relevancy model.
- Parameters
model (str, optional (default='xlnet')) – Check available models at malaya.relevancy.available_transformer().
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends entirely on the machine.
- Returns
result – List of model classes:
if bert in model, will return malaya.model.bert.MulticlassBERT.
if xlnet in model, will return malaya.model.xlnet.MulticlassXLNET.
if bigbird in model, will return malaya.model.xlnet.MulticlassBigBird.
if fastformer in model, will return malaya.model.fastformer.MulticlassFastFormer.
- Return type
model
malaya.rumi_jawi#
-
malaya.rumi_jawi.
transformer
(model='base', quantized=False, **kwargs)[source]# Load transformer encoder-decoder model to convert rumi to jawi.
- Parameters
model (str, optional (default='base')) – Check available models at malaya.rumi_jawi.available_transformer().
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends entirely on the machine.
- Returns
result
- Return type
malaya.model.tf.RumiJawi class
malaya.segmentation#
-
malaya.segmentation.
viterbi
(max_split_length=20, **kwargs)[source]# Load Segmenter class using viterbi algorithm.
- Parameters
max_split_length (int, (default=20)) – max length of words in a sentence to segment
validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
- Returns
result
- Return type
malaya.segmentation.Segmenter class
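As background, Viterbi word segmentation picks the split of an unspaced string that maximizes the product of word probabilities. The sketch below is a generic unigram version, not Malaya's implementation; `viterbi_segment`, `toy_probs`, `max_word_length` and `unknown_prob` are hypothetical names, and `max_word_length` is not claimed to match `max_split_length`.

```python
import math


def viterbi_segment(text, word_probs, max_word_length=20, unknown_prob=1e-10):
    """Split `text` into the most probable word sequence under a unigram model."""
    n = len(text)
    # best[i] holds (log-probability, start index of last word) for text[:i]
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_length), end):
            word = text[start:end]
            if word in word_probs:
                log_prob = math.log(word_probs[word])
            else:
                # unknown words are penalised once per character
                log_prob = len(word) * math.log(unknown_prob)
            score = best[start][0] + log_prob
            if score > best[end][0]:
                best[end] = (score, start)
    # walk the back-pointers from the end of the string
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


toy_probs = {'monday': 0.4, 'blues': 0.3, 'mon': 0.1, 'day': 0.1, 'blue': 0.1}
segments = viterbi_segment('mondayblues', toy_probs)
```

A function of this shape, taking a string and returning its words, is also the kind of callable the `segmenter` parameter of malaya.preprocessing.preprocessing describes for expanding hashtags, although the exact return format expected there is not stated in this reference.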
-
malaya.segmentation.
transformer
(model='small', quantized=False, **kwargs)[source]# Load transformer encoder-decoder model for segmentation.
- Parameters
model (str, optional (default='small')) – Check available models at malaya.segmentation.available_transformer().
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends entirely on the machine.
- Returns
result
- Return type
malaya.model.tf.Segmentation class
-
malaya.segmentation.
huggingface
(model='mesolitica/finetune-segmentation-t5-tiny-standard-bahasa-cased', force_check=True, **kwargs)[source]# Load HuggingFace model for segmentation.
- Parameters
model (str, optional (default='mesolitica/finetune-segmentation-t5-tiny-standard-bahasa-cased')) – Check available models at malaya.segmentation.available_huggingface().
force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.
- Returns
result
- Return type
malaya.sentiment#
-
malaya.sentiment.
available_transformer
()[source]# List available transformer sentiment analysis models.
-
malaya.sentiment.
multinomial
(**kwargs)[source]# Load multinomial sentiment model.
- Returns
result
- Return type
malaya.model.ml.Bayes class
-
malaya.sentiment.
transformer
(model='bert', quantized=False, **kwargs)[source]# Load Transformer sentiment model.
- Parameters
model (str, optional (default='bert')) – Check available models at malaya.sentiment.available_transformer().
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends entirely on the machine.
- Returns
result – List of model classes:
if bert in model, will return malaya.model.bert.MulticlassBERT.
if xlnet in model, will return malaya.model.xlnet.MulticlassXLNET.
if fastformer in model, will return malaya.model.fastformer.MulticlassFastFormer.
- Return type
model
malaya.stack#
-
malaya.stack.
voting_stack
(models, text)[source]# Stacking for POS, Entities and Dependency models.
- Parameters
models (list) – list of models.
text (str) – string to predict.
- Returns
result
- Return type
list
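One common way to stack tagging models is a per-token majority vote. The sketch below illustrates that idea on (token, tag) outputs; it is not confirmed to be how voting_stack works internally, `majority_vote` is a hypothetical helper, and the tags are toy values.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model (token, tag) sequences by majority vote on each token."""
    voted = []
    # zip(*predictions) groups the i-th (token, tag) pair of every model
    for pairs in zip(*predictions):
        token = pairs[0][0]
        tags = Counter(tag for _, tag in pairs)
        voted.append((token, tags.most_common(1)[0][0]))
    return voted


model_outputs = [
    [('Husein', 'PROPN'), ('makan', 'VERB'), ('nasi', 'NOUN')],
    [('Husein', 'PROPN'), ('makan', 'NOUN'), ('nasi', 'NOUN')],
    [('Husein', 'NOUN'), ('makan', 'VERB'), ('nasi', 'NOUN')],
]
voted = majority_vote(model_outputs)
```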
-
malaya.stack.
predict_stack
(models, strings, aggregate=<function gmean>, **kwargs)[source]# Stacking for predictive models.
- Parameters
models (List[Callable]) – list of models.
strings (List[str]) –
aggregate (Callable, optional (default=scipy.stats.mstats.gmean)) – Aggregate function.
- Returns
result
- Return type
dict
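The default `aggregate` is a geometric mean. A minimal sketch of aggregating per-label probabilities from several models this way follows, using a pure-Python geometric mean instead of scipy. `aggregate_probabilities` is a hypothetical helper, and the per-model dict format is an assumption made for illustration.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive numbers, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def aggregate_probabilities(per_model, aggregate=geometric_mean):
    """Aggregate {label: probability} dicts from several models into one dict."""
    labels = per_model[0].keys()
    return {label: aggregate([probs[label] for probs in per_model]) for label in labels}


model_a = {'positive': 0.9, 'negative': 0.1}
model_b = {'positive': 0.4, 'negative': 0.6}
combined = aggregate_probabilities([model_a, model_b])
```

Compared with an arithmetic mean, the geometric mean pulls the combined score down sharply when any single model assigns a low probability, so a label needs support from all models to score well.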
malaya.stem#
-
malaya.stem.
naive
()[source]# Load a naive stemming model that strips prefixes and suffixes using regex patterns.
- Returns
result
- Return type
malaya.stem.Naive class
-
malaya.stem.
sastrawi
()[source]# Load stemming model using Sastrawi; this also includes lemmatization.
- Returns
result
- Return type
malaya.stem.Sastrawi class
-
malaya.stem.
deep_model
(model='base', quantized=False, **kwargs)[source]# Load LSTM + Bahdanau Attention stemming model, BPE level (YouTokenToMe 1000 vocab size). This model also includes lemmatization.
- Parameters
model (str, optional (default='base')) – Check available models at malaya.stem.available_deep_model().
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends entirely on the machine.
- Returns
result
- Return type
malaya.stem.DeepStemmer class
-
class
malaya.stem.
DeepStemmer
[source]# -
greedy_decoder
(string)[source]# Stem a string using the greedy decoder; this also includes lemmatization.
- Parameters
string (str) –
- Returns
result
- Return type
str
-
beam_decoder
(string)[source]# Stem a string using the beam decoder; this also includes lemmatization.
- Parameters
string (str) –
- Returns
result
- Return type
str
-
predict
(string, beam_search=False)[source]# Stem a string; this also includes lemmatization.
- Parameters
string (str) –
beam_search (bool, optional (default=False)) – If True, use beam search decoder, else use greedy decoder.
- Returns
result
- Return type
str
-
malaya.subjectivity#
-
malaya.subjectivity.
available_transformer
()[source]# List available transformer subjective analysis models.
-
malaya.subjectivity.
multinomial
(**kwargs)[source]# Load multinomial subjectivity model.
- Parameters
validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
- Returns
result
- Return type
malaya.model.ml.Bayes class
-
malaya.subjectivity.
transformer
(model='bert', quantized=False, **kwargs)[source]# Load Transformer subjectivity model.
- Parameters
model (str, optional (default='bert')) – Check available models at malaya.subjectivity.available_transformer().
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends entirely on the machine.
- Returns
result – List of model classes:
if bert in model, will return malaya.model.bert.BinaryBERT.
if xlnet in model, will return malaya.model.xlnet.BinaryXLNET.
if fastformer in model, will return malaya.model.fastformer.BinaryFastFormer.
- Return type
model
malaya.syllable#
-
malaya.syllable.
rules
(**kwargs)[source]# Load rules-based syllable tokenizer, originally from https://github.com/fahadh4ilyas/syllable_splitter/blob/master/SyllableSplitter.py, with these improvements: the double vowel ua in cuaca, based on https://en.wikipedia.org/wiki/Comparison_of_Indonesian_and_Standard_Malay#Syllabification; the double consonant ns in rans, based on https://www.semanticscholar.org/paper/Syllabification-algorithm-based-on-syllable-rules-Musa-Kadir/a819f255f066ae0fd7a30b3534de41da37d04ea1; and the double vowels au and ai.
- Returns
result
- Return type
malaya.syllable.Tokenizer class
-
malaya.syllable.
deep_model
(model='base', quantized=False, **kwargs)[source]# Load LSTM + Bahdanau Attention syllable tokenizer model, BPE level (YouTokenToMe 300 vocab size).
- Parameters
model (str, optional (default='base')) – Check available models at malaya.syllable.available_deep_model().
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends entirely on the machine.
- Returns
result
- Return type
malaya.syllable.DeepSyllable class
-
class
malaya.syllable.
Tokenizer
[source]# -
tokenize
(string)[source]# Tokenize string into multiple strings using syllable patterns. Examples from https://www.semanticscholar.org/paper/Syllabification-algorithm-based-on-syllable-rules-Musa-Kadir/a819f255f066ae0fd7a30b3534de41da37d04ea1/figure/0:
‘cuaca’ -> [‘cua’, ‘ca’]
‘insurans’ -> [‘in’, ‘su’, ‘rans’]
‘praktikal’ -> [‘prak’, ‘ti’, ‘kal’]
‘strategi’ -> [‘stra’, ‘te’, ‘gi’]
‘ayam’ -> [‘a’, ‘yam’]
‘anda’ -> [‘an’, ‘da’]
‘hantu’ -> [‘han’, ‘tu’]
- Parameters
string (str) –
- Returns
result
- Return type
List[str]
-
malaya.tatabahasa#
-
malaya.tatabahasa.
describe_tagging
()[source]# Describe the supported grammatical error (kesalahan tatabahasa) tags. Full description at https://tatabahasabm.tripod.com/tata/salahtata.htm
-
malaya.tatabahasa.
transformer
(model='base', quantized=False, **kwargs)[source]# Load Malaya transformer encoder-decoder + tagging model to correct grammatical errors (kesalahan tatabahasa) in text.
- Parameters
model (str, optional (default='base')) – Check available models at malaya.tatabahasa.available_transformer().
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends entirely on the machine.
- Returns
result
- Return type
malaya.model.tf.Tatabahasa class
-
malaya.tatabahasa.
huggingface
(model='mesolitica/finetune-tatabahasa-t5-small-standard-bahasa-cased', force_check=True, **kwargs)[source]# Load HuggingFace model to fix grammatical errors (kesalahan tatabahasa).
- Parameters
model (str, optional (default='mesolitica/finetune-tatabahasa-t5-small-standard-bahasa-cased')) – Check available models at malaya.tatabahasa.available_huggingface().
force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.
- Returns
result
- Return type
malaya.tokenizer#
malaya.toxicity#
-
malaya.toxicity.
available_transformer
()[source]# List available transformer toxicity analysis models.
-
malaya.toxicity.
multinomial
(**kwargs)[source]# Load multinomial toxicity model.
- Returns
result
- Return type
malaya.model.ml.MultilabelBayes class
-
malaya.toxicity.
transformer
(model='xlnet', quantized=False, **kwargs)[source]# Load Transformer toxicity model.
- Parameters
model (str, optional (default='xlnet')) –
Model architecture supported. Allowed values:
'bert' - Google BERT BASE parameters.
'tiny-bert' - Google BERT TINY parameters.
'albert' - Google ALBERT BASE parameters.
'tiny-albert' - Google ALBERT TINY parameters.
'xlnet' - Google XLNET BASE parameters.
'alxlnet' - Malaya ALXLNET BASE parameters.
'fastformer' - FastFormer BASE parameters.
'tiny-fastformer' - FastFormer TINY parameters.
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends entirely on the machine.
- Returns
result – List of model classes:
if bert in model, will return malaya.model.bert.SigmoidBERT.
if xlnet in model, will return malaya.model.xlnet.SigmoidXLNET.
if fastformer in model, will return malaya.model.fastformer.SigmoidFastFormer.
- Return type
model
malaya.transformer#
-
malaya.transformer.
load
(model='electra', pool_mode='last', **kwargs)[source]# Load transformer model.
- Parameters
model (str, optional (default='electra')) – Check available models at malaya.transformer.available_transformer().
pool_mode (str, optional (default='last')) –
Model logits architecture supported. Only usable if model in [‘xlnet’, ‘alxlnet’]. Allowed values:
'last' - last of the sequence.
'first' - first of the sequence.
'mean' - mean of the sequence.
'attn' - attention of the sequence.
- Returns
result – List of model classes:
if bert in model, will return malaya.transformers.bert.Model.
if xlnet in model, will return malaya.transformers.xlnet.Model.
if albert in model, will return malaya.transformers.albert.Model.
if electra in model, will return malaya.transformers.electra.Model.
- Return type
model
-
malaya.transformer.
huggingface
(model='mesolitica/electra-base-generator-bahasa-cased', force_check=True, **kwargs)[source]# Load transformer model.
- Parameters
model (str, optional (default='mesolitica/electra-base-generator-bahasa-cased')) – Check available models at malaya.transformer.available_huggingface().
force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.
malaya.true_case#
-
malaya.true_case.
transformer
(model='base', quantized=False, **kwargs)[source]# Load transformer encoder-decoder model for true casing.
- Parameters
model (str, optional (default='base')) – Check available models at malaya.true_case.available_transformer().
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it depends entirely on the machine.
- Returns
result
- Return type
malaya.model.tf.TrueCase class
-
malaya.true_case.
huggingface
(model='mesolitica/finetune-true-case-t5-tiny-standard-bahasa-cased', force_check=True, **kwargs)[source]# Load HuggingFace model to true case.
- Parameters
model (str, optional (default='mesolitica/finetune-true-case-t5-tiny-standard-bahasa-cased')) – Check available models at malaya.true_case.available_huggingface().
force_check (bool, optional (default=True)) – Force check that the model is one of the Malaya models. Set to False if you have your own HuggingFace model.
- Returns
result
- Return type
malaya.word2num#
malaya.wordvector#
-
malaya.wordvector.
load
(model='wikipedia', **kwargs)[source]# Return malaya.wordvector.WordVector object.
- Parameters
model (str, optional (default='wikipedia')) – Check available models at malaya.wordvector.available_wordvector().
- Returns
vocabulary (indices dictionary for vector.)
vector (np.array, 2D.)
-
class
malaya.wordvector.
WordVector
(embed_matrix, dictionary, **kwargs)[source]# -
get_vector_by_name
(word, soft=False, topn_soft=5)[source]# get vector based on string.
- Parameters
word (str) –
soft (bool, (default=False)) – if True, a word not in the dictionary is replaced with the nearest word by Jaro-Winkler ratio. If False, an exception is raised when a word is not in the dictionary.
topn_soft (int, (default=5)) – if the word is not found in the dictionary, return the topn_soft most similar words using Jaro-Winkler.
- Returns
vector
- Return type
np.array, 1D
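The `soft` behaviour can be pictured as a fuzzy fallback on the vocabulary. The sketch below uses difflib's similarity ratio from the standard library as a stand-in; it is not Jaro-Winkler and not Malaya's code. `lookup` is a hypothetical helper, and returning the vector of the single closest match is an assumption.

```python
import difflib


def lookup(word, vectors, soft=False, topn_soft=5):
    """Return the vector for `word`, optionally falling back to similar spellings."""
    if word in vectors:
        return vectors[word]
    if not soft:
        raise KeyError(f'{word} not in dictionary')
    # difflib's ratio is a stand-in here; it is not the Jaro-Winkler ratio
    candidates = difflib.get_close_matches(word, list(vectors), n=topn_soft, cutoff=0.0)
    if not candidates:
        raise KeyError(f'no similar word found for {word}')
    # candidates are sorted from most to least similar
    return vectors[candidates[0]]


toy_vectors = {'najib': [1.0, 0.0], 'anwar': [0.0, 1.0]}
vector = lookup('najb', toy_vectors, soft=True)
```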
-
tree_plot
(labels, figsize=(7, 7), annotate=True)[source]# plot a tree plot based on output from calculator / n_closest / analogy.
- Parameters
labels (list) – output from calculator / n_closest / analogy.
visualize (bool) – if True, it will render plt.show, else return data.
figsize (tuple, (default=(7, 7))) – figure size for plot.
- Returns
embed (np.array, 2D.)
labelled (labels for X / Y axis.)
-
scatter_plot
(labels, centre=None, figsize=(7, 7), plus_minus=25, handoff=5e-05)[source]# plot a scatter plot based on output from calculator / n_closest / analogy.
- Parameters
labels (list) – output from calculator / n_closest / analogy
centre (str, (default=None)) – centre label, if a str, it will annotate in a red color.
figsize (tuple, (default=(7, 7))) – figure size for plot.
- Returns
tsne
- Return type
np.array, 2D.
-
batch_calculator
(equations, num_closest=5, return_similarity=False)[source]# batch calculator parser for word2vec using tensorflow.
- Parameters
equations (list of str) – Eg, ‘[(mahathir + najib) - rosmah]’
num_closest (int, (default=5)) – number of words closest to the result.
- Returns
word_list
- Return type
list of nearest words
-
calculator
(equation, num_closest=5, metric='cosine', return_similarity=True)[source]# calculator parser for word2vec.
- Parameters
equation (str) – Eg, ‘(mahathir + najib) - rosmah’
num_closest (int, (default=5)) – number of words closest to the result.
metric (str, (default='cosine')) – vector distance algorithm.
return_similarity (bool, (default=True)) – if True, also returns a value between 0 and 1 representing the distance.
- Returns
word_list
- Return type
list of nearest words
-
batch_n_closest
(words, num_closest=5, return_similarity=False, soft=True)[source]# find nearest words based on a batch of words using Tensorflow.
- Parameters
words (list) – Eg, [‘najib’,’anwar’]
num_closest (int, (default=5)) – number of words closest to the result.
return_similarity (bool, (default=False)) – if True, also returns a value between 0 and 1 representing the distance.
soft (bool, (default=True)) – if True, a word not in the dictionary is replaced with the nearest word by Jaro-Winkler ratio. If False, an exception is raised when a word is not in the dictionary.
- Returns
word_list
- Return type
list of nearest words
-
n_closest
(word, num_closest=5, metric='cosine', return_similarity=True)[source]# find nearest words based on a word.
- Parameters
word (str) – Eg, ‘najib’
num_closest (int, (default=5)) – number of words closest to the result.
metric (str, (default='cosine')) – vector distance algorithm.
return_similarity (bool, (default=True)) – if True, also returns a value between 0 and 1 representing the distance.
- Returns
word_list
- Return type
list of nearest words
-
analogy
(a, b, c, num=1, metric='cosine')[source]# analogy calculation, vb - va + vc.
- Parameters
a (str) –
b (str) –
c (str) –
num (int, (default=1)) –
metric (str, (default='cosine')) – vector distance algorithm.
- Returns
word_list
- Return type
list of nearest words.
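The analogy arithmetic vb - va + vc can be sketched with plain lists and cosine similarity. The function below is an illustration on toy vectors, not Malaya's implementation; `toy_vectors` is made up, and excluding the three input words from the candidates is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(a, b, c, vectors, num=1):
    """Rank words by cosine similarity to vb - va + vc, excluding the inputs."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    ranked = sorted(candidates, key=lambda w: cosine(vectors[w], target), reverse=True)
    return ranked[:num]


# toy 3-dimensional vectors: (male, female, royal)
toy_vectors = {
    'lelaki': [1.0, 0.0, 0.0],
    'perempuan': [0.0, 1.0, 0.0],
    'raja': [1.0, 0.0, 1.0],
    'permaisuri': [0.0, 1.0, 1.0],
    'epal': [0.5, 0.5, 0.0],
}
nearest = analogy('lelaki', 'raja', 'perempuan', toy_vectors)
```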
-
project_2d
(start, end)[source]# project word2vec into 2d dimension.
- Parameters
start (int) –
end (int) –
- Returns
embed_2d (TSNE decomposition)
word_list (words in between start and end.)
-
network
(word, num_closest=8, depth=4, min_distance=0.5, iteration=300, figsize=(15, 15), node_color='#72bbd0', node_factor=50)[source]# plot a social network based on word given
- Parameters
word (str) – centre of social network.
num_closest (int, (default=8)) – number of words closest to the node.
depth (int, (default=4)) – depth of social network. Deeper networks are more expensive to compute, O(num_closest ** depth).
min_distance (float, (default=0.5)) – minimum distance among nodes. Increase the value to increase the distance among nodes.
iteration (int, (default=300)) – number of loops to train the social network to fit min_distance.
figsize (tuple, (default=(15, 15))) – figure size for plot.
node_color (str, (default='#72bbd0')) – color for nodes.
node_factor (int, (default=50)) – size factor for depth nodes. Increasing this value increases node sizes based on depth.
- Returns
G
- Return type
networkx graph object
-
malaya.model.alignment#
-
class
malaya.model.alignment.
Eflomal
[source]# -
align
(source, target, model=3, score_model=0, n_samplers=3, length=1.0, null_prior=0.2, lowercase=True, debug=False, **kwargs)[source]# align text using eflomal, https://github.com/robertostling/eflomal/blob/master/align.py
- Parameters
source (List[str]) –
target (List[str]) –
model (int, optional (default=3)) – Model (1 = IBM1, 2 = IBM1+HMM, 3 = IBM1+HMM+fertility).
score_model (int, optional (default=0)) – (1 = IBM1, 2 = IBM1+HMM, 3 = IBM1+HMM+fertility).
n_samplers (int, optional (default=3)) – Number of independent samplers to run.
length (float, optional (default=1.0)) – Relative number of sampling iterations.
null_prior (float, optional (default=0.2)) – Prior probability of NULL alignment.
lowercase (bool, optional (default=True)) – lowercase during searching priors.
debug (bool, optional (default=False)) – debug eflomal binary.
- Returns
result
- Return type
Dict[List[List[Tuple]]]
-
-
class
malaya.model.alignment.
HuggingFace
[source]# -
align
(source, target, align_layer=8, threshold=0.001)[source]# align text using softmax output layers.
- Parameters
source (List[str]) –
target (List[str]) –
align_layer (int, optional (default=8)) – transformer layer-k to choose for embedding output.
threshold (float, optional (default=1e-3)) – minimum probability to assume as alignment.
- Returns
result
- Return type
List[List[Tuple]]
-
malaya.model.bert#
-
class
malaya.model.bert.
BinaryBERT
[source]# -
vectorize
(strings, method='first')[source]# vectorize list of strings.
- Parameters
strings (List[str]) –
method (str, optional (default='first')) –
Vectorization layer supported. Allowed values:
'last' - vector from last sequence.
'first' - vector from first sequence.
'mean' - average vectors from all sequences.
'word' - average vectors based on tokens.
- Returns
result
- Return type
np.array
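The 'first', 'last' and 'mean' options describe how a sequence of token vectors is reduced to a single vector. The sketch below shows those three reductions on plain lists; `pool` is a hypothetical helper, and the 'word' option is left out because its exact behaviour is not specified here.

```python
def pool(token_vectors, method='first'):
    """Reduce a sequence of token vectors to one vector."""
    if method == 'first':
        return list(token_vectors[0])
    if method == 'last':
        return list(token_vectors[-1])
    if method == 'mean':
        n = len(token_vectors)
        # zip(*...) iterates over dimensions, so each column is averaged
        return [sum(column) / n for column in zip(*token_vectors)]
    raise ValueError(f'unsupported method: {method}')


tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
first, last, mean = pool(tokens, 'first'), pool(tokens, 'last'), pool(tokens, 'mean')
```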
-
predict
(strings, add_neutral=True)[source]# classify list of strings.
- Parameters
strings (List[str]) –
add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.
- Returns
result
- Return type
List[str]
-
predict_proba
(strings, add_neutral=True)[source]# classify list of strings and return probability.
- Parameters
strings (List[str]) –
add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.
- Returns
result
- Return type
List[dict[str, float]]
-
predict_words
(string, method='last', bins_size=0.05, visualization=True, **kwargs)[source]# classify words.
- Parameters
string (str) –
method (str, optional (default='last')) –
Attention layer supported. Allowed values:
'last' - attention from last layer.
'first' - attention from first layer.
'mean' - average attentions from all layers.
bins_size (float, optional (default=0.05)) – default bins size for word distribution histogram.
visualization (bool, optional (default=True)) – If True, it will open the visualization dashboard.
- Returns
result
- Return type
dict
-
-
class
malaya.model.bert.
MulticlassBERT
[source]# -
vectorize
(strings, method='first')[source]# vectorize list of strings.
- Parameters
strings (List[str]) –
method (str, optional (default='first')) –
Vectorization layer supported. Allowed values:
'last' - vector from last sequence.
'first' - vector from first sequence.
'mean' - average vectors from all sequences.
'word' - average vectors based on tokens.
- Returns
result
- Return type
np.array
-
predict
(strings)[source]# classify list of strings.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
List[str]
-
predict_proba
(strings)[source]# classify list of strings and return probability.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
List[dict[str, float]]
-
predict_words
(string, method='last', bins_size=0.05, visualization=True, **kwargs)[source]# classify words.
- Parameters
string (str) –
method (str, optional (default='last')) –
Attention layer supported. Allowed values:
'last' - attention from last layer.
'first' - attention from first layer.
'mean' - average attentions from all layers.
bins_size (float, optional (default=0.05)) – default bins size for word distribution histogram.
visualization (bool, optional (default=True)) – If True, it will open the visualization dashboard.
- Returns
result
- Return type
dict
-
-
class
malaya.model.bert.
SigmoidBERT
[source]# -
vectorize
(strings, method='first')[source]# vectorize list of strings.
- Parameters
strings (List[str]) –
method (str, optional (default='first')) –
Vectorization layer supported. Allowed values:
'last' - vector from last sequence.
'first' - vector from first sequence.
'mean' - average vectors from all sequences.
'word' - average vectors based on tokens.
- Returns
result
- Return type
np.array
-
-
class
malaya.model.bert.
SiameseBERT
[source]# -
vectorize
(strings)[source]# Vectorize list of strings.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
np.array
-
predict_proba
(strings_left, strings_right)[source]# calculate similarity for two different batch of texts.
- Parameters
strings_left (List[str]) –
strings_right (List[str]) –
- Returns
list
- Return type
list of float
-
heatmap
(strings, visualize=True, annotate=True, figsize=(7, 7))[source]# plot a heatmap based on output from similarity
- Parameters
strings (list of str) – list of strings.
visualize (bool) – if True, it will render plt.show, else return data.
figsize (tuple, (default=(7, 7))) – figure size for plot.
- Returns
result – list of results
- Return type
list
-
-
class
malaya.model.bert.
TaggingBERT
[source]# -
vectorize
(string)[source]# vectorize a string.
- Parameters
string (List[str]) –
- Returns
result
- Return type
np.array
-
-
class
malaya.model.bert.
DependencyBERT
[source]#
-
class
malaya.model.bert.
ZeroshotBERT
[source]# -
vectorize
(strings, labels, method='first')[source]# vectorize a string.
- Parameters
strings (List[str]) –
labels (List[str]) –
method (str, optional (default='first')) –
Vectorization layer supported. Allowed values:
'last' - vector from last sequence.
'first' - vector from first sequence.
'mean' - average vectors from all sequences.
'word' - average vectors based on tokens.
- Returns
result
- Return type
np.array
-
malaya.model.bigbird#
-
class
malaya.model.bigbird.
MulticlassBigBird
[source]# -
vectorize
(strings, method='first')[source]# vectorize list of strings.
- Parameters
strings (List[str]) –
method (str, optional (default='first')) –
Vectorization layer supported. Allowed values:
'last' - vector from last sequence.
'first' - vector from first sequence.
'mean' - average vectors from all sequences.
'word' - average vectors based on tokens.
- Returns
result
- Return type
np.array
-
-
class
malaya.model.bigbird.
Summarization
[source]# -
greedy_decoder
(strings, temperature=0.0, postprocess=False, **kwargs)[source]# Summarize strings using greedy decoder.
- Parameters
strings (List[str]) –
temperature (float, (default=0.0)) – logits * -log(random.uniform) * temperature.
postprocess (bool, optional (default=False)) – If True, will filter generated sentences using ROUGE score and remove mentions of international news publishers.
- Returns
result
- Return type
List[str]
-
nucleus_decoder
(strings, top_p=0.7, temperature=0.1, postprocess=False, **kwargs)[source]# Summarize strings using nucleus decoder.
- Parameters
strings (List[str]) –
top_p (float, (default=0.7)) – cumulative distribution and cut off as soon as the CDF exceeds top_p.
temperature (float, (default=0.1)) – logits * -log(random.uniform) * temperature.
postprocess (bool, optional (default=False)) – If True, will filter generated sentences using ROUGE score and remove mentions of international news publishers.
- Returns
result
- Return type
List[str]
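The top_p cut-off described above can be sketched as follows. This is a generic nucleus-sampling illustration over a toy distribution, not the decoder used by the model; nucleus_filter and sample are hypothetical helper names.

```python
import random


def nucleus_filter(probs, top_p=0.7):
    """Keep the most probable tokens until their cumulative probability
    exceeds `top_p`, then renormalize. `probs` maps token -> probability."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative > top_p:
            break  # cut off as soon as the CDF exceeds top_p
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


def sample(probs, top_p=0.7, rng=random):
    """Draw one token from the renormalized nucleus."""
    filtered = nucleus_filter(probs, top_p)
    tokens = list(filtered)
    weights = [filtered[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


probs = {"saya": 0.5, "kami": 0.3, "mereka": 0.15, "dia": 0.05}
filtered = nucleus_filter(probs, top_p=0.7)
print(filtered)
print(sample(probs, top_p=0.7, rng=random.Random(0)))
```

With top_p=0.7 only the two most probable tokens survive, so low-probability continuations are never sampled.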
-
malaya.model.extractive_summarization#
-
class
malaya.model.extractive_summarization.
SKLearn
[source]# -
word_level
(corpus, isi_penting=None, window_size=10, important_words=10, **kwargs)[source]# Summarize list of strings / string on word level.
- Parameters
corpus (str / List[str]) –
isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.
window_size (int, (default=10)) – window size for each word.
important_words (int, (default=10)) – number of important words.
- Returns
dict
- Return type
{‘top-words’, ‘cluster-top-words’, ‘score’}
-
sentence_level
(corpus, isi_penting=None, top_k=3, important_words=10, **kwargs)[source]# Summarize list of strings / string on sentence level.
- Parameters
corpus (str / List[str]) –
isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.
top_k (int, (default=3)) – number of summarized strings.
important_words (int, (default=10)) – number of important words.
- Returns
dict
- Return type
{‘summary’, ‘top-words’, ‘cluster-top-words’, ‘score’}
-
-
class
malaya.model.extractive_summarization.
Doc2Vec
[source]# -
word_level
(corpus, isi_penting=None, window_size=10, aggregation=<function mean>, soft=False, **kwargs)[source]# Summarize list of strings / string on word level.
- Parameters
corpus (str / List[str]) –
isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.
window_size (int, (default=10)) – window size for each word.
aggregation (Callable, optional (default=numpy.mean)) – Aggregation method for Doc2Vec.
soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the nearest word by JaroWinkler ratio. If False, it will return an embedding full of zeros.
- Returns
dict
- Return type
{‘score’}
-
sentence_level
(corpus, isi_penting=None, top_k=3, aggregation=<function mean>, soft=False, **kwargs)[source]# Summarize list of strings / string on sentence level.
- Parameters
corpus (str / List[str]) –
isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.
top_k (int, (default=3)) – number of summarized strings.
aggregation (Callable, optional (default=numpy.mean)) – Aggregation method for Doc2Vec.
soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the nearest word by JaroWinkler ratio. If False, it will return an embedding full of zeros.
- Returns
dict
- Return type
{‘summary’, ‘score’}
-
malaya.model.huggingface#
malaya.model.ml#
-
class
malaya.model.ml.
MulticlassBayes
[source]#
-
class
malaya.model.ml.
BinaryBayes
[source]#
malaya.model.pegasus#
-
class
malaya.model.pegasus.
Summarization
[source]# -
greedy_decoder
(strings, temperature=0.0, postprocess=False, **kwargs)[source]# Summarize strings using greedy decoder.
- Parameters
strings (List[str]) –
temperature (float, (default=0.0)) – logits * -log(random.uniform) * temperature.
postprocess (bool, optional (default=False)) – If True, will filter generated sentences using ROUGE score and remove mentions of international news publishers.
- Returns
result
- Return type
List[str]
-
nucleus_decoder
(strings, top_p=0.7, temperature=0.2, postprocess=False, **kwargs)[source]# Summarize strings using nucleus decoder.
- Parameters
strings (List[str]) –
top_p (float, (default=0.7)) – cumulative distribution and cut off as soon as the CDF exceeds top_p.
temperature (float, (default=0.2)) – logits * -log(random.uniform) * temperature.
postprocess (bool, optional (default=False)) – If True, will filter generated sentences using ROUGE score and remove mentions of international news publishers.
- Returns
result
- Return type
List[str]
-
malaya.model.rules#
-
class
malaya.model.rules.
LanguageDict
[source]# -
predict
(words, acceptable_ms_label=['malay', 'ind'], acceptable_en_label=['eng', 'manglish'], ignore_capital=False, use_is_malay=True, predict_mandarin=False)[source]# Predict [EN, MS, OTHERS, CAPITAL, NOT_LANG] on word level. This method assumes the string is already tokenized.
- Parameters
words (List[str]) –
acceptable_ms_label (List[str], optional (default = ['malay', 'ind'])) – accept labels from language detection model to assume a word is MS.
acceptable_en_label (List[str], optional (default = ['eng', 'manglish'])) – accept labels from language detection model to assume a word is EN.
ignore_capital (bool, optional (default=False)) – if True, will predict language for capital word.
use_is_malay (bool, optional (default=True)) – if True, will predict MS word using malaya.dictionary.is_malay, else use language detection model.
predict_mandarin (bool, optional (default=False)) – if True, will slide the string to match pinyin dict.
- Returns
result
- Return type
List[str]
-
malaya.model.t5#
malaya.model.tf#
-
class
malaya.model.tf.
DeepLang
[source]#
-
class
malaya.model.tf.
Translation
[source]# -
greedy_decoder
(strings)[source]# translate list of strings.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
List[str]
-
beam_decoder
(strings, beam_size=3, temperature=0.5)[source]# translate list of strings using beam decoder. Currently only noisy models support the beam_size and temperature parameters.
- Parameters
strings (List[str]) –
beam_size (int, optional (default=3)) –
temperature (float, optional (default=0.5)) –
- Returns
result
- Return type
List[str]
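To show what beam_size controls, here is a minimal beam search over a toy next-token table. It is a generic sketch, not the translation model's decoder: beam_search, step_fn and TABLE are hypothetical, and temperature is not modeled.

```python
import math


def beam_search(step_fn, start, end, beam_size=3, max_len=10):
    """Return the highest-scoring sequence found by beam search.

    `step_fn(sequence)` must return a dict of next_token -> probability (> 0).
    """
    beams = [([start], 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for sequence, score in beams:
            if sequence[-1] == end:
                candidates.append((sequence, score))  # finished, carry over
                continue
            for token, prob in step_fn(sequence).items():
                candidates.append((sequence + [token], score + math.log(prob)))
        # Keep only the `beam_size` best hypotheses.
        beams = sorted(candidates, key=lambda item: item[1], reverse=True)[:beam_size]
        if all(sequence[-1] == end for sequence, _ in beams):
            break
    return beams[0][0]


# Toy model: the locally best first token ("a") leads to a worse full sequence.
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"z": 0.9, "w": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
    "z": {"</s>": 1.0},
    "w": {"</s>": 1.0},
}


def step_fn(sequence):
    return TABLE[sequence[-1]]


print(beam_search(step_fn, "<s>", "</s>", beam_size=1))  # behaves like greedy decoding
print(beam_search(step_fn, "<s>", "</s>", beam_size=3))
```

With beam_size=1 the search commits to "a" and ends at probability 0.33, while beam_size=3 keeps "b" alive and finds the 0.36 path.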
-
-
class
malaya.model.tf.
Constituency
[source]# -
vectorize
(string)[source]# vectorize a string.
- Parameters
string (List[str]) –
- Returns
result
- Return type
np.array
-
-
class
malaya.model.tf.
TrueCase
[source]#
-
class
malaya.model.tf.
Segmentation
[source]#
-
class
malaya.model.tf.
Paraphrase
[source]# -
greedy_decoder
(strings, **kwargs)[source]# Paraphrase strings using greedy decoder.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
List[str]
-
-
class
malaya.model.tf.
SQUAD
[source]# -
predict
(paragraph_text, question_texts, doc_stride=128, max_query_length=64, max_answer_length=64, n_best_size=20)[source]# Predict Span from questions given a paragraph.
- Parameters
paragraph_text (str) –
question_texts (List[str]) – List of questions; results depend heavily on the casing of the questions.
doc_stride (int, optional (default=128)) – striding size to split a paragraph into multiple texts.
max_query_length (int, optional (default=64)) – Maximum length of question tokens.
max_answer_length (int, optional (default=64)) – Maximum length of answer tokens.
- Returns
result
- Return type
List[{‘text’: ‘text’, ‘start’: 0, ‘end’: 1}]
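The effect of doc_stride can be sketched as a sliding window over the paragraph tokens. The helper below is illustrative only: split_with_stride and its max_length argument are hypothetical and are not part of the predict signature.

```python
def split_with_stride(tokens, max_length, doc_stride):
    """Split `tokens` into overlapping windows of at most `max_length` tokens,
    starting a new window every `doc_stride` tokens."""
    if max_length <= 0 or doc_stride <= 0:
        raise ValueError("max_length and doc_stride must be positive")
    windows, start = [], 0
    while True:
        windows.append(tokens[start:start + max_length])
        if start + max_length >= len(tokens):
            break  # the current window already reaches the end
        start += doc_stride
    return windows


tokens = list(range(10))
windows = split_with_stride(tokens, max_length=4, doc_stride=3)
print(windows)
```

A stride smaller than the window length makes consecutive windows overlap, so an answer that straddles a boundary still appears whole in at least one window.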
-
vectorize
(strings, method='first')[source]# vectorize list of strings.
- Parameters
strings (List[str]) –
method (str, optional (default='first')) –
Vectorization layer supported. Allowed values:
'last' - vector from last sequence.
'first' - vector from first sequence.
'mean' - average vectors from all sequences.
'word' - average vectors based on tokens.
- Returns
result
- Return type
np.array
-
-
class
malaya.model.tf.
GPT2
[source]# -
generate
(string, maxlen=256, n_samples=1, temperature=1.0, top_k=0, top_p=0.0)[source]# generate a text given an initial string.
- Parameters
string (str) –
maxlen (int, optional (default=256)) – length of sentence to generate.
n_samples (int, optional (default=1)) – size of output.
temperature (float, optional (default=1.0)) – temperature value, value should be between 0 and 1.
top_k (int, optional (default=0)) – top-k in nucleus sampling selection.
top_p (float, optional (default=0.0)) – top-p in nucleus sampling selection, value should be between 0 and 1. If top_p == 0, will use top_k. If top_p == 0 and top_k == 0, use greedy decoder.
- Returns
result
- Return type
List[str]
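The fallback rule between top_p, top_k and greedy decoding can be written out as a small sketch. choose_strategy and top_k_filter are hypothetical helpers that mirror the parameter description; they are not functions exposed by the class.

```python
def choose_strategy(top_k=0, top_p=0.0):
    """Mirror the documented rule: top_p first, then top_k, else greedy."""
    if top_p > 0:
        return "top_p"
    if top_k > 0:
        return "top_k"
    return "greedy"


def top_k_filter(probs, top_k):
    """Keep the `top_k` most probable tokens and renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    total = sum(prob for _, prob in ranked)
    return {token: prob / total for token, prob in ranked}


print(choose_strategy())                     # defaults: greedy
print(choose_strategy(top_k=40))             # top_k only
print(choose_strategy(top_k=40, top_p=0.9))  # top_p takes precedence
print(top_k_filter({"a": 0.5, "b": 0.25, "c": 0.25}, top_k=1))
```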
-
-
class
malaya.model.tf.
Seq2SeqLSTM
[source]# -
greedy_decoder
(strings)[source]# Convert to target strings using greedy decoder.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
List[str]
-
-
class
malaya.model.tf.
JawiRumi
[source]# -
greedy_decoder
(strings)[source]# Convert list of jawi strings to rumi strings. ‘ايسو بيل تنب دباوا ك كابينيت - صيفالدين’ -> ‘isu bil tnb dibawa ke kabinet - saifuddin’
- Parameters
strings (List[str]) –
- Returns
result
- Return type
List[str]
-
beam_decoder
(strings, beam_size=3, temperature=0.5)[source]# Convert list of jawi strings to rumi strings. ‘ايسو بيل تنب دباوا ك كابينيت - صيفالدين’ -> ‘isu bil tnb dibawa ke kabinet - saifuddin’
- Parameters
strings (List[str]) –
beam_size (int, optional (default=3)) –
temperature (float, optional (default=0.5)) –
- Returns
result
- Return type
List[str]
-
-
class
malaya.model.tf.
RumiJawi
[source]# -
greedy_decoder
(strings)[source]# Convert list of rumi strings to jawi strings. ‘isu bil tnb dibawa ke kabinet - saifuddin’ -> ‘ايسو بيل تنب دباوا ك كابينيت - صيفالدين’
- Parameters
strings (List[str]) –
- Returns
result
- Return type
List[str]
-
beam_decoder
(strings, beam_size=3, temperature=0.5)[source]# Convert list of rumi strings to jawi strings. ‘isu bil tnb dibawa ke kabinet - saifuddin’ -> ‘ايسو بيل تنب دباوا ك كابينيت - صيفالدين’
- Parameters
strings (List[str]) –
beam_size (int, optional (default=3)) –
temperature (float, optional (default=0.5)) –
- Returns
result
- Return type
List[str]
-
malaya.model.xlnet#
-
class
malaya.model.xlnet.
BinaryXLNET
[source]# -
vectorize
(strings, method='first')[source]# vectorize list of strings.
- Parameters
strings (List[str]) –
method (str, optional (default='first')) –
Vectorization layer supported. Allowed values:
'last' - vector from last sequence.
'first' - vector from first sequence.
'mean' - average vectors from all sequences.
'word' - average vectors based on tokens.
- Returns
result
- Return type
np.array
-
predict
(strings, add_neutral=True)[source]# classify list of strings.
- Parameters
strings (List[str]) –
add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.
- Returns
result
- Return type
List[str]
-
predict_proba
(strings, add_neutral=True)[source]# classify list of strings and return probability.
- Parameters
strings (List[str]) –
add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.
- Returns
result
- Return type
List[dict[str, float]]
-
predict_words
(string, method='last', bins_size=0.05, visualization=True, **kwargs)[source]# classify words.
- Parameters
string (str) –
method (str, optional (default='last')) –
Attention layer supported. Allowed values:
'last' - attention from last layer.
'first' - attention from first layer.
'mean' - average attentions from all layers.
bins_size (float, optional (default=0.05)) – default bins size for word distribution histogram.
visualization (bool, optional (default=True)) – If True, it will open the visualization dashboard.
- Returns
result
- Return type
dict
-
-
class
malaya.model.xlnet.
MulticlassXLNET
[source]# -
vectorize
(strings, method='first')[source]# vectorize list of strings.
- Parameters
strings (List[str]) –
method (str, optional (default='first')) –
Vectorization layer supported. Allowed values:
'last' - vector from last sequence.
'first' - vector from first sequence.
'mean' - average vectors from all sequences.
'word' - average vectors based on tokens.
- Returns
result
- Return type
np.array
-
predict
(strings)[source]# classify list of strings.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
List[str]
-
predict_proba
(strings)[source]# classify list of strings and return probability.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
List[dict[str, float]]
-
predict_words
(string, method='last', bins_size=0.05, visualization=True, **kwargs)[source]# classify words.
- Parameters
string (str) –
method (str, optional (default='last')) –
Attention layer supported. Allowed values:
'last' - attention from last layer.
'first' - attention from first layer.
'mean' - average attentions from all layers.
bins_size (float, optional (default=0.05)) – default bins size for word distribution histogram.
visualization (bool, optional (default=True)) – If True, it will open the visualization dashboard.
- Returns
result
- Return type
dict
-
-
class
malaya.model.xlnet.
SigmoidXLNET
[source]# -
vectorize
(strings, method='first')[source]# vectorize list of strings.
- Parameters
strings (List[str]) –
method (str, optional (default='first')) –
Vectorization layer supported. Allowed values:
'last' - vector from last sequence.
'first' - vector from first sequence.
'mean' - average vectors from all sequences.
'word' - average vectors based on tokens.
- Returns
result
- Return type
np.array
-
predict
(strings)[source]# classify list of strings.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
List[List[str]]
-
predict_proba
(strings)[source]# classify list of strings and return probability.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
List[dict[str, float]]
-
predict_words
(string, method='last', bins_size=0.05, visualization=True, **kwargs)[source]# classify words.
- Parameters
string (str) –
method (str, optional (default='last')) –
Attention layer supported. Allowed values:
'last' - attention from last layer.
'first' - attention from first layer.
'mean' - average attentions from all layers.
bins_size (float, optional (default=0.05)) – default bins size for word distribution histogram.
visualization (bool, optional (default=True)) – If True, it will open the visualization dashboard.
- Returns
result
- Return type
dict
-
-
class
malaya.model.xlnet.
SiameseXLNET
[source]# -
vectorize
(strings)[source]# Vectorize list of strings.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
np.array
-
predict_proba
(strings_left, strings_right)[source]# calculate similarity for two different batch of texts.
- Parameters
strings_left (List[str]) –
strings_right (List[str]) –
- Returns
result
- Return type
List[float]
-
heatmap
(strings, visualize=True, annotate=True, figsize=(7, 7))[source]# plot a heatmap based on output from similarity
- Parameters
strings (list of str) – list of strings.
visualize (bool) – if True, it will render plt.show, else return data.
figsize (tuple, (default=(7, 7))) – figure size for plot.
- Returns
result – list of results
- Return type
list
-
-
class
malaya.model.xlnet.
TaggingXLNET
[source]# -
vectorize
(string)[source]# vectorize a string.
- Parameters
string (List[str]) –
- Returns
result
- Return type
np.array
-
-
class
malaya.model.xlnet.
DependencyXLNET
[source]#
-
class
malaya.model.xlnet.
ZeroshotXLNET
[source]# -
vectorize
(strings, labels, method='first')[source]# vectorize a string.
- Parameters
strings (List[str]) –
labels (List[str]) –
method (str, optional (default='first')) –
Vectorization layer supported. Allowed values:
'last' - vector from last sequence.
'first' - vector from first sequence.
'mean' - average vectors from all sequences.
'word' - average vectors based on tokens.
- Returns
result
- Return type
np.array
-
malaya.torch_model.huggingface#
-
class
malaya.torch_model.huggingface.
Generator
[source]# -
generate
(strings, return_generate=False, prefix=None, **kwargs)[source]# Generate texts from the input.
- Parameters
strings (List[str]) –
**kwargs (keyword arguments passed to huggingface generate method.) – Read more at https://huggingface.co/docs/transformers/main_classes/text_generation
- Returns
result
- Return type
List[str]
-
-
class
malaya.torch_model.huggingface.
Prefix
[source]# -
generate
(string, **kwargs)[source]# Generate texts from the input.
- Parameters
string (str) –
**kwargs (keyword arguments passed to huggingface generate method.) – Read more at https://huggingface.co/docs/transformers/main_classes/text_generation
- Returns
result
- Return type
List[str]
-
-
class
malaya.torch_model.huggingface.
Paraphrase
[source]# -
generate
(strings, postprocess=True, **kwargs)[source]# Generate texts from the input.
- Parameters
strings (List[str]) –
postprocess (bool, optional (default=True)) – If True, will remove the biased generated word Encik.
**kwargs (keyword arguments passed to huggingface generate method.) – Read more at https://huggingface.co/docs/transformers/main_classes/text_generation
- Returns
result
- Return type
List[str]
-
-
class
malaya.torch_model.huggingface.
Summarization
[source]# -
generate
(strings, postprocess=True, n=2, threshold=0.1, reject_similarity=0.85, **kwargs)[source]# Generate texts from the input.
- Parameters
strings (List[str]) –
postprocess (bool, optional (default=True)) – If True, will filter generated sentences using ROUGE score and remove biased generated mentions of international news publishers.
n (int, optional (default=2)) – n-gram size of the ROUGE score used for filtering.
threshold (float, optional (default=0.1)) – minimum threshold for N rouge score to select a sentence.
reject_similarity (float, optional (default=0.85)) – reject similar sentences while maintaining position.
**kwargs (keyword arguments passed to huggingface generate method.) – Read more at https://huggingface.co/docs/transformers/main_classes/text_generation
- Returns
result
- Return type
List[str]
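How n, threshold and reject_similarity could interact is sketched below. This is only an approximation under stated assumptions: n-gram precision stands in for the ROUGE score, difflib.SequenceMatcher stands in for the sentence similarity measure, and the helper names are hypothetical.

```python
from difflib import SequenceMatcher


def ngrams(tokens, n):
    """Set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(sentence, source, n=2):
    """Fraction of the sentence's n-grams that also occur in the source."""
    candidate = ngrams(sentence.lower().split(), n)
    if not candidate:
        return 0.0
    reference = ngrams(source.lower().split(), n)
    return len(candidate & reference) / len(candidate)


def filter_summary(sentences, source, n=2, threshold=0.1, reject_similarity=0.85):
    """Drop sentences with too little n-gram support in the source, and
    near-duplicates of sentences already kept. Order is preserved."""
    kept = []
    for sentence in sentences:
        if ngram_overlap(sentence, source, n) < threshold:
            continue  # not grounded in the source text
        if any(SequenceMatcher(None, sentence, previous).ratio() > reject_similarity
               for previous in kept):
            continue  # too similar to an earlier sentence
        kept.append(sentence)
    return kept


source = "kerajaan umum bantuan baharu untuk petani padi di kedah"
sentences = [
    "kerajaan umum bantuan baharu",
    "kerajaan umum bantuan baharu .",
    "harga minyak naik",
]
print(filter_summary(sentences, source))
```

The second sentence is rejected as a near-duplicate and the third because none of its bigrams occur in the source.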
-
-
class
malaya.torch_model.huggingface.
ZeroShotClassification
[source]# -
predict_proba
(strings, labels, prefix='ayat ini berkaitan tentang', multilabel=True)[source]# classify list of strings and return probability.
- Parameters
strings (List[str]) –
labels (List[str]) –
prefix (str, optional (default='ayat ini berkaitan tentang')) – prefix of labels to zero shot. Playing around with prefix can get better results.
multilabel (bool, optional (default=True)) – if True, the probabilities across labels can sum to more than 1.0.
- Returns
list
- Return type
List[Dict[str, float]]
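One common way probabilities end up summing past 1.0 is scoring every label independently. The sketch below contrasts an independent sigmoid per label with a softmax across labels; it assumes raw per-label scores are available and is not taken from the model's implementation.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def label_probabilities(scores, multilabel=True):
    """Turn raw per-label scores into probabilities.

    multilabel=True: each label is scored independently (sum may exceed 1.0).
    multilabel=False: labels compete through a softmax (sum is 1.0).
    """
    if multilabel:
        return {label: sigmoid(score) for label, score in scores.items()}
    peak = max(scores.values())
    exps = {label: math.exp(score - peak) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


scores = {"sukan": 2.0, "politik": 1.0, "makanan": -3.0}
independent = label_probabilities(scores, multilabel=True)
competing = label_probabilities(scores, multilabel=False)
print(independent)
print(competing)
```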
-
-
class
malaya.torch_model.huggingface.
ZeroShotNER
[source]# -
predict
(string, tags, minimum_length=2, **kwargs)[source]# classify entities in a string.
- Parameters
string (str) – We assume the input string has been properly tokenized.
tags (List[str]) –
minimum_length (int, optional (default=2)) – minimum length of string for an entity.
**kwargs (keyword arguments passed to huggingface generate method.) – Read more at https://huggingface.co/docs/transformers/main_classes/text_generation
- Returns
list
- Return type
Dict[str, List[str]]
-
-
class
malaya.torch_model.huggingface.
ExtractiveQA
[source]# -
predict
(paragraph_text, question_texts, validate_answers=True, validate_questions=False, minimum_threshold_question=0.05, **kwargs)[source]# Predict extractive answers from questions given a paragraph.
- Parameters
paragraph_text (str) –
question_texts (List[str]) – List of questions; results depend heavily on the casing of the questions.
validate_answers (bool, optional (default=True)) – if True, will check the answer is inside the paragraph.
validate_questions (bool, optional (default=False)) – if True, validate that the question is a subset of the paragraph using sklearn.feature_extraction.text.CountVectorizer. It is only useful if paragraph_text and question_texts are in the same language.
minimum_threshold_question (float, optional (default=0.05)) – minimum score from cosine_similarity, only useful if validate_questions = True.
**kwargs (keyword arguments passed to huggingface generate method.) – Read more at https://huggingface.co/docs/transformers/main_classes/text_generation
- Returns
result
- Return type
List[str]
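The question validation step can be approximated without scikit-learn. The sketch below uses whitespace token counts as a stand-in for CountVectorizer and computes cosine similarity by hand; is_question_relevant is a hypothetical helper, and whether the real comparison is strict or inclusive is an assumption.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words counts of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    dot = sum(counts_a[token] * counts_b[token] for token in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(value * value for value in counts_a.values()))
    norm_b = math.sqrt(sum(value * value for value in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_question_relevant(paragraph, question, minimum_threshold_question=0.05):
    """Accept a question only if it shares enough vocabulary with the paragraph."""
    return cosine_similarity(paragraph, question) >= minimum_threshold_question


paragraph = "Kuala Lumpur ialah ibu negara Malaysia"
print(is_question_relevant(paragraph, "Apakah ibu negara Malaysia"))
print(is_question_relevant(paragraph, "Who won the match"))
```

Because the check relies on shared vocabulary, a question in a different language from the paragraph scores near zero, which is why the option is only useful when both are in the same language.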
-
-
class
malaya.torch_model.huggingface.
Transformer
[source]# -
vectorize
(strings, method='last', method_token='first', t5_head_logits=True, **kwargs)[source]# Vectorize string inputs.
- Parameters
strings (List[str]) –
method (str, optional (default='last')) –
hidden layers supported. Allowed values:
'last' - last layer.
'first' - first layer.
'mean' - average all layers.
This is only applicable to non-T5 models.
method_token (str, optional (default='first')) –
token layers supported. Allowed values:
'last' - last token.
'first' - first token.
'mean' - average all tokens.
Pretrained models are usually trained on the first token for classification tasks. This is only applicable to non-T5 models.
t5_head_logits (bool, optional (default=True)) – if True, will take head logits, else, last token. This is only applicable to T5 models.
- Returns
result
- Return type
np.array
-
attention
(strings, method='last', method_head='mean', t5_attention='cross_attentions', **kwargs)[source]# Get attention string inputs.
- Parameters
strings (List[str]) –
method (str, optional (default='last')) –
Attention layer supported. Allowed values:
'last' - attention from last layer.
'first' - attention from first layer.
'mean' - average attentions from all layers.
method_head (str, optional (default='mean')) –
attention head layer supported. Allowed values:
'last' - attention from last layer.
'first' - attention from first layer.
'mean' - average attentions from all layers.
t5_attention (str, optional (default='cross_attentions')) –
attention type for T5 models. Allowed values:
'cross_attentions' - cross attention.
'encoder_attentions' - encoder attention.
'decoder_attentions' - decoder attention.
This is only applicable to T5 models.
- Returns
result
- Return type
List[List[Tuple[str, float]]]
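The return type pairs each token with a weight. A rough sketch of how the 'last', 'first' and 'mean' choices could collapse per-layer weights into that shape is shown below; reduce_layers and pair_tokens are hypothetical, and the input is a toy [layers][tokens] list rather than real model output.

```python
def reduce_layers(layers, method="last"):
    """Collapse [num_layers][num_tokens] weights into one weight per token."""
    if method == "last":
        return list(layers[-1])
    if method == "first":
        return list(layers[0])
    if method == "mean":
        return [sum(column) / len(layers) for column in zip(*layers)]
    raise ValueError(f"unsupported method: {method!r}")


def pair_tokens(tokens, layers, method="last"):
    """Return [(token, weight), ...] for a single string."""
    return list(zip(tokens, reduce_layers(layers, method)))


tokens = ["saya", "suka", "nasi"]
layers = [
    [0.25, 0.25, 0.5],  # first layer
    [0.5, 0.25, 0.25],  # last layer
]
print(pair_tokens(tokens, layers, method="last"))
print(pair_tokens(tokens, layers, method="mean"))
```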
-
-
class
malaya.torch_model.huggingface.
IsiPentingGenerator
[source]# -
generate
(strings, mode='surat-khabar', remove_html_tags=True, **kwargs)[source]# generate a long text given isi penting (key points).
- Parameters
strings (List[str]) –
mode (str, optional (default='surat-khabar')) –
Mode supported. Allowed values:
'surat-khabar' - news style writing.
'tajuk-surat-khabar' - headline news style writing.
'artikel' - article style writing.
'penerangan-produk' - product description style writing.
'karangan' - school essay (karangan sekolah) style writing.
remove_html_tags (bool, optional (default=True)) – Will remove html tags using malaya.text.function.remove_html_tags.
**kwargs (keyword arguments passed to huggingface generate method.) – Read more at https://huggingface.co/docs/transformers/main_classes/text_generation
- Returns
result
- Return type
List[str]
-
-
class
malaya.torch_model.huggingface.
Tatabahasa
[source]# -
generate
(strings, **kwargs)[source]# Fix grammatical errors (kesalahan tatabahasa).
- Parameters
strings (List[str]) –
**kwargs (keyword arguments passed to huggingface generate method.) – Read more at https://huggingface.co/docs/transformers/main_classes/text_generation Grammar correction supports all decoding methods except beam.
- Returns
result
- Return type
List[Tuple[str, int]]
-
-
class
malaya.torch_model.huggingface.
Normalizer
[source]# -
generate
(strings, **kwargs)[source]# abstractive text normalization.
- Parameters
strings (List[str]) –
**kwargs (keyword arguments passed to huggingface generate method.) –
Read more at https://huggingface.co/docs/transformers/main_classes/text_generation
The keyword arguments are also passed to malaya.normalizer.rules.Normalizer.normalize
- Returns
result
- Return type
List[str]
-
-
class
malaya.torch_model.huggingface.
Keyword
[source]# -
generate
(strings, top_keywords=5, **kwargs)[source]# Generate texts from the input.
- Parameters
strings (List[str]) –
top_keywords (int, optional (default=5)) –
**kwargs (keyword arguments passed to huggingface generate method.) – Read more at https://huggingface.co/docs/transformers/main_classes/text_generation
- Returns
result
- Return type
List[str]
-
malaya.torch_model.mask_lm#
malaya.transformers.albert#
-
malaya.transformers.albert.
load
(model='albert', **kwargs)[source]# Load albert model.
- Parameters
model (str, optional (default='albert')) –
Model architecture supported. Allowed values:
'albert' - base albert-bahasa released by Malaya.
'tiny-albert' - tiny albert-bahasa released by Malaya.
- Returns
result
- Return type
malaya.transformers.albert.Model class
-
class
malaya.transformers.albert.
Model
[source]# -
vectorize
(strings, **kwargs)[source]# Vectorize string inputs.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
np.array
-
attention
(strings, method='last', **kwargs)[source]# Get attention string inputs.
- Parameters
strings (List[str]) –
method (str, optional (default='last')) –
Attention layer supported. Allowed values:
'last' - attention from last layer.
'first' - attention from first layer.
'mean' - average attentions from all layers.
- Returns
result
- Return type
List[List[Tuple[str, float]]]
-
malaya.transformers.alxlnet#
-
malaya.transformers.alxlnet.
load
(model='alxlnet', pool_mode='last', **kwargs)[source]# Load alxlnet model.
- Parameters
model (str, optional (default='alxlnet')) –
Model architecture supported. Allowed values:
'alxlnet' - XLNET architecture from google + Malaya.
pool_mode (str, optional (default='last')) –
Model logits architecture supported. Allowed values:
'last' - last of the sequence.
'first' - first of the sequence.
'mean' - mean of the sequence.
'attn' - attention of the sequence.
- Returns
result
- Return type
malaya.transformers.alxlnet.Model class
-
class
malaya.transformers.alxlnet.
Model
[source]# -
vectorize
(strings, **kwargs)[source]# Vectorize string inputs.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
np.array
-
attention
(strings, method='last', **kwargs)[source]# Get attention string inputs.
- Parameters
strings (List[str]) –
method (str, optional (default='last')) –
Attention layer supported. Allowed values:
'last' - attention from last layer.
'first' - attention from first layer.
'mean' - average attentions from all layers.
- Returns
result
- Return type
List[List[Tuple[str, float]]]
-
malaya.transformers.bert#
-
malaya.transformers.bert.
load
(model='base', **kwargs)[source]# Load bert model.
- Parameters
model (str, optional (default='base')) –
Model architecture supported. Allowed values:
'bert' - base bert-bahasa released by Malaya.
'tiny-bert' - tiny bert-bahasa released by Malaya.
- Returns
result
- Return type
malaya.transformers.bert.Model class
-
class
malaya.transformers.bert.
Model
[source]# -
vectorize
(strings, **kwargs)[source]# Vectorize string inputs.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
np.array
-
attention
(strings, method='last', **kwargs)[source]# Get attention string inputs.
- Parameters
strings (List[str]) –
method (str, optional (default='last')) –
Attention layer supported. Allowed values:
'last' - attention from last layer.
'first' - attention from first layer.
'mean' - average attentions from all layers.
- Returns
result
- Return type
List[List[Tuple[str, float]]]
-
malaya.transformers.electra#
-
malaya.transformers.electra.
load
(model='electra', **kwargs)[source]# Load electra model.
- Parameters
model (str, optional (default='electra')) –
Model architecture supported. Allowed values:
'electra' - base electra-bahasa released by Malaya.
'small-electra' - small electra-bahasa released by Malaya.
- Returns
result
- Return type
malaya.transformers.electra.Model class
-
class
malaya.transformers.electra.
Model
[source]# -
vectorize
(strings, **kwargs)[source]# Vectorize string inputs.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
np.array
-
attention
(strings, method='last', **kwargs)[source]# Get attention string inputs.
- Parameters
strings (List[str]) –
method (str, optional (default='last')) –
Attention layer supported. Allowed values:
'last' - attention from last layer.
'first' - attention from first layer.
'mean' - average attentions from all layers.
- Returns
result
- Return type
List[List[Tuple[str, float]]]
-
malaya.transformers.xlnet#
-
malaya.transformers.xlnet.
load
(model='xlnet', pool_mode='last', **kwargs)[source]# Load xlnet model.
- Parameters
model (str, optional (default='xlnet')) –
Model architecture supported. Allowed values:
'xlnet' - XLNET architecture from google.
pool_mode (str, optional (default='last')) –
Model logits architecture supported. Allowed values:
'last' - last of the sequence.
'first' - first of the sequence.
'mean' - mean of the sequence.
'attn' - attention of the sequence.
- Returns
result
- Return type
malaya.transformers.xlnet.Model class
-
class
malaya.transformers.xlnet.
Model
[source]# -
vectorize
(strings, **kwargs)[source]# Vectorize string inputs.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
np.array
-
attention
(strings, method='last', **kwargs)[source]# Get attention string inputs.
- Parameters
strings (List[str]) –
method (str, optional (default='last')) –
Attention layer supported. Allowed values:
'last' - attention from last layer.
'first' - attention from first layer.
'mean' - average attentions from all layers.
- Returns
result
- Return type
List[List[Tuple[str, float]]]
-