API¶
malaya¶
malaya.available_gpu()[source]¶
Get list of GPUs from nvidia-smi.
- Returns
result
- Return type
List[str]
malaya.print_cache(location=None)[source]¶
Print cached data; prints the entire cache folder if location is None.
- Parameters
location (str, (default=None)) – if location is None, will print the entire cache directory.
malaya.clear_cache(location)[source]¶
Remove selected cached data; run malaya.print_cache() to get the path.
- Parameters
location (str) –
- Returns
result
- Return type
boolean
malaya.clear_session(model)[source]¶
Clear session from a model to prevent any out-of-memory or segmentation fault issues.
- Parameters
model (malaya object.) –
- Returns
result
- Return type
boolean
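A minimal usage sketch for the cache utilities above; the cache path in the comment is a hypothetical placeholder, not a guaranteed entry:

import malaya

malaya.available_gpu()   # list GPUs visible to nvidia-smi
malaya.print_cache()     # prints the entire cache directory tree
# remove one cached item; pass a path shown by print_cache(),
# 'emotion/multinomial' here is only an illustrative placeholder
malaya.clear_cache('emotion/multinomial')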
malaya.augmentation¶
malaya.augmentation.synonym(string: str, threshold: float = 0.5, top_n=5, cleaning=<function augmentation_textcleaning>, **kwargs)[source]¶
Augment a string using synonyms, https://github.com/huseinzol05/Malaya-Dataset#90k-synonym
- Parameters
string (str) –
threshold (float, optional (default=0.5)) – probability threshold to randomly select a word for augmentation.
top_n (int, (default=5)) – number of nearest neighbors returned. Length of the returned result should equal top_n.
cleaning (function, (default=malaya.text.function.augmentation_textcleaning)) – function to clean text.
- Returns
result
- Return type
List[str]
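A short sketch of the call above; the sample sentence is arbitrary and outputs vary because word selection is random:

import malaya

string = 'saya suka makan ayam dan ikan'  # arbitrary Malay sample
augmented = malaya.augmentation.synonym(string, threshold=0.5, top_n=5)
print(len(augmented))  # expected to equal top_n, i.e. 5 augmented strings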
malaya.augmentation.wordvector(string: str, wordvector, threshold: float = 0.5, top_n: int = 5, soft: bool = False, cleaning=<function augmentation_textcleaning>)[source]¶
Augment a string using a wordvector.
- Parameters
string (str) –
wordvector (object) – wordvector interface object.
threshold (float, optional (default=0.5)) – probability threshold to randomly select a word for augmentation.
soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the word having the nearest Jaro-Winkler ratio; if False, an exception is thrown for a word not in the dictionary.
top_n (int, (default=5)) – number of nearest neighbors returned. Length of the returned result should equal top_n.
cleaning (function, (default=malaya.text.function.augmentation_textcleaning)) – function to clean text.
- Returns
result
- Return type
List[str]
malaya.augmentation.transformer(string: str, model, threshold: float = 0.5, top_p: float = 0.9, top_k: int = 100, temperature: float = 1.0, top_n: int = 5, cleaning=None)[source]¶
Augment a string using a transformer with nucleus sampling / top-k sampling.
- Parameters
string (str) –
model (object) – transformer interface object. Currently only BERT, ALBERT and ELECTRA are supported.
threshold (float, optional (default=0.5)) – probability threshold to randomly select a word for augmentation.
top_p (float, optional (default=0.9)) – cumulative sum of probabilities to sample a word. If top_p is bigger than 0, the model will use nucleus sampling, else top-k sampling.
top_k (int, optional (default=100)) – k for top-k sampling.
temperature (float, optional (default=1.0)) – logits * temperature.
top_n (int, (default=5)) – number of nearest neighbors returned. Length of the returned result should equal top_n.
cleaning (function, (default=None)) – function to clean text.
- Returns
result
- Return type
List[str]
malaya.cluster¶
malaya.cluster.cluster_words(list_words: List[str], lowercase: bool = False)[source]¶
Cluster similar words based on structure, e.g. ['mahathir mohamad', 'mahathir'] -> ['mahathir mohamad']. Complexity is O(n^2).
- Parameters
list_words (List[str]) –
lowercase (bool, optional (default=False)) – if True, will group using lowercase but maintain the original form.
- Returns
string
- Return type
List[str]
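A minimal sketch of cluster_words on data shaped like the example in the description; exact output ordering may differ:

import malaya

words = ['mahathir mohamad', 'mahathir', 'najib razak', 'najib']
print(malaya.cluster.cluster_words(words))
# expected to keep the longer structures, e.g. ['mahathir mohamad', 'najib razak']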
malaya.cluster.cluster_pos(result: List[Tuple[str, str]])[source]¶
Cluster similar POS.
- Parameters
result (List[Tuple[str, str]]) –
- Returns
result
- Return type
Dict[str, List[str]]
malaya.cluster.cluster_entities(result: List[Tuple[str, str]])[source]¶
Cluster similar entities.
- Parameters
result (List[Tuple[str, str]]) –
- Returns
result
- Return type
Dict[str, List[str]]
malaya.cluster.cluster_tagging(result: List[Tuple[str, str]])[source]¶
Cluster any tagging results, as long as the data passed is [(string, label), (string, label)].
- Parameters
result (List[Tuple[str, str]]) –
- Returns
result
- Return type
Dict[str, List[str]]
malaya.cluster.cluster_scatter(corpus: List[str], vectorizer, num_clusters: int = 5, titles: List[str] = None, colors: List[str] = None, stopwords=<function get_stopwords>, cleaning=<function simple_textcleaning>, clustering=<class 'sklearn.cluster._kmeans.KMeans'>, decomposition=<class 'sklearn.manifold._mds.MDS'>, ngram: Tuple[int, int] = (1, 3), figsize: Tuple[int, int] = (17, 9), batch_size: int = 20)[source]¶
Plot a scatter plot of similar text clusters.
- Parameters
corpus (List[str]) –
vectorizer (class) – vectorizer class.
num_clusters (int, (default=5)) – size of unsupervised clusters.
titles (List[str], (default=None)) – list of titles; length must be the same as corpus.
colors (List[str], (default=None)) – list of colors; length must be the same as num_clusters.
stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – a callable that returns a List[str], or a List[str], or a Tuple[str].
ngram (Tuple[int, int], (default=(1,3))) – n-gram sizes to train on the corpus.
cleaning (function, (default=malaya.texts.function.simple_textcleaning)) – function to clean the corpus.
batch_size (int, (default=20)) – number of strings for each vectorization and attention step. Only used with a transformer vectorizer.
- Returns
dictionary
- Return type
{‘X’: X, ‘Y’: Y, ‘labels’: clusters, ‘vector’: transformed_text_clean, ‘titles’: titles}
malaya.cluster.cluster_dendogram(corpus: List[str], vectorizer, titles: List[str] = None, stopwords=<function get_stopwords>, cleaning=<function simple_textcleaning>, random_samples: float = 0.3, ngram: Tuple[int, int] = (1, 3), figsize: Tuple[int, int] = (17, 9), batch_size: int = 20)[source]¶
Plot a hierarchical dendrogram of similar texts.
- Parameters
corpus (List[str]) –
vectorizer (class) – vectorizer class.
titles (List[str], (default=None)) – list of titles; length must be the same as corpus.
stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – a callable that returns a List[str], or a List[str], or a Tuple[str].
cleaning (function, (default=malaya.text.function.simple_textcleaning)) – function to clean the corpus.
random_samples (float, (default=0.3)) – fraction of random samples from the corpus; 0.3 means 30%.
ngram (Tuple[int, int], (default=(1,3))) – n-gram sizes to train on the corpus.
batch_size (int, (default=20)) – number of strings for each vectorization and attention step. Only used with a transformer vectorizer.
- Returns
dictionary
- Return type
{‘linkage_matrix’: linkage_matrix, ‘titles’: titles}
malaya.cluster.cluster_graph(corpus: List[str], vectorizer, threshold: float = 0.9, num_clusters: int = 5, titles: List[str] = None, colors: List[str] = None, stopwords=<function get_stopwords>, ngram: Tuple[int, int] = (1, 3), cleaning=<function simple_textcleaning>, clustering=<class 'sklearn.cluster._kmeans.KMeans'>, figsize: Tuple[int, int] = (17, 9), with_labels: bool = True, batch_size: int = 20)[source]¶
Plot an undirected graph of similar texts.
- Parameters
corpus (List[str]) –
vectorizer (class) – vectorizer class.
threshold (float, (default=0.9)) – keep edges with absolute Pearson correlation above 0.9 (90%).
num_clusters (int, (default=5)) – size of unsupervised clusters.
titles (List[str], (default=None)) – list of titles; length must be the same as corpus.
stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – a callable that returns a List[str], or a List[str], or a Tuple[str].
cleaning (function, (default=malaya.texts.function.simple_textcleaning)) – function to clean the corpus.
ngram (Tuple[int, int], (default=(1,3))) – n-gram sizes to train on the corpus.
batch_size (int, (default=20)) – number of strings for each vectorization and attention step. Only used with a transformer vectorizer.
- Returns
dictionary
- Return type
{‘G’: G, ‘pos’: pos, ‘node_colors’: node_colors, ‘node_labels’: node_labels}
malaya.cluster.cluster_entity_linking(corpus: List[str], vectorizer, entity_model, topic_modeling_model, threshold: float = 0.3, topic_decomposition: int = 2, topic_length: int = 10, fuzzy_ratio: int = 70, accepted_entities: List[str] = ['law', 'location', 'organization', 'person', 'event'], cleaning=<function simple_textcleaning>, colors: Optional[List[str]] = None, stopwords=<function get_stopwords>, max_df: float = 1.0, min_df: int = 1, ngram: Tuple[int, int] = (2, 3), figsize: Tuple[int, int] = (17, 9), batch_size: int = 20)[source]¶
Plot an undirected graph of the relationship between entities and topics.
- Parameters
corpus (list or str) –
vectorizer (class) –
titles (list) – list of titles; length must be the same as corpus.
colors (list) – list of colors; length must be the same as num_clusters.
threshold (float, (default=0.3)) – keep edges with absolute Pearson correlation above 0.3 (30%).
topic_decomposition (int, (default=2)) – size of decomposition.
topic_length (int, (default=10)) – size of topic models.
fuzzy_ratio (int, (default=70)) – ratio threshold for fuzzywuzzy.
max_df (float, (default=1.0)) – maximum document frequency for a word to be selected.
min_df (int, (default=1)) – minimum document frequency for a word to be selected.
ngram (tuple, (default=(2,3))) – n-gram sizes to train on the corpus.
cleaning (function, (default=simple_textcleaning)) – function to clean the corpus.
stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – a callable that returns a List[str], or a List[str], or a Tuple[str].
- Returns
dictionary
- Return type
{‘G’: G, ‘pos’: pos, ‘node_colors’: node_colors, ‘node_labels’: node_labels}
malaya.constituency¶
malaya.constituency.transformer(model: str = 'xlnet', quantized: bool = False, **kwargs)[source]¶
Load Transformer constituency parsing model: transfer learning Transformer + self-attentive parsing.
- Parameters
model (str, optional (default='xlnet')) –
Model architecture supported. Allowed values:
'bert' - Google BERT BASE parameters.
'tiny-bert' - Google BERT TINY parameters.
'albert' - Google ALBERT BASE parameters.
'tiny-albert' - Google ALBERT TINY parameters.
'xlnet' - Google XLNET BASE parameters.
quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result
- Return type
malaya.model.tf.Constituency class
malaya.dependency¶
malaya.dependency.dependency_graph(tagging, indexing)[source]¶
Return a helper object for dependency parser results. Only accepts tagging and indexing outputs from dependency models.
malaya.dependency.available_transformer()[source]¶
List available transformer dependency parsing models.
malaya.dependency.transformer(model: str = 'xlnet', quantized: bool = False, **kwargs)[source]¶
Load Transformer dependency parsing model: transfer learning Transformer + biaffine attention.
- Parameters
model (str, optional (default='xlnet')) –
Model architecture supported. Allowed values:
'bert' - Google BERT BASE parameters.
'tiny-bert' - Google BERT TINY parameters.
'albert' - Google ALBERT BASE parameters.
'tiny-albert' - Google ALBERT TINY parameters.
'xlnet' - Google XLNET BASE parameters.
'alxlnet' - Malaya ALXLNET BASE parameters.
quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result – one of the following model classes:
if 'bert' in model, returns malaya.model.bert.DependencyBERT.
if 'xlnet' in model, returns malaya.model.xlnet.DependencyXLNET.
- Return type
model
malaya.emotion¶
malaya.emotion.multinomial(**kwargs)[source]¶
Load multinomial emotion model.
- Returns
result
- Return type
malaya.model.ml.BAYES class
malaya.emotion.transformer(model: str = 'xlnet', quantized: bool = False, **kwargs)[source]¶
Load Transformer emotion model.
- Parameters
model (str, optional (default='xlnet')) –
Model architecture supported. Allowed values:
'bert' - Google BERT BASE parameters.
'tiny-bert' - Google BERT TINY parameters.
'albert' - Google ALBERT BASE parameters.
'tiny-albert' - Google ALBERT TINY parameters.
'xlnet' - Google XLNET BASE parameters.
'alxlnet' - Malaya ALXLNET BASE parameters.
quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result – one of the following model classes:
if 'bert' in model, returns malaya.model.bert.MulticlassBERT.
if 'xlnet' in model, returns malaya.model.xlnet.MulticlassXLNET.
- Return type
model
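A hedged usage sketch. Loading is documented above; the predict call assumes the returned classification model exposes a predict(List[str]) method, as in Malaya's quickstart examples:

import malaya

model = malaya.emotion.transformer(model='bert')
# assumption: classification models accept a list of strings
print(model.predict(['saya sangat gembira hari ini']))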
malaya.entity¶
malaya.entity.describe_ontonotes5()[source]¶
Describe supported OntoNotes5 entities. https://spacy.io/api/annotation#named-entities
malaya.entity.available_transformer_ontonotes5()[source]¶
List available transformer entity tagging models trained on OntoNotes 5 Bahasa.
malaya.entity.transformer(model: str = 'xlnet', quantized: bool = False, **kwargs)[source]¶
Load Transformer entity tagging model: transfer learning Transformer + CRF.
- Parameters
model (str, optional (default='xlnet')) –
Model architecture supported. Allowed values:
'bert' - Google BERT BASE parameters.
'tiny-bert' - Google BERT TINY parameters.
'albert' - Google ALBERT BASE parameters.
'tiny-albert' - Google ALBERT TINY parameters.
'xlnet' - Google XLNET BASE parameters.
'alxlnet' - Malaya ALXLNET BASE parameters.
quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result – one of the following model classes:
if 'bert' in model, returns malaya.model.bert.TaggingBERT.
if 'xlnet' in model, returns malaya.model.xlnet.TaggingXLNET.
- Return type
model
malaya.entity.transformer_ontonotes5(model: str = 'xlnet', quantized: bool = False, **kwargs)[source]¶
Load Transformer entity tagging model trained on OntoNotes 5 Bahasa: transfer learning Transformer + CRF.
- Parameters
model (str, optional (default='xlnet')) –
Model architecture supported. Allowed values:
'bert' - Google BERT BASE parameters.
'tiny-bert' - Google BERT TINY parameters.
'albert' - Google ALBERT BASE parameters.
'tiny-albert' - Google ALBERT TINY parameters.
'xlnet' - Google XLNET BASE parameters.
'alxlnet' - Malaya ALXLNET BASE parameters.
quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result – one of the following model classes:
if 'bert' in model, returns malaya.model.bert.TaggingBERT.
if 'xlnet' in model, returns malaya.model.xlnet.TaggingXLNET.
- Return type
model
malaya.entity.general_entity(model=None)[source]¶
Load regex-based general entity tagging, along with another supervised entity tagging model.
- Parameters
model (object) – model must have a predict method; make sure the predict method returns [(string, label), (string, label)].
- Returns
result
- Return type
malaya.text.entity.EntityRegex class
malaya.generator¶
malaya.generator.ngrams(sequence, n: int, pad_left=False, pad_right=False, left_pad_symbol=None, right_pad_symbol=None)[source]¶
Generate ngrams.
- Parameters
sequence (List[str]) – list of tokenized words.
n (int) – ngram size.
- Returns
result
- Return type
List[Tuple[str, str]]
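A quick sketch of ngrams on a token list; the bigram output in the comment follows directly from the definition:

import malaya

tokens = ['saya', 'suka', 'makan', 'ayam']
print(list(malaya.generator.ngrams(tokens, n=2)))
# [('saya', 'suka'), ('suka', 'makan'), ('makan', 'ayam')]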
malaya.generator.pos_entities_ngram(result_pos: List[Tuple[str, str]], result_entities: List[Tuple[str, str]], ngram: Tuple[int, int] = (1, 3), accept_pos: List[str] = ['NOUN', 'PROPN', 'VERB'], accept_entities: List[str] = ['law', 'location', 'organization', 'person', 'time'])[source]¶
Generate ngrams.
- Parameters
result_pos (List[Tuple[str, str]]) – result from POS recognition.
result_entities (List[Tuple[str, str]]) – result from entity recognition.
ngram (Tuple[int, int]) – ngram sizes.
accept_pos (List[str]) – accepted POS elements.
accept_entities (List[str]) – accepted entity elements.
- Returns
result
- Return type
list
malaya.generator.sentence_ngram(sentence: str, ngram: Tuple[int, int] = (1, 3))[source]¶
Generate ngrams for a text.
- Parameters
sentence (str) –
ngram (tuple) – ngram sizes.
- Returns
result
- Return type
list
malaya.generator.shortform(word: str, augment_vowel: bool = True, augment_consonant: bool = True, prob_delete_vowel: float = 0.5, **kwargs)[source]¶
Augment a formal word into social media form: deliberately introduce typos, delete some vowels, and replace some subwords with slang subwords.
- Parameters
word (str) –
augment_vowel (bool, (default=True)) – if True, will augment vowels for each sample generated.
augment_consonant (bool, (default=True)) – if True, will augment consonants for each sample generated.
prob_delete_vowel (float, (default=0.5)) – probability to delete a vowel.
- Returns
result
- Return type
list
malaya.generator.babble(string: str, model, generate_length: int = 30, leed_out_len: int = 1, temperature: float = 1.0, top_k: int = 100, burnin: int = 15, batch_size: int = 5)[source]¶
Use pretrained transformer models to generate a string given a prefix string. https://github.com/nyu-dl/bert-gen, https://arxiv.org/abs/1902.04094
- Parameters
string (str) –
model (object) – transformer interface object. Currently only BERT and ALBERT are supported.
generate_length (int, optional (default=30)) – length of sentence to generate.
leed_out_len (int, optional (default=1)) – length of extra masks for each iteration.
temperature (float, optional (default=1.0)) – logits * temperature.
top_k (int, optional (default=100)) – k for top-k sampling.
burnin (int, optional (default=15)) – for the first burnin steps, sample from the entire next word distribution, instead of top_k.
batch_size (int, optional (default=5)) – generate a batch of batch_size sentences.
- Returns
result
- Return type
List[str]
malaya.generator.gpt2(model: str = '345M', generate_length: int = 256, temperature: float = 1.0, top_k: int = 40, **kwargs)[source]¶
Load GPT2 model to generate a string given a prefix string.
- Parameters
model (str, optional (default='345M')) –
Model architecture supported. Allowed values:
'117M' - GPT2 117M parameters.
'345M' - GPT2 345M parameters.
generate_length (int, optional (default=256)) – length of sentence to generate.
temperature (float, optional (default=1.0)) – temperature value; should be between 0 and 1.
top_k (int, optional (default=40)) – k for top-k sampling selection.
- Returns
result
- Return type
malaya.transformers.gpt2.Model class
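A hedged sketch; loading is documented above, while the generation call assumes the returned GPT2 model class exposes a generate(prefix) method as in Malaya's examples:

import malaya

model = malaya.generator.gpt2(model='117M')
# assumption: the model object exposes generate(prefix_string)
print(model.generate('ceritanya sangat menarik, '))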
malaya.generator.transformer(model: str = 't5', quantized: bool = False, **kwargs)[source]¶
Load Transformer model to generate a string given an isu penting (important issue).
- Parameters
model (str, optional (default='t5')) –
Model architecture supported. Allowed values:
't5' - T5 BASE parameters.
'small-t5' - T5 SMALL parameters.
quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result – one of the following model classes:
if 't5' in model, returns malaya.model.t5.Generator.
- Return type
model
malaya.keyword_extraction¶
malaya.keyword_extraction.rake(string: str, model=None, vectorizer=None, top_k: int = 5, atleast: int = 1, stopwords=<function get_stopwords>, **kwargs)[source]¶
Extract keywords using the RAKE algorithm.
- Parameters
string (str) –
model (Object, optional (default=None)) – Transformer model or any model that has an attention method.
vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or malaya.text.vectorizer.SkipGramCountVectorizer. If None, will generate ngrams automatically based on stopwords.
top_k (int, optional (default=5)) – return top-k results.
ngram (tuple, optional (default=(1,1))) – n-gram sizes.
atleast (int, optional (default=1)) – minimum number of occurrences in the string for a candidate to be accepted.
stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – a callable that returns a List[str], or a List[str], or a Tuple[str], for the automatic ngram generator.
- Returns
result
- Return type
Tuple[float, str]
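A minimal sketch of rake without a model or vectorizer, so the automatic ngram generator is used; the sample sentence is arbitrary:

import malaya

string = 'kerajaan bercadang membina lebuh raya baharu di Sabah dan Sarawak'
print(malaya.keyword_extraction.rake(string, top_k=3))
# (score, keyword) pairs, highest score first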
malaya.keyword_extraction.textrank(string: str, model=None, vectorizer=None, top_k: int = 5, atleast: int = 1, stopwords=<function get_stopwords>, **kwargs)[source]¶
Extract keywords using the TextRank algorithm.
- Parameters
string (str) –
model (Object, optional (default=None)) – model that has a fit_transform or vectorize method.
vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or malaya.text.vectorizer.SkipGramCountVectorizer. If None, will generate ngrams automatically based on stopwords.
top_k (int, optional (default=5)) – return top-k results.
atleast (int, optional (default=1)) – minimum number of occurrences in the string for a candidate to be accepted.
stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – a callable that returns a List[str], or a List[str], or a Tuple[str].
- Returns
result
- Return type
Tuple[float, str]
malaya.keyword_extraction.attention(string: str, model, vectorizer=None, top_k: int = 5, atleast: int = 1, stopwords=<function get_stopwords>, **kwargs)[source]¶
Extract keywords using the attention mechanism.
- Parameters
string (str) –
model (Object) – Transformer model or any model that has an attention method.
vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or malaya.text.vectorizer.SkipGramCountVectorizer. If None, will generate ngrams automatically based on stopwords.
top_k (int, optional (default=5)) – return top-k results.
atleast (int, optional (default=1)) – minimum number of occurrences in the string for a candidate to be accepted.
stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – a callable that returns a List[str], or a List[str], or a Tuple[str].
- Returns
result
- Return type
Tuple[float, str]
malaya.language_detection¶
malaya.language_detection.fasttext(quantized: bool = True, **kwargs)[source]¶
Load Fasttext language detection model.
- Parameters
quantized (bool, optional (default=True)) – if True, load quantized fasttext model. Else, load original fasttext model.
- Returns
result
- Return type
malaya.model.ml.LanguageDetection class
malaya.language_detection.deep_model(**kwargs)[source]¶
Load deep learning language detection model.
- Returns
result
- Return type
malaya.model.tf.DeepLang class
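A hedged usage sketch; loading is documented above, and the predict call assumes the detector exposes a predict(List[str]) method as in Malaya's quickstart examples:

import malaya

model = malaya.language_detection.fasttext()
# assumption: the detector accepts a list of strings
print(model.predict(['suka tak saya tak tahu', 'i love you so much']))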
malaya.lexicon¶
malaya.lexicon.random_walk(lexicon: Dict[str, List[str]], wordvector, pool_size: int = 10, top_n: int = 20, similarity_power: float = 10.0, beta: float = 0.9, arccos: bool = True, normalization: bool = True, soft: bool = False, silent: bool = False)[source]¶
Induce a lexicon using the random walk technique, as used in the paper https://arxiv.org/pdf/1606.02820.pdf
- Parameters
lexicon (Dict[str, List[str]]) – curated lexicon from a domain expert, {'label1': [str], 'label2': [str]}.
wordvector (object) – wordvector interface object.
pool_size (int, optional (default=10)) – pick the top pool_size words from each lexicon.
top_n (int, optional (default=20)) – the top_n results for each vector will be multiplied by similarity_power.
similarity_power (float, optional (default=10.0)) – extra score for the top_n results; a lower value induces less bias but has a higher chance of an unbalanced outcome.
beta (float, optional (default=0.9)) – penalty score; closer to 1.0 means less penalty. 0 < beta < 1.
arccos (bool, optional (default=True)) – covariance distribution for embedded.dot(embedded.T); if False, covariance + 1.
normalization (bool, optional (default=True)) – normalize word vectors using the L2 norm. L2 is good at penalizing skewed vectors.
soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the word having the nearest Jaro-Winkler ratio; if False, an exception is thrown for a word not in the dictionary.
silent (bool, optional (default=False)) – if True, will not print any logs.
- Returns
result
- Return type
tuple(labels[argmax(scores, axis=1)], scores, labels)
malaya.lexicon.propagate_probabilistic(lexicon: Dict[str, List[str]], wordvector, pool_size: int = 10, top_n: int = 20, similarity_power: float = 10.0, arccos: bool = True, normalization: bool = True, soft: bool = False, silent: bool = False)[source]¶
Learn polarity scores via standard label propagation from lexicon sets.
- Parameters
lexicon (Dict[str, List[str]]) – curated lexicon from a domain expert, {'label1': [str], 'label2': [str]}.
wordvector (object) – wordvector interface object.
pool_size (int, optional (default=10)) – pick the top pool_size words from each lexicon.
top_n (int, optional (default=20)) – the top_n results for each vector will be multiplied by similarity_power.
similarity_power (float, optional (default=10.0)) – extra score for the top_n results; a lower value induces less bias but has a higher chance of an unbalanced outcome.
arccos (bool, optional (default=True)) – covariance distribution for embedded.dot(embedded.T); if False, covariance + 1.
normalization (bool, optional (default=True)) – normalize word vectors using the L2 norm. L2 is good at penalizing skewed vectors.
soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the word having the nearest Jaro-Winkler ratio; if False, an exception is thrown for a word not in the dictionary.
silent (bool, optional (default=False)) – if True, will not print any logs.
- Returns
result
- Return type
tuple(labels[argmax(scores, axis=1)], scores, labels)
malaya.lexicon.propagate_graph(lexicon: Dict[str, List[str]], wordvector, pool_size: int = 10, top_n: int = 20, similarity_power: float = 10.0, normalization: bool = True, soft: bool = False, silent: bool = False)[source]¶
Graph propagation method adapted from Velikovich, Leonid, et al. "The viability of web-derived polarity lexicons." http://www.aclweb.org/anthology/N10-1119
- Parameters
lexicon (Dict[str, List[str]]) – curated lexicon from a domain expert, {'label1': [str], 'label2': [str]}.
wordvector (object) – wordvector interface object.
pool_size (int, optional (default=10)) – pick the top pool_size words from each lexicon.
top_n (int, optional (default=20)) – the top_n results for each vector will be multiplied by similarity_power.
similarity_power (float, optional (default=10.0)) – extra score for the top_n results; a lower value induces less bias but has a higher chance of an unbalanced outcome.
normalization (bool, optional (default=True)) – normalize word vectors using the L2 norm. L2 is good at penalizing skewed vectors.
soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the word having the nearest Jaro-Winkler ratio; if False, an exception is thrown for a word not in the dictionary.
silent (bool, optional (default=False)) – if True, will not print any logs.
- Returns
result
- Return type
tuple(labels[argmax(scores, axis=1)], scores, labels)
malaya.normalize¶
malaya.normalize.normalizer(speller=None, **kwargs)[source]¶
Load a Normalizer using any spelling correction model.
- Parameters
speller (spelling correction object, optional (default = None)) –
- Returns
result
- Return type
malaya.normalize.Normalizer class
class malaya.normalize.Normalizer[source]¶
normalize(string: str, check_english: bool = True, normalize_text: bool = True, normalize_entity: bool = True, normalize_url: bool = False, normalize_email: bool = False, normalize_year: bool = True, normalize_telephone: bool = True, logging: bool = False)[source]¶
Normalize a string.
- Parameters
string (str) –
check_english (bool, (default=True)) – check whether a word is in the English dictionary.
normalize_text (bool, (default=True)) – if True, will try to replace shortforms using the internal corpus.
normalize_entity (bool, (default=True)) – normalize entities; only affects date, datetime, time and money pattern strings.
normalize_url (bool, (default=False)) – if True, replace :// with an empty string and . with dot. https://huseinhouse.com -> https huseinhouse dot com.
normalize_email (bool, (default=False)) – if True, replace @ with di and . with dot. husein.zol05@gmail.com -> husein dot zol kosong lima di gmail dot com.
normalize_year (bool, (default=True)) – if True, tahun 1987 -> tahun sembilan belas lapan puluh tujuh, and 1970-an -> sembilan belas tujuh puluh an. If False, tahun 1987 -> tahun seribu sembilan ratus lapan puluh tujuh.
normalize_telephone (bool, (default=True)) – if True, no 012-1234567 -> no kosong satu dua, satu dua tiga empat lima enam tujuh.
logging (bool, (default=False)) – if True, will log index and token queue using logging.warn.
- Returns
string
- Return type
normalized string
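A minimal sketch of the Normalizer documented above, loaded without a speller; the input string is an arbitrary noisy-Malay sample:

import malaya

normalizer = malaya.normalize.normalizer()  # speller=None
print(normalizer.normalize('xjdi ke, y u xsuke makan kt situ'))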
malaya.nsfw¶
malaya.nsfw.lexicon(**kwargs)[source]¶
Load lexicon NSFW model.
- Returns
result
- Return type
malaya.text.lexicon.nsfw.Lexicon class
malaya.nsfw.multinomial(**kwargs)[source]¶
Load multinomial NSFW model.
- Returns
result
- Return type
malaya.model.ml.BAYES class
malaya.num2word¶
malaya.num2word.to_cardinal(number)[source]¶
Translate a number to its cardinal text representation.
- Parameters
number (real number) –
- Returns
result – cardinal representation
- Return type
str
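A quick sketch of to_cardinal; the expected Malay output in the comment follows standard Malay number names:

import malaya

print(malaya.num2word.to_cardinal(123))
# expected: 'seratus dua puluh tiga'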
malaya.num2word.to_ordinal(number)[source]¶
Translate a number to its ordinal text representation.
- Parameters
number (real number) –
- Returns
result – ordinal representation
- Return type
str
malaya.num2word.to_ordinal_num(number)[source]¶
Translate a number to its ordinal numbering text representation.
- Parameters
number (int) –
- Returns
result – ordinal numbering representation
- Return type
str
malaya.num2word.to_currency(value)[source]¶
Translate a number to its cardinal currency text representation.
- Parameters
value (int) –
- Returns
result – cardinal currency representation
- Return type
str
malaya.num2word.to_year(value)[source]¶
Translate a number to its cardinal year text representation.
- Parameters
value (int) –
- Returns
result – cardinal year representation
- Return type
str
malaya.paraphrase¶
malaya.paraphrase.transformer(model: str = 't2t', quantized: bool = False, **kwargs)[source]¶
Load Malaya transformer encoder-decoder model to generate a paraphrase given a string.
- Parameters
model (str, optional (default='t2t')) –
Model architecture supported. Allowed values:
't2t' - Malaya Transformer BASE parameters.
'small-t2t' - Malaya Transformer SMALL parameters.
't5' - T5 BASE parameters.
'small-t5' - T5 SMALL parameters.
quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result – one of the following model classes:
if 't2t' in model, returns malaya.model.tf.Paraphrase.
if 't5' in model, returns malaya.model.t5.Paraphrase.
- Return type
model
malaya.pos¶
malaya.pos.available_transformer()[source]¶
List available transformer Part-of-Speech tagging models.
malaya.pos.naive(string: str)[source]¶
Recognize POS in a string using regex.
- Parameters
string (str) –
- Returns
string
- Return type
List[Tuple[str, str]]
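A minimal sketch of the regex-based tagger; the tags in the comment are illustrative, not guaranteed output:

import malaya

print(malaya.pos.naive('Saya suka makan ayam'))
# list of (word, POS) tuples, e.g. [('Saya', 'PRON'), ...]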
malaya.pos.transformer(model: str = 'xlnet', quantized: bool = False, **kwargs)[source]¶
Load Transformer POS tagging model: transfer learning Transformer + CRF.
- Parameters
model (str, optional (default='xlnet')) –
Model architecture supported. Allowed values:
'bert' - Google BERT BASE parameters.
'tiny-bert' - Google BERT TINY parameters.
'albert' - Google ALBERT BASE parameters.
'tiny-albert' - Google ALBERT TINY parameters.
'xlnet' - Google XLNET BASE parameters.
'alxlnet' - Malaya ALXLNET BASE parameters.
quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result – one of the following model classes:
if 'bert' in model, returns malaya.model.bert.TaggingBERT.
if 'xlnet' in model, returns malaya.model.xlnet.TaggingXLNET.
- Return type
model
malaya.preprocessing¶
malaya.preprocessing.unpack_english_contractions(text)[source]¶
Replace English contractions in text (str) with their unshortened forms. N.B. The "'d" and "'s" forms are ambiguous (had/would, is/has/possessive), so they are left as-is. Important note: this function is taken from textacy (https://github.com/chartbeat-labs/textacy).
malaya.preprocessing.preprocessing(normalize: List[str] = ['url', 'email', 'percent', 'money', 'phone', 'user', 'time', 'date', 'number'], annotate: List[str] = ['allcaps', 'elongated', 'repeated', 'emphasis', 'censored', 'hashtag'], lowercase: bool = True, fix_unidecode: bool = True, expand_english_contractions: bool = True, translate_english_to_bm: bool = True, speller=None, segmenter=None, stemmer=None, **kwargs)[source]¶
Load Preprocessing class.
- Parameters
normalize (list) – tokens to normalize; see all supported values at malaya.preprocessing.get_normalize().
annotate (list) – annotate tokens <open></open>; only accepts ['hashtag', 'allcaps', 'elongated', 'repeated', 'emphasis', 'censored'].
lowercase (bool) –
fix_unidecode (bool) –
expand_english_contractions (bool) – expand English contractions.
translate_english_to_bm (bool) – translate English words to Bahasa Malaysia words.
speller (object) – spelling correction object; must have a correct method.
segmenter (object) – segmentation object; must have a segment method. If provided, it will expand hashtags, #mondayblues == monday blues.
stemmer (object) – stemmer object; must have a stem method. If provided, it will stem or lemmatize the string.
- Returns
result
- Return type
malaya.preprocessing.Preprocessing class
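A hedged usage sketch; loading is documented above, while the process call assumes the Preprocessing class exposes a process(string) method as in Malaya's examples:

import malaya

preprocessing = malaya.preprocessing.preprocessing()
# assumption: the object exposes process(string) returning tokens
print(preprocessing.process('CANT WAIT for the new season of #mahathirmohamad'))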
malaya.relevancy¶
malaya.relevancy.available_transformer()[source]¶
List available transformer relevancy analysis models.
malaya.relevancy.transformer(model: str = 'xlnet', quantized: bool = False, **kwargs)[source]¶
Load Transformer relevancy model.
- Parameters
model (str, optional (default='xlnet')) –
Model architecture supported. Allowed values:
'bert' - Google BERT BASE parameters.
'tiny-bert' - Google BERT TINY parameters.
'albert' - Google ALBERT BASE parameters.
'tiny-albert' - Google ALBERT TINY parameters.
'xlnet' - Google XLNET BASE parameters.
'alxlnet' - Malaya ALXLNET BASE parameters.
'bigbird' - Google BigBird BASE parameters.
'tiny-bigbird' - Malaya BigBird TINY parameters.
quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result – one of the following model classes:
if 'bert' in model, returns malaya.model.bert.MulticlassBERT.
if 'xlnet' in model, returns malaya.model.xlnet.MulticlassXLNET.
if 'bigbird' in model, returns malaya.model.xlnet.MulticlassBigBird.
- Return type
model
malaya.segmentation¶
malaya.segmentation.viterbi(max_split_length: int = 20, **kwargs)[source]¶
Load Segmenter class using the Viterbi algorithm.
- Parameters
max_split_length (int, (default=20)) – maximum length of words in a sentence to segment.
validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
- Returns
result
- Return type
malaya.segmentation.Segmenter class
malaya.segmentation.transformer(model: str = 'small', quantized: bool = False, **kwargs)[source]¶
Load transformer encoder-decoder model to segmentize.
- Parameters
model (str, optional (default='small')) –
Model architecture supported. Allowed values:
'small' - Transformer SMALL parameters.
'base' - Transformer BASE parameters.
quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result
- Return type
malaya.model.tf.Segmentation class
malaya.sentiment¶
malaya.sentiment.available_transformer()[source]¶
List available transformer sentiment analysis models.
malaya.sentiment.multinomial(**kwargs)[source]¶
Load multinomial sentiment model.
- Returns
result
- Return type
malaya.model.ml.Bayes class
malaya.sentiment.transformer(model: str = 'bert', quantized: bool = False, **kwargs)[source]¶
Load Transformer sentiment model.
- Parameters
model (str, optional (default='bert')) –
Model architecture supported. Allowed values:
'bert' - Google BERT BASE parameters.
'tiny-bert' - Google BERT TINY parameters.
'albert' - Google ALBERT BASE parameters.
'tiny-albert' - Google ALBERT TINY parameters.
'xlnet' - Google XLNET BASE parameters.
'alxlnet' - Malaya ALXLNET BASE parameters.
quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result – one of the following model classes:
if 'bert' in model, returns malaya.model.bert.BinaryBERT.
if 'xlnet' in model, returns malaya.model.xlnet.BinaryXLNET.
- Return type
model
malaya.spell¶
class malaya.spell.Probability(corpus, sp_tokenizer=None)[source]¶
The SpellCorrector extends the functionality of Peter Norvig's spell-corrector (http://norvig.com/spell-correct.html) and improves it using algorithms from Normalization of noisy texts in Malaysian online reviews, https://www.researchgate.net/publication/287050449_Normalization_of_noisy_texts_in_Malaysian_online_reviews, with added custom vowel augmentation.
correct(word: str, **kwargs)[source]¶
Most probable spelling correction for word.
- Parameters
word (str) –
- Returns
result
- Return type
str
correct_text(text: str)[source]¶
Correct all the words within a text, returning the corrected text.
- Parameters
text (str) –
- Returns
result
- Return type
str
class malaya.spell.Symspell(model, verbosity, corpus, k=10)[source]¶
The SymspellCorrector extends the functionality of symspeller (https://github.com/mammothb/symspellpy) and improves it using algorithms from Normalization of noisy texts in Malaysian online reviews, https://www.researchgate.net/publication/287050449_Normalization_of_noisy_texts_in_Malaysian_online_reviews, with added custom vowel augmentation.
edit_step(word)[source]¶
Generate candidates given a word.
- Parameters
word (str) –
- Returns
result
- Return type
{candidate1, candidate2}
edit_candidates(word)[source]¶
Generate candidates given a word.
- Parameters
word (str) –
- Returns
result
- Return type
{candidate1, candidate2}
correct(word: str, **kwargs)[source]¶
Most probable spelling correction for word.
- Parameters
word (str) –
- Returns
result
- Return type
str
malaya.spell.probability(sentence_piece: bool = False, **kwargs)[source]¶
Train a Probability spell corrector.
- Parameters
sentence_piece (bool, optional (default=False)) – if True, reduce possible augmentation states using sentence piece.
- Returns
result
- Return type
malaya.spell.Probability class
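A minimal sketch of the probability corrector using the correct and correct_text methods documented above; the misspelled samples are arbitrary:

import malaya

corrector = malaya.spell.probability()
print(corrector.correct('kerajaaan'))                # single-word correction
print(corrector.correct_text('kerajaaan malaysia'))  # whole-text correction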
malaya.spell.symspell(max_edit_distance_dictionary: int = 2, prefix_length: int = 7, term_index: int = 0, count_index: int = 1, top_k: int = 10, **kwargs)[source]¶
Train a Symspell spell corrector.
- Returns
result
- Return type
malaya.spell.Symspell class
malaya.spell.transformer(model, sentence_piece: bool = False, **kwargs)[source]¶
Load a Transformer spell corrector. Currently only BERT and ALBERT are supported.
- Parameters
sentence_piece (bool, optional (default=False)) – if True, reduce possible augmentation states using sentence piece.
- Returns
result
- Return type
malaya.spell.Transformer class
class malaya.spell.Transformer[source]¶
correct(word: str, string: str, index: int = -1, batch_size: int = 20)[source]¶
Correct a word within a text, returning the corrected word.
malaya.stack¶
malaya.stack.voting_stack(models, text: str)[source]¶
Stacking for POS, entity and dependency models.
- Parameters
models (list) – list of models.
text (str) – string to predict.
- Returns
result
- Return type
list
malaya.stack.predict_stack(models, strings: List[str], aggregate: Callable = <function gmean>, **kwargs)[source]¶
Stacking for predictive models.
- Parameters
models (List[Callable]) – list of models.
strings (List[str]) –
aggregate (Callable, optional (default=scipy.stats.mstats.gmean)) – Aggregate function.
- Returns
result
- Return type
dict
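A hedged sketch of predict_stack using two documented sentiment loaders; it assumes any predictive models whose probability outputs can be aggregated (here with the default gmean) are valid inputs:

import malaya

multinomial = malaya.sentiment.multinomial()
transformer = malaya.sentiment.transformer(model='bert')
print(malaya.stack.predict_stack([multinomial, transformer],
                                 ['kerajaan sangat teruk']))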
malaya.stem¶
malaya.stem.naive()[source]¶
Load a naive stemming model that uses regex patterns on prefixes (startswith) and suffixes (endswith).
- Returns
result
- Return type
malaya.stem.Naive class
malaya.stem.sastrawi()[source]¶
Load stemming model using Sastrawi; this also includes lemmatization.
- Returns
result
- Return type
malaya.stem.Sastrawi class
malaya.stem.deep_model(quantized: bool = False, **kwargs)[source]¶
Load LSTM + Bahdanau attention stemming model; this also includes lemmatization. Original size 41.6MB, quantized size 10.6MB.
- Parameters
quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result
- Return type
malaya.stem.DeepStemmer class
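A hedged usage sketch; the loaders are documented above, while the stem call assumes stemmer classes expose a stem(string) method as in Malaya's examples:

import malaya

stemmer = malaya.stem.sastrawi()
# assumption: the stemmer object exposes stem(string)
print(stemmer.stem('makanan kesukaannya'))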
malaya.subjectivity¶
malaya.subjectivity.available_transformer()[source]¶
List available transformer subjectivity analysis models.
malaya.subjectivity.multinomial(**kwargs)[source]¶
Load multinomial subjectivity model.
- Parameters
validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
- Returns
result
- Return type
malaya.model.ml.Bayes class
malaya.subjectivity.transformer(model: str = 'bert', quantized: bool = False, **kwargs)[source]¶
Load Transformer subjectivity model.
- Parameters
model (str, optional (default='bert')) –
Model architecture supported. Allowed values:
'bert' - Google BERT BASE parameters.
'tiny-bert' - Google BERT TINY parameters.
'albert' - Google ALBERT BASE parameters.
'tiny-albert' - Google ALBERT TINY parameters.
'xlnet' - Google XLNET BASE parameters.
'alxlnet' - Malaya ALXLNET BASE parameters.
quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result – one of the following model classes:
if 'bert' in model, returns malaya.model.bert.BinaryBERT.
if 'xlnet' in model, returns malaya.model.xlnet.BinaryXLNET.
- Return type
model
malaya.tatabahasa¶
malaya.tatabahasa.describe()[source]¶
Describe supported kesalahan tatabahasa (grammatical errors). Full description at https://tatabahasabm.tripod.com/tata/salahtata.htm
malaya.tatabahasa.transformer(model: str = 'base', quantized: bool = False, **kwargs)[source]¶
Load Malaya transformer encoder-decoder + tagging model to correct kesalahan tatabahasa (grammatical errors) in text.
- Parameters
model (str, optional (default='base')) –
Model architecture supported. Allowed values:
'small' - Malaya Transformer Tag SMALL parameters.
'base' - Malaya Transformer Tag BASE parameters.
quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result
- Return type
malaya.model.tf.Tatabahasa class
malaya.summarization.abstractive¶
malaya.summarization.abstractive.available_transformer()[source]¶
List available transformer models.
malaya.summarization.abstractive.transformer(model: str = 't2t', quantized: bool = False, **kwargs)[source]¶
Load Malaya transformer encoder-decoder model to generate a summary given a string.
- Parameters
model (str, optional (default='t2t')) –
Model architecture supported. Allowed values:
't2t' - Malaya Transformer BASE parameters.
'small-t2t' - Malaya Transformer SMALL parameters.
't5' - T5 BASE parameters.
'small-t5' - T5 SMALL parameters.
'bigbird' - BigBird + Pegasus BASE parameters.
'small-bigbird' - BigBird + Pegasus SMALL parameters.
quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result – one of the following model classes:
if 't2t' in model, returns malaya.model.tf.Summarization.
if 't5' in model, returns malaya.model.t5.Summarization.
if 'bigbird' in model, returns malaya.model.bigbird.Summarization.
- Return type
model
malaya.summarization.extractive¶
malaya.summarization.extractive.encoder(vectorizer)[source]¶
Encoder interface for summarization.
- Parameters
vectorizer (object) – encoder interface object, e.g. BERT, XLNET, ALBERT, ALXLNET; should have a vectorize method.
- Returns
result
- Return type
-
malaya.summarization.extractive.
doc2vec
(wordvector)[source]¶ Doc2Vec interface for summarization.
- Parameters
wordvector (object) – malaya.wordvector.WordVector object; should have a get_vector_by_name method.
- Returns
result
- Return type
malaya.summarization.extractive.sklearn(model, vectorizer)[source]¶
sklearn interface for summarization.
- Parameters
model (object) –
Should have a fit_transform method. Commonly:
sklearn.decomposition.TruncatedSVD - LSA algorithm.
sklearn.decomposition.LatentDirichletAllocation - LDA algorithm.
vectorizer (object) –
Should have a fit_transform method. Commonly:
sklearn.feature_extraction.text.TfidfVectorizer - TFIDF algorithm.
sklearn.feature_extraction.text.CountVectorizer - Bag-of-Words algorithm.
malaya.text.vectorizer.SkipGramCountVectorizer - Skip-Gram Bag-of-Words algorithm.
malaya.text.vectorizer.SkipGramTfidfVectorizer - Skip-Gram TFIDF algorithm.
- Returns
result
- Return type
malaya.similarity¶
malaya.similarity.doc2vec(wordvector)[source]¶
Doc2Vec interface for text similarity.
- Parameters
wordvector (object) – malaya.wordvector.WordVector object; should have a get_vector_by_name method.
- Returns
result
- Return type
malaya.similarity.encoder(vectorizer)[source]¶
Encoder interface for text similarity.
- Parameters
vectorizer (object) – encoder interface object, BERT, skip-thought, XLNET.
- Returns
result
- Return type
malaya.similarity.transformer(model: str = 'bert', quantized: bool = False, **kwargs)[source]¶
Load Transformer similarity model.
- Parameters
model (str, optional (default='bert')) –
Model architecture supported. Allowed values:
'bert' - Google BERT BASE parameters.
'tiny-bert' - Google BERT TINY parameters.
'albert' - Google ALBERT BASE parameters.
'tiny-albert' - Google ALBERT TINY parameters.
'xlnet' - Google XLNET BASE parameters.
'alxlnet' - Malaya ALXLNET BASE parameters.
quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result – one of the following model classes:
if 'bert' in model, returns malaya.model.bert.SiameseBERT.
if 'xlnet' in model, returns malaya.model.xlnet.SiameseXLNET.
- Return type
model
class malaya.similarity.VectorizerSimilarity[source]¶
predict_proba(left_strings: List[str], right_strings: List[str], similarity: str = 'cosine')[source]¶
Calculate similarity for two different batches of texts.
- Parameters
left_strings (list of str) –
right_strings (list of str) –
similarity (str, optional (default='cosine')) –
Similarity supported. Allowed values:
'cosine' - cosine similarity.
'euclidean' - euclidean similarity.
'manhattan' - manhattan similarity.
- Returns
result
- Return type
List[float]
heatmap(strings: List[str], similarity: str = 'cosine', visualize: bool = True, annotate: bool = True, figsize: Tuple[int, int] = (7, 7))[source]¶
Plot a heatmap based on the output of the similarity model.
- Parameters
strings (list of str) – list of strings.
similarity (str, optional (default='cosine')) –
Similarity supported. Allowed values:
'cosine' - cosine similarity.
'euclidean' - euclidean similarity.
'manhattan' - manhattan similarity.
visualize (bool) – if True, it will render plt.show, else return data.
figsize (tuple, (default=(7, 7))) – figure size for plot.
- Returns
result – list of results
- Return type
list
class malaya.similarity.Doc2VecSimilarity[source]¶
predict_proba(left_strings: List[str], right_strings: List[str], aggregation: Callable = <function mean>, similarity: str = 'cosine', soft: bool = False)[source]¶
Calculate similarity for two different batches of texts.
- Parameters
left_strings (list of str) –
right_strings (list of str) –
aggregation (Callable, optional (default=numpy.mean)) –
similarity (str, optional (default='cosine')) –
Similarity supported. Allowed values:
'cosine' - cosine similarity.
'euclidean' - euclidean similarity.
'manhattan' - manhattan similarity.
soft (bool, optional (default=False)) – if True, a word not in the word vector will be replaced with the nearest word; otherwise it is skipped.
- Returns
result
- Return type
List[float]
heatmap(strings: List[str], aggregation: Callable = <function mean>, similarity: str = 'cosine', soft: bool = False, visualize: bool = True, annotate: bool = True, figsize: Tuple[int, int] = (7, 7))[source]¶
Plot a heatmap based on the output of the similarity model.
- Parameters
strings (list of str) – list of strings.
aggregation (Callable, optional (default=numpy.mean)) –
similarity (str, optional (default='cosine')) –
Similarity supported. Allowed values:
'cosine' - cosine similarity.
'euclidean' - euclidean similarity.
'manhattan' - manhattan similarity.
soft (bool, optional (default=False)) – if True, a word not in the word vector will be replaced with the nearest word; otherwise it is skipped.
visualize (bool) – if True, it will render plt.show, else return data.
figsize (tuple, (default=(7, 7))) – figure size for plot.
- Returns
result – list of results.
- Return type
list
malaya.topic_model¶
malaya.topic_model.sklearn(corpus: List[str], model, vectorizer, n_topics: int, cleaning=<function simple_textcleaning>, stopwords=<function get_stopwords>, **kwargs)[source]¶
Train an sklearn model for topic modelling on the given corpus / list of strings.
- Parameters
corpus (list) –
model (object) –
Should have a fit_transform method. Commonly:
sklearn.decomposition.TruncatedSVD - LSA algorithm.
sklearn.decomposition.LatentDirichletAllocation - LDA algorithm.
sklearn.decomposition.NMF - NMF algorithm.
vectorizer (object) –
Should have a fit_transform method. Commonly:
sklearn.feature_extraction.text.TfidfVectorizer - TFIDF algorithm.
sklearn.feature_extraction.text.CountVectorizer - Bag-of-Words algorithm.
malaya.text.vectorizer.SkipGramCountVectorizer - Skip-Gram Bag-of-Words algorithm.
malaya.text.vectorizer.SkipGramTfidfVectorizer - Skip-Gram TFIDF algorithm.
n_topics (int, (default=10)) – number of decomposition columns (topics).
cleaning (function, (default=malaya.text.function.simple_textcleaning)) – function to clean the corpus.
stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – a callable that returns a List[str], or a List[str], or a Tuple[str].
- Returns
result
- Return type
malaya.topic_modelling.Topic class
malaya.topic_model.lda2vec(corpus: List[str], vectorizer, n_topics: int = 10, cleaning=<function simple_textcleaning>, stopwords=<function get_stopwords>, window_size: int = 2, embedding_size: int = 128, epoch: int = 10, switch_loss: int = 1000, **kwargs)[source]¶
Train an LDA2Vec model for topic modelling on the given corpus / list of strings.
- Parameters
corpus (list) –
vectorizer (object) –
Should have a fit_transform method. Commonly:
sklearn.feature_extraction.text.TfidfVectorizer - TFIDF algorithm.
sklearn.feature_extraction.text.CountVectorizer - Bag-of-Words algorithm.
malaya.text.vectorizer.SkipGramCountVectorizer - Skip-Gram Bag-of-Words algorithm.
malaya.text.vectorizer.SkipGramTfidfVectorizer - Skip-Gram TFIDF algorithm.
n_topics (int, (default=10)) – number of decomposition columns (topics).
cleaning (function, (default=malaya.text.function.simple_textcleaning)) – function to clean the corpus.
stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – a callable that returns a List[str], or a List[str], or a Tuple[str].
embedding_size (int, (default=128)) – embedding size of lda2vec tensors.
epoch (int, (default=10)) – number of training iterations.
switch_loss (int, (default=1000)) – baseline to switch from document-based loss to document + word-based loss.
- Returns
result
- Return type
malaya.topic_modelling.DeepTopic class
malaya.topic_model.attention(corpus: List[str], n_topics: int, vectorizer, cleaning=<function simple_textcleaning>, stopwords=<function get_stopwords>, ngram: Tuple[int, int] = (1, 3), batch_size: int = 10)[source]¶
Use attention from a transformer model for topic modelling on the given corpus / list of strings.
- Parameters
corpus (list) –
n_topics (int, (default=10)) – number of decomposition columns (topics).
vectorizer (object) –
cleaning (function, (default=malaya.text.function.simple_textcleaning)) – function to clean the corpus.
stopwords (List[str], (default=malaya.texts.function.get_stopwords)) – a callable that returns a List[str], or a List[str], or a Tuple[str].
ngram (tuple, (default=(1,3))) – n-gram sizes to train on the corpus.
batch_size (int, (default=10)) – number of strings for each vectorization and attention step.
- Returns
result
- Return type
malaya.topic_modelling.AttentionTopic class
class malaya.topic_model.AttentionTopic[source]¶
top_topics(len_topic: int, top_n: int = 10, return_df: bool = True)[source]¶
Print important topics based on decomposition.
- Parameters
len_topic (int) – number of topics.
top_n (int, optional (default=10)) – top n of each topic.
return_df (bool, optional (default=True)) – return as pandas.DataFrame, else JSON.
class malaya.topic_model.DeepTopic[source]¶
visualize_topics(notebook_mode: bool = False, mds: str = 'pcoa')[source]¶
Visualize important topics based on decomposition.
- Parameters
mds (str, optional (default='pcoa')) –
2D decomposition. Allowed values:
'pcoa' - Dimension reduction via Jensen-Shannon Divergence & Principal Coordinate Analysis (aka Classical Multidimensional Scaling).
'mmds' - Dimension reduction via Multidimensional scaling.
'tsne' - Dimension reduction via t-distributed stochastic neighbor embedding.
top_topics(len_topic: int, top_n: int = 10, return_df: bool = True)[source]¶
Print important topics based on decomposition.
- Parameters
len_topic (int) – number of topics.
top_n (int, optional (default=10)) – top n of each topic.
return_df (bool, optional (default=True)) – return as pandas.DataFrame, else JSON.
class malaya.topic_model.Topic[source]¶
visualize_topics(notebook_mode: bool = False, mds: str = 'pcoa')[source]¶
Visualize important topics based on decomposition.
- Parameters
mds (str, optional (default='pcoa')) –
2D decomposition. Allowed values:
'pcoa' - Dimension reduction via Jensen-Shannon Divergence & Principal Coordinate Analysis (aka Classical Multidimensional Scaling).
'mmds' - Dimension reduction via Multidimensional scaling.
'tsne' - Dimension reduction via t-distributed stochastic neighbor embedding.
top_topics(len_topic: int, top_n: int = 10, return_df: bool = True)[source]¶
Print important topics based on decomposition.
- Parameters
len_topic (int) – number of topics.
top_n (int, optional (default=10)) – top n of each topic.
return_df (bool, optional (default=True)) – return as pandas.DataFrame, else JSON.
malaya.toxicity¶
malaya.toxicity.available_transformer()[source]¶
List available transformer toxicity analysis models.
malaya.toxicity.multinomial(**kwargs)[source]¶
Load multinomial toxicity model.
- Returns
result
- Return type
malaya.model.ml.MultilabelBayes class
-
malaya.toxicity.
transformer
(model: str = 'xlnet', quantized: bool = False, **kwargs)[source]¶ Load Transformer toxicity model.
- Parameters
model (str, optional (default='bert')) –
Model architecture supported. Allowed values:
'bert' - Google BERT BASE parameters.
'tiny-bert' - Google BERT TINY parameters.
'albert' - Google ALBERT BASE parameters.
'tiny-albert' - Google ALBERT TINY parameters.
'xlnet' - Google XLNET BASE parameters.
'alxlnet' - Malaya ALXLNET BASE parameters.
quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result – List of model classes:
if bert in model, will return malaya.model.bert.SigmoidBERT.
if xlnet in model, will return malaya.model.xlnet.SigmoidXLNET.
- Return type
model
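Example: a short sketch of loading and scoring toxicity (the input string is made up); predict_proba is documented under malaya.model.xlnet.SigmoidXLNET below:

>>> import malaya
>>> model = malaya.toxicity.transformer(model='alxlnet', quantized=False)
>>> model.predict_proba(['bodoh betul kerajaan ni'])  # probability per toxicity label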
malaya.transformer¶
-
malaya.transformer.
available_transformer_standard_language
()[source]¶ List available transformer models.
-
malaya.transformer.
load
(model: str = 'electra', pool_mode: str = 'last', **kwargs)[source]¶ Load transformer model.
- Parameters
model (str, optional (default='electra')) –
Model architecture supported. Allowed values:
'bert' - Google BERT BASE parameters.
'tiny-bert' - Google BERT TINY parameters.
'albert' - Google ALBERT BASE parameters.
'tiny-albert' - Google ALBERT TINY parameters.
'xlnet' - Google XLNET BASE parameters.
'alxlnet' - Malaya ALXLNET BASE parameters.
'electra' - Google ELECTRA BASE parameters.
'small-electra' - Google ELECTRA SMALL parameters.
pool_mode (str, optional (default='last')) –
Model logits architecture supported. Only usable if model in [‘xlnet’, ‘alxlnet’]. Allowed values:
'last' - last of the sequence.
'first' - first of the sequence.
'mean' - mean of the sequence.
'attn' - attention of the sequence.
- Returns
result – List of model classes:
if bert in model, will return malaya.transformers.bert.Model.
if xlnet in model, will return malaya.transformers.xlnet.Model.
if albert in model, will return malaya.transformers.albert.Model.
if electra in model, will return malaya.transformers.electra.Model.
- Return type
model
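Example: a sketch of loading a transformer for feature extraction; vectorize and attention are documented on the Model classes under malaya.transformers below:

>>> import malaya
>>> electra = malaya.transformer.load(model='electra')
>>> electra.vectorize(['saya suka makan ayam'])  # np.array of vectors
>>> electra.attention(['saya suka makan ayam'], method='last')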
malaya.translation.en_ms¶
-
malaya.translation.en_ms.
transformer
(model: str = 'base', quantized: bool = False, **kwargs)[source]¶ Load transformer encoder-decoder model to translate EN-to-MS.
- Parameters
model (str, optional (default='base')) –
Model architecture supported. Allowed values:
'small' - Transformer SMALL parameters.
'base' - Transformer BASE parameters.
'large' - Transformer LARGE parameters.
'bigbird' - BigBird BASE parameters.
'small-bigbird' - BigBird SMALL parameters.
quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result – List of model classes:
if bigbird in model, return malaya.model.bigbird.Translation.
else, return malaya.model.tf.Translation.
- Return type
model
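Example: a sketch of EN-to-MS translation; greedy_decoder is assumed to be the decoding method on the returned Translation class, since malaya.model.tf.Translation is documented without methods in this reference. malaya.translation.ms_en below mirrors this usage:

>>> import malaya
>>> translator = malaya.translation.en_ms.transformer(model='base')
>>> translator.greedy_decoder(['I love Malaysia'])  # assumed method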
malaya.translation.ms_en¶
-
malaya.translation.ms_en.
transformer
(model: str = 'base', quantized: bool = False, **kwargs)[source]¶ Load Transformer encoder-decoder model to translate MS-to-EN.
- Parameters
model (str, optional (default='base')) –
Model architecture supported. Allowed values:
'small'
- Transformer SMALL parameters.'base'
- Transformer BASE parameters.'large'
- Transformer LARGE parameters.'bigbird'
- BigBird BASE parameters.'small-bigbird'
- BigBird SMALL parameters.
quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result – List of model classes:
if bigbird in model, return malaya.model.bigbird.Translation.
else, return malaya.model.tf.Translation.
- Return type
model
malaya.true_case¶
-
malaya.true_case.
transformer
(model: str = 'base', quantized: bool = False, **kwargs)[source]¶ Load transformer encoder-decoder model to True Case.
- Parameters
model (str, optional (default='base')) –
Model architecture supported. Allowed values:
'small' - Transformer SMALL parameters.
'base' - Transformer BASE parameters.
quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result
- Return type
malaya.model.tf.TrueCase class
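Example: a sketch of true casing; greedy_decoder is assumed to be the decoding method on malaya.model.tf.TrueCase, which is documented without methods in this reference:

>>> import malaya
>>> model = malaya.true_case.transformer(model='small')
>>> model.greedy_decoder(['kuala lumpur ialah ibu negara malaysia'])  # assumed method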
malaya.word2num¶
-
malaya.word2num.
word2num
(string)[source]¶ Translate from string to number, eg ‘kesepuluh’ -> 10.
- Parameters
string (str) –
- Returns
result
- Return type
int / float
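Example: usage is a single call; the second input is an assumed illustration:

>>> import malaya
>>> malaya.word2num.word2num('kesepuluh')
10
>>> malaya.word2num.word2num('dua belas')  # assumed to parse to 12
12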
malaya.wordvector¶
-
malaya.wordvector.
load_wiki
()[source]¶ Return Malaya pretrained Wikipedia word2vec with embedding size 256. https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/wordvector
- Returns
vocabulary (indices dictionary for vector.)
vector (np.array, 2D.)
-
malaya.wordvector.
load_news
()[source]¶ Return Malaya pretrained local Malaysia news word2vec with embedding size 256. https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/wordvector
- Returns
vocabulary (indices dictionary for vector.)
vector (np.array, 2D.)
-
malaya.wordvector.
load_social_media
()[source]¶ Return Malaya pretrained local Malaysia social media word2vec with embedding size 256. https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/wordvector
- Returns
vocabulary (indices dictionary for vector.)
vector (np.array, 2D.)
-
malaya.wordvector.
load_wiki_news_social_media
()[source]¶ Return Malaya pretrained Wikipedia + social media + news word2vec with embedding size 256. https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/wordvector
- Returns
vocabulary (indices dictionary for vector.)
vector (np.array, 2D.)
-
malaya.wordvector.
load
(embed_matrix, dictionary: dict)[source]¶ Return malaya.wordvector.WordVector object.
- Parameters
embed_matrix (numpy array) –
dictionary (dictionary) –
- Returns
WordVector
- Return type
malaya.wordvector.WordVector object
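Example: a sketch tying the loaders together, wrapping a pretrained embedding in the WordVector interface:

>>> import malaya
>>> vocabulary, vector = malaya.wordvector.load_wiki()
>>> wv = malaya.wordvector.load(embed_matrix=vector, dictionary=vocabulary)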
-
class
malaya.wordvector.
WordVector
[source]¶ -
get_vector_by_name
(word: str, soft: bool = False, topn_soft: int = 5)[source]¶ get vector based on string.
- Parameters
word (str) –
soft (bool, (default=False)) – if True, a word not in the dictionary will be replaced with the nearest word by JaroWinkler ratio; if False, it will throw an exception for a word not in the dictionary.
topn_soft (int, (default=5)) – if the word is not found in the dictionary, will return the topn_soft most similar words using JaroWinkler ratio.
- Returns
vector
- Return type
np.array, 1D
-
tree_plot
(labels, figsize: Tuple[int, int] = (7, 7), annotate: bool = True)[source]¶ plot a tree plot based on output from calculator / n_closest / analogy.
- Parameters
labels (list) – output from calculator / n_closest / analogy.
annotate (bool, (default=True)) – if True, annotate the plot values.
figsize (tuple, (default=(7, 7))) – figure size for plot.
- Returns
embed (np.array, 2D.)
labelled (labels for X / Y axis.)
-
scatter_plot
(labels, centre: str = None, figsize: Tuple[int, int] = (7, 7), plus_minus: int = 25, handoff: float = 5e-05)[source]¶ plot a scatter plot based on output from calculator / n_closest / analogy.
- Parameters
labels (list) – output from calculator / n_closest / analogy
centre (str, (default=None)) – centre label; if a str, it will be annotated in red.
figsize (tuple, (default=(7, 7))) – figure size for plot.
- Returns
tsne
- Return type
np.array, 2D.
-
batch_calculator
(equations: List[str], num_closest: int = 5, return_similarity: bool = False)[source]¶ Batch calculator parser for word2vec using TensorFlow.
- Parameters
equations (list of str) – Eg, ‘[(mahathir + najib) - rosmah]’
num_closest (int, (default=5)) – number of words closest to the result.
- Returns
word_list
- Return type
list of nearest words
-
calculator
(equation: str, num_closest: int = 5, metric: str = 'cosine', return_similarity: bool = True)[source]¶ calculator parser for word2vec.
- Parameters
equation (str) – Eg, ‘(mahathir + najib) - rosmah’
num_closest (int, (default=5)) – number of words closest to the result.
metric (str, (default='cosine')) – vector distance algorithm.
return_similarity (bool, (default=True)) – if True, will return a value between 0 and 1 representing the distance.
- Returns
word_list
- Return type
list of nearest words
-
batch_n_closest
(words: List[str], num_closest: int = 5, return_similarity: bool = False, soft: bool = True)[source]¶ find nearest words based on a batch of words using TensorFlow.
- Parameters
words (list) – Eg, [‘najib’,’anwar’]
num_closest (int, (default=5)) – number of words closest to the result.
return_similarity (bool, (default=False)) – if True, will return a value between 0 and 1 representing the distance.
soft (bool, (default=True)) – if True, a word not in the dictionary will be replaced with nearest JaroWinkler ratio. if False, it will throw an exception if a word not in the dictionary.
- Returns
word_list
- Return type
list of nearest words
-
n_closest
(word: str, num_closest: int = 5, metric: str = 'cosine', return_similarity: bool = True)[source]¶ find nearest words based on a word.
- Parameters
word (str) – Eg, ‘najib’
num_closest (int, (default=5)) – number of words closest to the result.
metric (str, (default='cosine')) – vector distance algorithm.
return_similarity (bool, (default=True)) – if True, will return a value between 0 and 1 representing the distance.
- Returns
word_list
- Return type
list of nearest words
-
analogy
(a: str, b: str, c: str, num: int = 1, metric: str = 'cosine')[source]¶ analogy calculation, vb - va + vc.
- Parameters
a (str) –
b (str) –
c (str) –
num (int, (default=1)) –
metric (str, (default='cosine')) – vector distance algorithm.
- Returns
word_list
- Return type
list of nearest words.
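Example: a sketch of the query methods above, reusing wv from the loading sketch (actual outputs depend on the pretrained vectors):

>>> wv.n_closest('najib', num_closest=5)
>>> wv.calculator('(mahathir + najib) - rosmah', num_closest=5)
>>> wv.analogy('mahathir', 'najib', 'anwar', num=3)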
-
project_2d
(start: int, end: int)[source]¶ project word2vec into 2d dimension.
- Parameters
start (int) –
end (int) –
- Returns
embed_2d (TSNE decomposition)
word_list (words in between start and end.)
-
network
(word: str, num_closest: int = 8, depth: int = 4, min_distance: float = 0.5, iteration: int = 300, figsize: Tuple[int, int] = (15, 15), node_color: str = '#72bbd0', node_factor: int = 50)[source]¶ Plot a social network based on the word given.
- Parameters
word (str) – centre of social network.
num_closest (int, (default=8)) – number of words closest to the node.
depth (int, (default=4)) – depth of social network. Deeper networks are more expensive to compute, O(num_closest ** depth).
min_distance (float, (default=0.5)) – minimum distance among nodes. Increase the value to increase the distance among nodes.
iteration (int, (default=300)) – number of loops to train the social network to fit min_distance.
figsize (tuple, (default=(15, 15))) – figure size for plot.
node_color (str, (default='#72bbd0')) – color for nodes.
node_factor (int, (default=50)) – size factor for depth nodes. Increasing this value increases node sizes based on depth.
- Returns
G
- Return type
networkx graph object
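Example: the plotting helpers consume output from calculator / n_closest / analogy; a sketch:

>>> labels = wv.n_closest('najib', num_closest=8)
>>> wv.tree_plot(labels)
>>> wv.scatter_plot(labels, centre='najib')
>>> G = wv.network('najib', num_closest=8, depth=2)  # returns a networkx graph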
-
malaya.zero_shot.classification¶
-
malaya.zero_shot.classification.
available_transformer
()[source]¶ List available transformer zero-shot models.
-
malaya.zero_shot.classification.
transformer
(model: str = 'bert', quantized: bool = False, **kwargs)[source]¶ Load Transformer zero-shot model.
- Parameters
model (str, optional (default='bert')) –
Model architecture supported. Allowed values:
'bert' - Google BERT BASE parameters.
'tiny-bert' - Google BERT TINY parameters.
'albert' - Google ALBERT BASE parameters.
'tiny-albert' - Google ALBERT TINY parameters.
'xlnet' - Google XLNET BASE parameters.
'alxlnet' - Malaya ALXLNET BASE parameters.
quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result – List of model classes:
if bert in model, will return malaya.model.bert.ZeroshotBERT.
if xlnet in model, will return malaya.model.xlnet.ZeroshotXLNET.
- Return type
model
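Example: a sketch of zero-shot classification; predict_proba taking a labels argument is an assumption, since only vectorize is documented for the Zeroshot classes below:

>>> import malaya
>>> model = malaya.zero_shot.classification.transformer(model='alxlnet')
>>> model.predict_proba(['kerajaan perlu kuatkan ekonomi'],
...                     labels=['ekonomi', 'sukan', 'politik'])  # assumed signature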
malaya.model.bert¶
-
class
malaya.model.bert.
BinaryBERT
[source]¶ -
vectorize
(strings: List[str], method: str = 'first')[source]¶ vectorize list of strings.
- Parameters
strings (List[str]) –
method (str, optional (default='first')) –
Vectorization layer supported. Allowed values:
'last' - vector from last sequence.
'first' - vector from first sequence.
'mean' - average vectors from all sequences.
'word' - average vectors based on tokens.
- Returns
result
- Return type
np.array
-
predict
(strings: List[str], add_neutral: bool = True)[source]¶ classify list of strings.
- Parameters
strings (List[str]) –
add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.
- Returns
result
- Return type
List[str]
-
predict_proba
(strings: List[str], add_neutral: bool = True)[source]¶ classify list of strings and return probability.
- Parameters
strings (List[str]) –
add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.
- Returns
result
- Return type
List[dict[str, float]]
-
predict_words
(string: str, method: str = 'last', visualization: bool = True)[source]¶ classify words.
- Parameters
string (str) –
method (str, optional (default='last')) –
Attention layer supported. Allowed values:
'last' - attention from last layer.
'first' - attention from first layer.
'mean' - average attentions from all layers.
visualization (bool, optional (default=True)) – If True, it will open the visualization dashboard.
- Returns
result
- Return type
dict
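Example: BinaryBERT instances are returned by binary classifier loaders elsewhere in Malaya; malaya.sentiment.transformer is assumed as the loader here:

>>> import malaya
>>> model = malaya.sentiment.transformer(model='bert')  # assumed loader
>>> model.predict_proba(['kerajaan sebenarnya sangat prihatin'], add_neutral=True)
>>> model.predict_words('kerajaan sebenarnya sangat prihatin', visualization=False)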
-
-
class
malaya.model.bert.
MulticlassBERT
[source]¶ -
vectorize
(strings: List[str], method: str = 'first')[source]¶ vectorize list of strings.
- Parameters
strings (List[str]) –
method (str, optional (default='first')) –
Vectorization layer supported. Allowed values:
'last' - vector from last sequence.
'first' - vector from first sequence.
'mean' - average vectors from all sequences.
'word' - average vectors based on tokens.
- Returns
result
- Return type
np.array
-
predict
(strings: List[str])[source]¶ classify list of strings.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
List[str]
-
predict_proba
(strings: List[str])[source]¶ classify list of strings and return probability.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
List[dict[str, float]]
-
predict_words
(string: str, method: str = 'last', visualization: bool = True)[source]¶ classify words.
- Parameters
string (str) –
method (str, optional (default='last')) –
Attention layer supported. Allowed values:
'last' - attention from last layer.
'first' - attention from first layer.
'mean' - average attentions from all layers.
visualization (bool, optional (default=True)) – If True, it will open the visualization dashboard.
- Returns
result
- Return type
dict
-
-
class
malaya.model.bert.
SigmoidBERT
[source]¶ -
vectorize
(strings: List[str], method: str = 'first')[source]¶ vectorize list of strings.
- Parameters
strings (List[str]) –
method (str, optional (default='first')) –
Vectorization layer supported. Allowed values:
'last' - vector from last sequence.
'first' - vector from first sequence.
'mean' - average vectors from all sequences.
'word' - average vectors based on tokens.
- Returns
result
- Return type
np.array
-
predict
(strings: List[str])[source]¶ classify list of strings.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
List[List[str]]
-
predict_proba
(strings: List[str])[source]¶ classify list of strings and return probability.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
List[dict[str, float]]
-
predict_words
(string: str, method: str = 'last', visualization: bool = True)[source]¶ classify words.
- Parameters
string (str) –
method (str, optional (default='last')) –
Attention layer supported. Allowed values:
'last' - attention from last layer.
'first' - attention from first layer.
'mean' - average attentions from all layers.
visualization (bool, optional (default=True)) – If True, it will open the visualization dashboard.
- Returns
result
- Return type
dict
-
-
class
malaya.model.bert.
SiameseBERT
[source]¶ -
vectorize
(strings: List[str])[source]¶ Vectorize list of strings.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
np.array
-
predict_proba
(strings_left: List[str], strings_right: List[str])[source]¶ calculate similarity for two different batches of texts.
- Parameters
strings_left (List[str]) –
strings_right (List[str]) –
- Returns
list
- Return type
list of float
-
heatmap
(strings: List[str], visualize: bool = True, annotate: bool = True, figsize: Tuple[int, int] = (7, 7))[source]¶ plot a heatmap based on output from similarity.
- Parameters
strings (list of str) – list of strings.
visualize (bool) – if True, it will render plt.show, else return data.
figsize (tuple, (default=(7, 7))) – figure size for plot.
- Returns
result – list of results
- Return type
list
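Example: SiameseBERT backs the text-similarity interface; malaya.similarity.transformer is assumed as the loader here:

>>> import malaya
>>> model = malaya.similarity.transformer(model='bert')  # assumed loader
>>> model.predict_proba(['saya suka ayam'], ['saya gemar ayam'])
>>> model.heatmap(['saya suka ayam', 'saya gemar ayam', 'hari ini hujan'])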
-
-
class
malaya.model.bert.
TaggingBERT
[source]¶ -
vectorize
(string: str)[source]¶ vectorize a string.
- Parameters
string (str) –
- Returns
result
- Return type
np.array
-
-
class
malaya.model.bert.
DependencyBERT
[source]¶
-
class
malaya.model.bert.
ZeroshotBERT
[source]¶ -
vectorize
(strings: List[str], labels: List[str], method: str = 'first')[source]¶ vectorize list of strings.
- Parameters
strings (List[str]) –
labels (List[str]) –
method (str, optional (default='first')) –
Vectorization layer supported. Allowed values:
'last' - vector from last sequence.
'first' - vector from first sequence.
'mean' - average vectors from all sequences.
'word' - average vectors based on tokens.
- Returns
result
- Return type
np.array
-
malaya.model.bigbird¶
-
class
malaya.model.bigbird.
MulticlassBigBird
[source]¶ -
vectorize
(strings: List[str], method: str = 'first')[source]¶ vectorize list of strings.
- Parameters
strings (List[str]) –
method (str, optional (default='first')) –
Vectorization layer supported. Allowed values:
'last' - vector from last sequence.
'first' - vector from first sequence.
'mean' - average vectors from all sequences.
'word' - average vectors based on tokens.
- Returns
result
- Return type
np.array
-
malaya.model.extractive_summarization¶
-
class
malaya.model.extractive_summarization.
SKLearn
[source]¶ -
word_level
(corpus, isi_penting: str = None, window_size: int = 10, important_words: int = 10, **kwargs)[source]¶ Summarize list of strings / string on word level.
- Parameters
corpus (str / List[str]) –
isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.
window_size (int, (default=10)) – window size for each word.
important_words (int, (default=10)) – number of important words.
- Returns
dict
- Return type
{‘top-words’, ‘cluster-top-words’, ‘score’}
-
sentence_level
(corpus, isi_penting: str = None, top_k: int = 3, important_words: int = 10, **kwargs)[source]¶ Summarize list of strings / string on sentence level.
- Parameters
corpus (str / List[str]) –
isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.
top_k (int, (default=3)) – number of summarized strings.
important_words (int, (default=10)) – number of important words.
- Returns
dict
- Return type
{‘summary’, ‘top-words’, ‘cluster-top-words’, ‘score’}
-
-
class
malaya.model.extractive_summarization.
Doc2Vec
[source]¶ -
word_level
(corpus, isi_penting: Optional[str] = None, window_size: int = 10, aggregation=<function mean>, soft: bool = False, **kwargs)[source]¶ Summarize list of strings / string on word level.
- Parameters
corpus (str / List[str]) –
isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.
window_size (int, (default=10)) – window size for each word.
aggregation (Callable, optional (default=numpy.mean)) – Aggregation method for Doc2Vec.
soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the nearest word by JaroWinkler ratio; if False, it will return an embedding full of zeros for that word.
- Returns
dict
- Return type
{‘score’}
-
sentence_level
(corpus, isi_penting: Optional[str] = None, top_k: int = 3, aggregation=<function mean>, soft: bool = False, **kwargs)[source]¶ Summarize list of strings / string on sentence level.
- Parameters
corpus (str / List[str]) –
isi_penting (str, optional (default=None)) – if not None, will put priority based on isi_penting.
top_k (int, (default=3)) – number of summarized strings.
aggregation (Callable, optional (default=numpy.mean)) – Aggregation method for Doc2Vec.
soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the nearest word by JaroWinkler ratio; if False, it will return an embedding full of zeros for that word.
- Returns
dict
- Return type
{‘summary’, ‘score’}
-
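Example: a sketch of extractive summarization given one of these models; the loader lives elsewhere in Malaya (assumed under malaya.summarization.extractive) and is not documented here:

>>> # `model` is an SKLearn or Doc2Vec instance from an extractive loader
>>> corpus = ['Ayat pertama tentang ekonomi.', 'Ayat kedua tentang sukan.', 'Ayat ketiga tentang politik.']
>>> model.sentence_level(corpus, isi_penting='ekonomi', top_k=2)
>>> model.word_level(corpus, window_size=10)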
malaya.model.ml¶
-
class
malaya.model.ml.
MulticlassBayes
[source]¶
-
class
malaya.model.ml.
BinaryBayes
[source]¶
-
class
malaya.model.ml.
MultilabelBayes
[source]¶
malaya.model.t5¶
-
class
malaya.model.t5.
Summarization
[source]¶ -
greedy_decoder
(strings: List[str], mode: str = 'ringkasan', postprocess: bool = True, **kwargs)[source]¶ Summarize strings. Decoder is a greedy decoder with beam width 1, alpha 0.5.
- Parameters
strings (List[str]) –
mode (str) –
mode for summarization. Allowed values:
'ringkasan' - summarization for long sentence, eg, news summarization.
'tajuk' - title summarization for long sentence, eg, news title.
postprocess (bool, optional (default=True)) – If True, will filter generated sentences using ROUGE score and remove international news publishers.
- Returns
result
- Return type
List[str]
-
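Example: a sketch of abstractive summarization; malaya.summarization.abstractive.transformer is assumed as the loader returning this class:

>>> import malaya
>>> model = malaya.summarization.abstractive.transformer(model='t5')  # assumed loader
>>> news = '...'  # a long Malay news article
>>> model.greedy_decoder([news], mode='ringkasan')  # long-form summary
>>> model.greedy_decoder([news], mode='tajuk')      # title-style summary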
-
class
malaya.model.t5.
Paraphrase
[source]¶ -
greedy_decoder
(strings: List[str], split_fullstop: bool = True)[source]¶ Paraphrase strings. Decoder is a greedy decoder with beam width 1, alpha 0.5.
- Parameters
strings (List[str]) –
split_fullstop (bool, (default=True)) – if True, will generate a paraphrase for each string split by full stop.
- Returns
result
- Return type
List[str]
-
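Example: paraphrasing follows the same pattern; malaya.paraphrase.transformer is assumed as the loader:

>>> import malaya
>>> model = malaya.paraphrase.transformer(model='t5')  # assumed loader
>>> model.greedy_decoder(['Kerajaan akan membantu rakyat yang terjejas.'], split_fullstop=True)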
malaya.model.tf¶
-
class
malaya.model.tf.
DeepLang
[source]¶
-
class
malaya.model.tf.
Translation
[source]¶
-
class
malaya.model.tf.
Constituency
[source]¶ -
vectorize
(string: str)[source]¶ vectorize a string.
- Parameters
string (str) –
- Returns
result
- Return type
np.array
-
-
class
malaya.model.tf.
TrueCase
[source]¶
-
class
malaya.model.tf.
Segmentation
[source]¶
-
class
malaya.model.tf.
Paraphrase
[source]¶ -
greedy_decoder
(strings: List[str], **kwargs)[source]¶ Paraphrase strings using greedy decoder.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
List[str]
-
malaya.model.xlnet¶
-
class
malaya.model.xlnet.
BinaryXLNET
[source]¶ -
vectorize
(strings: List[str], method: str = 'first')[source]¶ vectorize list of strings.
- Parameters
strings (List[str]) –
method (str, optional (default='first')) –
Vectorization layer supported. Allowed values:
'last' - vector from last sequence.
'first' - vector from first sequence.
'mean' - average vectors from all sequences.
'word' - average vectors based on tokens.
- Returns
result
- Return type
np.array
-
predict
(strings: List[str], add_neutral: bool = True)[source]¶ classify list of strings.
- Parameters
strings (List[str]) –
add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.
- Returns
result
- Return type
List[str]
-
predict_proba
(strings: List[str], add_neutral: bool = True)[source]¶ classify list of strings and return probability.
- Parameters
strings (List[str]) –
add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.
- Returns
result
- Return type
List[dict[str, float]]
-
predict_words
(string: str, method: str = 'last', visualization: bool = True)[source]¶ classify words.
- Parameters
string (str) –
method (str, optional (default='last')) –
Attention layer supported. Allowed values:
'last' - attention from last layer.
'first' - attention from first layer.
'mean' - average attentions from all layers.
visualization (bool, optional (default=True)) – If True, it will open the visualization dashboard.
- Returns
result
- Return type
dict
-
-
class
malaya.model.xlnet.
MulticlassXLNET
[source]¶ -
vectorize
(strings: List[str], method: str = 'first')[source]¶ vectorize list of strings.
- Parameters
strings (List[str]) –
method (str, optional (default='first')) –
Vectorization layer supported. Allowed values:
'last' - vector from last sequence.
'first' - vector from first sequence.
'mean' - average vectors from all sequences.
'word' - average vectors based on tokens.
- Returns
result
- Return type
np.array
-
predict
(strings: List[str])[source]¶ classify list of strings.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
List[str]
-
predict_proba
(strings: List[str])[source]¶ classify list of strings and return probability.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
List[dict[str, float]]
-
predict_words
(string: str, method: str = 'last', visualization: bool = True)[source]¶ classify words.
- Parameters
string (str) –
method (str, optional (default='last')) –
Attention layer supported. Allowed values:
'last' - attention from last layer.
'first' - attention from first layer.
'mean' - average attentions from all layers.
visualization (bool, optional (default=True)) – If True, it will open the visualization dashboard.
- Returns
result
- Return type
dict
-
-
class
malaya.model.xlnet.
SigmoidXLNET
[source]¶ -
vectorize
(strings: List[str], method: str = 'first')[source]¶ vectorize list of strings.
- Parameters
strings (List[str]) –
method (str, optional (default='first')) –
Vectorization layer supported. Allowed values:
'last' - vector from last sequence.
'first' - vector from first sequence.
'mean' - average vectors from all sequences.
'word' - average vectors based on tokens.
- Returns
result
- Return type
np.array
-
predict
(strings: List[str])[source]¶ classify list of strings.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
List[List[str]]
-
predict_proba
(strings: List[str])[source]¶ classify list of strings and return probability.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
List[dict[str, float]]
-
predict_words
(string: str, method: str = 'last', visualization: bool = True)[source]¶ classify words.
- Parameters
string (str) –
method (str, optional (default='last')) –
Attention layer supported. Allowed values:
'last' - attention from last layer.
'first' - attention from first layer.
'mean' - average attentions from all layers.
visualization (bool, optional (default=True)) – If True, it will open the visualization dashboard.
- Returns
result
- Return type
dict
-
-
class
malaya.model.xlnet.
SiameseXLNET
[source]¶ -
vectorize
(strings: List[str])[source]¶ Vectorize list of strings.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
np.array
-
predict_proba
(strings_left: List[str], strings_right: List[str])[source]¶ calculate similarity for two different batches of texts.
- Parameters
strings_left (List[str]) –
strings_right (List[str]) –
- Returns
result
- Return type
List[float]
-
heatmap
(strings: List[str], visualize: bool = True, annotate: bool = True, figsize: Tuple[int, int] = (7, 7))[source]¶ plot a heatmap based on output from similarity.
- Parameters
strings (list of str) – list of strings.
visualize (bool) – if True, it will render plt.show, else return data.
figsize (tuple, (default=(7, 7))) – figure size for plot.
- Returns
result – list of results
- Return type
list
-
-
class
malaya.model.xlnet.
TaggingXLNET
[source]¶ -
vectorize
(string: str)[source]¶ vectorize a string.
- Parameters
string (str) –
- Returns
result
- Return type
np.array
-
-
class
malaya.model.xlnet.
DependencyXLNET
[source]¶
-
class
malaya.model.xlnet.
ZeroshotXLNET
[source]¶ -
vectorize
(strings: List[str], labels: List[str], method: str = 'first')[source]¶ vectorize list of strings.
- Parameters
strings (List[str]) –
labels (List[str]) –
method (str, optional (default='first')) –
Vectorization layer supported. Allowed values:
'last' - vector from last sequence.
'first' - vector from first sequence.
'mean' - average vectors from all sequences.
'word' - average vectors based on tokens.
- Returns
result
- Return type
np.array
-
malaya.transformers.albert¶
-
malaya.transformers.albert.
load
(model: str = 'albert', **kwargs)[source]¶ Load albert model.
- Parameters
model (str, optional (default='albert')) –
Model architecture supported. Allowed values:
'albert' - base albert-bahasa released by Malaya.
'tiny-albert' - tiny albert-bahasa released by Malaya.
- Returns
result
- Return type
malaya.transformers.albert.Model class
-
class
malaya.transformers.albert.
Model
[source]¶ -
vectorize
(strings: List[str])[source]¶ Vectorize string inputs.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
np.array
-
attention
(strings: List[str], method: str = 'last', **kwargs)[source]¶ Get attention weights for string inputs.
- Parameters
strings (List[str]) –
method (str, optional (default='last')) –
Attention layer supported. Allowed values:
'last' - attention from last layer.
'first' - attention from first layer.
'mean' - average attentions from all layers.
- Returns
result
- Return type
List[List[Tuple[str, float]]]
-
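Example: the per-architecture loaders under malaya.transformers all follow the same pattern; for ALBERT:

>>> import malaya
>>> albert = malaya.transformers.albert.load(model='albert')
>>> albert.vectorize(['saya suka makan ayam'])
>>> albert.attention(['saya suka makan ayam'], method='last')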
malaya.transformers.alxlnet¶
-
malaya.transformers.alxlnet.
load
(model: str = 'alxlnet', pool_mode: str = 'last', **kwargs)[source]¶ Load alxlnet model.
- Parameters
model (str, optional (default='alxlnet')) –
Model architecture supported. Allowed values:
'alxlnet' - XLNET architecture from Google + Malaya.
pool_mode (str, optional (default='last')) –
Model logits architecture supported. Allowed values:
'last' - last of the sequence.
'first' - first of the sequence.
'mean' - mean of the sequence.
'attn' - attention of the sequence.
- Returns
result
- Return type
malaya.transformers.alxlnet.Model class
-
class
malaya.transformers.alxlnet.
Model
[source]¶ -
vectorize
(strings: List[str])[source]¶ Vectorize string inputs.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
np.array
-
attention
(strings: List[str], method: str = 'last', **kwargs)[source]¶ Get attention weights for string inputs.
- Parameters
strings (List[str]) –
method (str, optional (default='last')) –
Attention layer supported. Allowed values:
'last' - attention from last layer.
'first' - attention from first layer.
'mean' - average attentions from all layers.
- Returns
result
- Return type
List[List[Tuple[str, float]]]
-
malaya.transformers.bert¶
-
malaya.transformers.bert.
load
(model: str = 'base', **kwargs)[source]¶ Load bert model.
- Parameters
model (str, optional (default='base')) –
Model architecture supported. Allowed values:
'bert' - base bert-bahasa released by Malaya.
'tiny-bert' - tiny bert-bahasa released by Malaya.
- Returns
result
- Return type
malaya.transformers.bert.Model class
-
class
malaya.transformers.bert.
Model
[source]¶ -
vectorize
(strings: List[str])[source]¶ Vectorize string inputs.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
np.array
-
attention
(strings: List[str], method: str = 'last', **kwargs)[source]¶ Get attention weights for string inputs.
- Parameters
strings (List[str]) –
method (str, optional (default='last')) –
Attention layer supported. Allowed values:
'last' - attention from last layer.
'first' - attention from first layer.
'mean' - average attentions from all layers.
- Returns
result
- Return type
List[List[Tuple[str, float]]]
-
malaya.transformers.electra¶
-
malaya.transformers.electra.
load
(model: str = 'electra', **kwargs)[source]¶ Load electra model.
- Parameters
model (str, optional (default='electra')) –
Model architecture supported. Allowed values:
'electra' - base electra-bahasa released by Malaya.
'small-electra' - small electra-bahasa released by Malaya.
- Returns
result
- Return type
malaya.transformers.electra.Model class
-
class
malaya.transformers.electra.
Model
[source]¶ -
vectorize
(strings: List[str])[source]¶ Vectorize string inputs.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
np.array
-
attention
(strings: List[str], method: str = 'last', **kwargs)[source]¶ Get attention weights for string inputs.
- Parameters
strings (List[str]) –
method (str, optional (default='last')) –
Attention layer supported. Allowed values:
'last' - attention from last layer.
'first' - attention from first layer.
'mean' - average attentions from all layers.
- Returns
result
- Return type
List[List[Tuple[str, float]]]
-
malaya.transformers.gpt2¶
-
malaya.transformers.gpt2.
load
(model='345M', generate_length=100, temperature=1.0, top_k=40, **kwargs)[source]¶ Load gpt2 model.
- Parameters
model (str, optional (default='345M')) –
Model architecture supported. Allowed values:
'117M' - GPT2 117M parameters.
'345M' - GPT2 345M parameters.
generate_length (int, optional (default=100)) – length of sentence to generate.
temperature (float, optional (default=1.0)) – temperature value; should be between 0 and 1.
top_k (int, optional (default=40)) – k for top-k sampling.
- Returns
result
- Return type
malaya.transformers.gpt2.Model class
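Example: a sketch of text generation; the generate method on the returned Model is an assumption, as it is not documented here:

>>> import malaya
>>> model = malaya.transformers.gpt2.load(model='117M', generate_length=50)
>>> model.generate('ceritanya sebegini, aku bangun pagi')  # assumed method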
malaya.transformers.xlnet¶
-
malaya.transformers.xlnet.
load
(model: str = 'xlnet', pool_mode: str = 'last', **kwargs)[source]¶ Load xlnet model.
- Parameters
model (str, optional (default='xlnet')) –
Model architecture supported. Allowed values:
'xlnet' - XLNET architecture from Google.
pool_mode (str, optional (default='last')) –
Model logits architecture supported. Allowed values:
'last' - last of the sequence.
'first' - first of the sequence.
'mean' - mean of the sequence.
'attn' - attention of the sequence.
- Returns
result
- Return type
malaya.transformers.xlnet.Model class
-
class
malaya.transformers.xlnet.
Model
[source]¶ -
vectorize
(strings: List[str])[source]¶ Vectorize string inputs.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
np.array
-
attention
(strings: List[str], method: str = 'last', **kwargs)[source]¶ Get attention weights for string inputs.
- Parameters
strings (List[str]) –
method (str, optional (default='last')) –
Attention layer supported. Allowed values:
'last' - attention from last layer.
'first' - attention from first layer.
'mean' - average attentions from all layers.
- Returns
result
- Return type
List[List[Tuple[str, float]]]
-