API

malaya

malaya.augmentation

malaya.augmentation.synonym(string: str, threshold: float = 0.5, top_n=5, cleaning=<function augmentation_textcleaning>, **kwargs)[source]

Augment a string using synonyms, https://github.com/huseinzol05/Malaya-Dataset#90k-synonym

Parameters
  • string (str) –

  • threshold (float, optional (default=0.5)) – probability threshold to randomly select a word for augmentation.

  • top_n (int, (default=5)) – number of nearest neighbors returned. Length of the returned result should equal top_n.

  • cleaning (function, (default=malaya.text.function.augmentation_textcleaning)) – function to clean text.

Returns

result

Return type

List[str]
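
A minimal usage sketch; the output is illustrative, since actual augmentations depend on the synonym corpus:

import malaya

string = 'saya suka makan ayam dan ikan'
malaya.augmentation.synonym(string)
# returns up to top_n augmented variants, e.g.
# ['saya suka makan ayam dan ikan', 'saya gemar makan ayam dan ikan', ...]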

malaya.augmentation.wordvector(string: str, wordvector, threshold: float = 0.5, top_n: int = 5, soft: bool = False, cleaning=<function augmentation_textcleaning>)[source]

Augment a string using a wordvector.

Parameters
  • string (str) –

  • wordvector (object) – wordvector interface object.

  • threshold (float, optional (default=0.5)) – probability threshold to randomly select a word for augmentation.

  • soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with its nearest match by Jaro-Winkler ratio. If False, it will throw an exception if a word is not in the dictionary.

  • top_n (int, (default=5)) – number of nearest neighbors returned. Length of the returned result should equal top_n.

  • cleaning (function, (default=malaya.text.function.augmentation_textcleaning)) – function to clean text.

Returns

result

Return type

List[str]

malaya.augmentation.transformer(string: str, model, threshold: float = 0.5, top_p: float = 0.9, top_k: int = 100, temperature: float = 1.0, top_n: int = 5, cleaning=None)[source]

Augment a string using a transformer + nucleus sampling / top-k sampling.

Parameters
  • string (str) –

  • model (object) – transformer interface object. Right now only BERT, ALBERT and ELECTRA are supported.

  • threshold (float, optional (default=0.5)) – probability threshold to randomly select a word for augmentation.

  • top_p (float, optional (default=0.9)) – cumulative sum of probabilities to sample a word. If top_p is bigger than 0, the model will use nucleus sampling, else top-k sampling.

  • top_k (int, optional (default=100)) – k for top-k sampling.

  • temperature (float, optional (default=1.0)) – logits * temperature.

  • top_n (int, (default=5)) – number of nearest neighbors returned. Length of the returned result should equal top_n.

  • cleaning (function, (default=None)) – function to clean text.

Returns

result

Return type

List[str]

malaya.cluster

malaya.cluster.cluster_words(list_words: List[str], lowercase: bool = False)[source]

Cluster similar words based on structure, e.g., ['mahathir mohamad', 'mahathir'] -> ['mahathir mohamad']. Complexity is O(n^2).

Parameters
  • list_words (List[str]) –

  • lowercase (bool, optional (default=False)) – if True, will group using lowercase but maintain the original form.

Returns

string

Return type

List[str]
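
A minimal sketch, grounded in the example above:

import malaya

words = ['mahathir mohamad', 'mahathir', 'najib razak', 'najib']
malaya.cluster.cluster_words(words)
# -> ['mahathir mohamad', 'najib razak']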

malaya.cluster.cluster_pos(result: List[Tuple[str, str]])[source]

cluster similar POS.

Parameters

result (List[Tuple[str, str]]) –

Returns

result

Return type

Dict[str, List[str]]

malaya.cluster.cluster_entities(result: List[Tuple[str, str]])[source]

cluster similar Entities.

Parameters

result (List[Tuple[str, str]]) –

Returns

result

Return type

Dict[str, List[str]]

malaya.cluster.cluster_tagging(result: List[Tuple[str, str]])[source]

Cluster any tagging results, as long as the data passed is [(string, label), (string, label)].

Parameters

result (List[Tuple[str, str]]) –

Returns

result

Return type

Dict[str, List[str]]

malaya.cluster.cluster_scatter(corpus: List[str], vectorizer, num_clusters: int = 5, titles: List[str] = None, colors: List[str] = None, stopwords=<function get_stopwords>, cleaning=<function simple_textcleaning>, clustering=<class 'sklearn.cluster._kmeans.KMeans'>, decomposition=<class 'sklearn.manifold._mds.MDS'>, ngram: Tuple[int, int] = (1, 3), figsize: Tuple[int, int] = (17, 9), batch_size: int = 20)[source]

Plot a scatter plot of similar text clusters.

Parameters
  • corpus (List[str]) –

  • vectorizer (class) – vectorizer class.

  • num_clusters (int, (default=5)) – size of unsupervised clusters.

  • titles (List[str], (default=None)) – list of titles; length must be the same as corpus.

  • colors (List[str], (default=None)) – list of colors; length must be the same as num_clusters.

  • stopwords (List[str], (default=malaya.text.function.get_stopwords)) – A callable that returns a List[str], a List[str], or a Tuple[str].

  • ngram (Tuple[int, int], (default=(1,3))) – n-grams size to train a corpus.

  • cleaning (function, (default=malaya.text.function.simple_textcleaning)) – function to clean the corpus.

  • batch_size (int, (default=20)) – size of strings for each vectorization and attention. Only useful if using a transformer vectorizer.

Returns

dictionary

Return type

{'X': X, 'Y': Y, 'labels': clusters, 'vector': transformed_text_clean, 'titles': titles}

malaya.cluster.cluster_dendogram(corpus: List[str], vectorizer, titles: List[str] = None, stopwords=<function get_stopwords>, cleaning=<function simple_textcleaning>, random_samples: float = 0.3, ngram: Tuple[int, int] = (1, 3), figsize: Tuple[int, int] = (17, 9), batch_size: int = 20)[source]

Plot a hierarchical dendrogram of similar texts.

Parameters
  • corpus (List[str]) –

  • vectorizer (class) – vectorizer class.

  • titles (List[str], (default=None)) – list of titles; length must be the same as corpus.

  • stopwords (List[str], (default=malaya.text.function.get_stopwords)) – A callable that returns a List[str], a List[str], or a Tuple[str].

  • cleaning (function, (default=malaya.text.function.simple_textcleaning)) – function to clean the corpus.

  • random_samples (float, (default=0.3)) – random samples from the corpus, 0.3 means 30%.

  • ngram (Tuple[int, int], (default=(1,3))) – n-grams size to train a corpus.

  • batch_size (int, (default=20)) – size of strings for each vectorization and attention. Only useful if using a transformer vectorizer.

Returns

dictionary

Return type

{'linkage_matrix': linkage_matrix, 'titles': titles}

malaya.cluster.cluster_graph(corpus: List[str], vectorizer, threshold: float = 0.9, num_clusters: int = 5, titles: List[str] = None, colors: List[str] = None, stopwords=<function get_stopwords>, ngram: Tuple[int, int] = (1, 3), cleaning=<function simple_textcleaning>, clustering=<class 'sklearn.cluster._kmeans.KMeans'>, figsize: Tuple[int, int] = (17, 9), with_labels: bool = True, batch_size: int = 20)[source]

Plot an undirected graph of similar texts.

Parameters
  • corpus (List[str]) –

  • vectorizer (class) – vectorizer class.

  • threshold (float, (default=0.9)) – 0.9 means keeping pairs with absolute Pearson correlation above 90%.

  • num_clusters (int, (default=5)) – size of unsupervised clusters.

  • titles (List[str], (default=None)) – list of titles; length must be the same as corpus.

  • stopwords (List[str], (default=malaya.text.function.get_stopwords)) – A callable that returns a List[str], a List[str], or a Tuple[str].

  • cleaning (function, (default=malaya.text.function.simple_textcleaning)) – function to clean the corpus.

  • ngram (Tuple[int, int], (default=(1,3))) – n-grams size to train a corpus.

  • batch_size (int, (default=20)) – size of strings for each vectorization and attention. Only useful if using a transformer vectorizer.

Returns

dictionary

Return type

{'G': G, 'pos': pos, 'node_colors': node_colors, 'node_labels': node_labels}

malaya.cluster.cluster_entity_linking(corpus: List[str], vectorizer, entity_model, topic_modeling_model, threshold: float = 0.3, topic_decomposition: int = 2, topic_length: int = 10, fuzzy_ratio: int = 70, accepted_entities: List[str] = ['law', 'location', 'organization', 'person', 'event'], cleaning=<function simple_textcleaning>, colors: List[str] = None, stopwords=<function get_stopwords>, max_df: float = 1.0, min_df: int = 1, ngram: Tuple[int, int] = (2, 3), figsize: Tuple[int, int] = (17, 9), batch_size: int = 20)[source]

Plot an undirected graph of the relationship between entities and topics.

Parameters
  • corpus (list or str) –

  • vectorizer (class) –

  • colors (list) – list of colors; length must be the same as the number of clusters.

  • threshold (float, (default=0.3)) – 0.3 means keeping pairs with absolute Pearson correlation above 30%.

  • topic_decomposition (int, (default=2)) – size of decomposition.

  • topic_length (int, (default=10)) – size of topic models.

  • fuzzy_ratio (int, (default=70)) – ratio threshold for fuzzywuzzy matching.

  • max_df (float, (default=1.0)) – maximum of a word selected based on document frequency.

  • min_df (int, (default=1)) – minimum of a word selected based on document frequency.

  • ngram (tuple, (default=(2,3))) – n-grams size to train a corpus.

  • cleaning (function, (default=simple_textcleaning)) – function to clean the corpus.

  • stopwords (List[str], (default=malaya.text.function.get_stopwords)) – A callable that returns a List[str], a List[str], or a Tuple[str].

Returns

dictionary

Return type

{'G': G, 'pos': pos, 'node_colors': node_colors, 'node_labels': node_labels}

malaya.constituency

malaya.constituency.available_transformer()[source]

List available transformer models.

malaya.constituency.transformer(model: str = 'xlnet', quantized: bool = False, **kwargs)[source]

Load Transformer Constituency Parsing model, transfer learning Transformer + self attentive parsing.

Parameters
  • model (str, optional (default='xlnet')) –

    Model architecture supported. Allowed values:

    • 'bert' - Google BERT BASE parameters.

    • 'tiny-bert' - Google BERT TINY parameters.

    • 'albert' - Google ALBERT BASE parameters.

    • 'tiny-albert' - Google ALBERT TINY parameters.

    • 'xlnet' - Google XLNET BASE parameters.

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model is not necessarily faster; it depends on the machine.

Returns

result

Return type

malaya.model.tf.Constituency class
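
A minimal loading sketch; the parsing method name parse_nltk_tree is an assumption based on the malaya.model.tf.Constituency class and should be checked against its reference:

import malaya

model = malaya.constituency.transformer(model='tiny-bert')
string = 'Dr Mahathir menasihati mereka supaya berhenti berehat'
tree = model.parse_nltk_tree(string)  # assumed method, returns an NLTK parse tree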

malaya.coref

malaya.coref.parse_from_dependency(models, string: str, references: List[str] = ['dia', 'itu', 'ini', 'saya', 'awak', 'kamu', 'kita', 'kami', 'mereka'], rejected_references: List[str] = ['saya', 'awak', 'kamu', 'kita', 'kami', 'mereka', 'nya'], acceptable_subjects: List[str] = ['flat', 'subj', 'nsubj', 'csubj', 'obj'], acceptable_nested_subjects: List[str] = ['compound', 'flat'], split_nya: bool = True, aggregate: Callable = <function mean>, top_k: int = 20)[source]

Apply Coreference Resolution using stacks of dependency models.

Parameters
  • models (list) – list of dependency models; each must have a vectorize method.

  • string (str) –

  • references (List[str], optional (default=['dia', 'itu', 'ini', 'saya', 'awak', 'kamu', 'kita', 'kami', 'mereka'])) – list of references.

  • rejected_references (List[str], optional (default=['saya', 'awak', 'kamu', 'kita', 'kami', 'mereka', 'nya'])) – list of rejected references during populating subjects.

  • acceptable_subjects (List[str], optional) – List of dependency labels for subjects.

  • acceptable_nested_subjects (List[str], optional) – List of dependency labels for nested subjects, e.g., syarikat (obl) facebook (compound).

  • split_nya (bool, optional (default=True)) – split 'nya', e.g., disifatkannya -> disifatkan, nya.

  • aggregate (Callable, optional (default=numpy.mean)) – Aggregate function to aggregate list of vectors from model.vectorize.

  • top_k (int, optional (default=20)) – only accept the nearest top_k to assume coherence.

Returns

result – {'text': ['Husein', 'Zolkepli', 'suka', 'makan', 'ayam', '.', 'Dia', 'pun', 'suka', 'makan', 'daging', '.'], 'coref': {6: {'index': [0, 1], 'text': ['Husein', 'Zolkepli']}}}

Return type

Dict[text, coref]
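
A minimal sketch using a single dependency model; the expected output follows the example above:

import malaya

model = malaya.dependency.transformer()
string = 'Husein Zolkepli suka makan ayam. Dia pun suka makan daging.'
malaya.coref.parse_from_dependency([model], string)
# -> {'text': [...], 'coref': {6: {'index': [0, 1], 'text': ['Husein', 'Zolkepli']}}}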

malaya.dependency

malaya.dependency.describe()[source]

Describe Dependency supported.

malaya.dependency.dependency_graph(tagging, indexing)[source]

Return helper object for dependency parser results. Only accept tagging and indexing outputs from dependency models.

malaya.dependency.available_transformer(version: str = 'v2')[source]

List available transformer dependency parsing models.

Parameters

version (str, optional (default='v2')) –

Version supported. Allowed values:

  • 'v1' - version 1, maintained for knowledge graph.

  • 'v2' - Trained on bigger dataset, better version.

malaya.dependency.transformer(version: str = 'v2', model: str = 'xlnet', quantized: bool = False, **kwargs)[source]

Load Transformer Dependency Parsing model, transfer learning Transformer + biaffine attention.

Parameters
  • version (str, optional (default='v2')) –

    Version supported. Allowed values:

    • 'v1' - version 1, maintained for knowledge graph.

    • 'v2' - Trained on bigger dataset, better version.

  • model (str, optional (default='xlnet')) –

    Model architecture supported. Allowed values:

    • 'bert' - Google BERT BASE parameters.

    • 'tiny-bert' - Google BERT TINY parameters.

    • 'albert' - Google ALBERT BASE parameters.

    • 'tiny-albert' - Google ALBERT TINY parameters.

    • 'xlnet' - Google XLNET BASE parameters.

    • 'alxlnet' - Malaya ALXLNET BASE parameters.

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if bert in model, will return malaya.model.bert.DependencyBERT.

  • if xlnet in model, will return malaya.model.xlnet.DependencyXLNET.

Return type

model

malaya.emotion

malaya.emotion.available_transformer()[source]

List available transformer emotion analysis models.

malaya.emotion.multinomial(**kwargs)[source]

Load multinomial emotion model.

Returns

result

Return type

malaya.model.ml.MulticlassBayes class

malaya.emotion.transformer(model: str = 'xlnet', quantized: bool = False, **kwargs)[source]

Load Transformer emotion model.

Parameters
  • model (str, optional (default='xlnet')) –

    Model architecture supported. Allowed values:

    • 'bert' - Google BERT BASE parameters.

    • 'tiny-bert' - Google BERT TINY parameters.

    • 'albert' - Google ALBERT BASE parameters.

    • 'tiny-albert' - Google ALBERT TINY parameters.

    • 'xlnet' - Google XLNET BASE parameters.

    • 'alxlnet' - Malaya ALXLNET BASE parameters.

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if bert in model, will return malaya.model.bert.MulticlassBERT.

  • if xlnet in model, will return malaya.model.xlnet.MulticlassXLNET.

Return type

model
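
A minimal sketch; predict and predict_proba are assumptions based on Malaya classification models, so check the returned class reference:

import malaya

model = malaya.emotion.transformer(model='tiny-bert')
model.predict(['saya sangat gembira hari ini'])        # assumed method
model.predict_proba(['saya sangat gembira hari ini'])  # assumed method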

malaya.entity

malaya.entity.describe()[source]

Describe Entities supported.

malaya.entity.describe_ontonotes5()[source]

Describe OntoNotes5 Entities supported. https://spacy.io/api/annotation#named-entities

malaya.entity.available_transformer()[source]

List available transformer Entity Tagging models.

malaya.entity.available_transformer_ontonotes5()[source]

List available transformer Entity Tagging models trained on Ontonotes 5 Bahasa.

malaya.entity.transformer(model: str = 'xlnet', quantized: bool = False, **kwargs)[source]

Load Transformer Entity Tagging model, transfer learning Transformer + CRF.

Parameters
  • model (str, optional (default='xlnet')) –

    Model architecture supported. Allowed values:

    • 'bert' - Google BERT BASE parameters.

    • 'tiny-bert' - Google BERT TINY parameters.

    • 'albert' - Google ALBERT BASE parameters.

    • 'tiny-albert' - Google ALBERT TINY parameters.

    • 'xlnet' - Google XLNET BASE parameters.

    • 'alxlnet' - Malaya ALXLNET BASE parameters.

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if bert in model, will return malaya.model.bert.TaggingBERT.

  • if xlnet in model, will return malaya.model.xlnet.TaggingXLNET.

Return type

model

malaya.entity.transformer_ontonotes5(model: str = 'xlnet', quantized: bool = False, **kwargs)[source]

Load Transformer Entity Tagging model trained on Ontonotes 5 Bahasa, transfer learning Transformer + CRF.

Parameters
  • model (str, optional (default='xlnet')) –

    Model architecture supported. Allowed values:

    • 'bert' - Google BERT BASE parameters.

    • 'tiny-bert' - Google BERT TINY parameters.

    • 'albert' - Google ALBERT BASE parameters.

    • 'tiny-albert' - Google ALBERT TINY parameters.

    • 'xlnet' - Google XLNET BASE parameters.

    • 'alxlnet' - Malaya ALXLNET BASE parameters.

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if bert in model, will return malaya.model.bert.TaggingBERT.

  • if xlnet in model, will return malaya.model.xlnet.TaggingXLNET.

Return type

model

malaya.entity.general_entity(model=None)[source]

Load Regex-based general entity tagging along with another supervised entity tagging model.

Parameters

model (object) – model must have a predict method. Make sure the predict method returns [(string, label), (string, label)].

Returns

result

Return type

malaya.text.entity.EntityRegex class
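
A minimal sketch combining a supervised model with the Regex-based tagger; the predict method on the returned EntityRegex object is an assumption:

import malaya

entity_model = malaya.entity.transformer(model='albert')
general = malaya.entity.general_entity(model=entity_model)
general.predict('Husein baca buku Perlembagaan pada 23 Jun 2009')  # assumed method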

malaya.generator

malaya.generator.ngrams(sequence, n: int, pad_left=False, pad_right=False, left_pad_symbol=None, right_pad_symbol=None)[source]

Generate ngrams.

Parameters
  • sequence (List[str]) – list of tokenized words.

  • n (int) – ngram size

Returns

result

Return type

List[Tuple[str, str]]
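
A minimal sketch; list() is used in case the function returns a generator:

import malaya

tokens = ['saya', 'suka', 'makan', 'ayam']
list(malaya.generator.ngrams(tokens, 2))
# -> [('saya', 'suka'), ('suka', 'makan'), ('makan', 'ayam')]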

malaya.generator.pos_entities_ngram(result_pos: List[Tuple[str, str]], result_entities: List[Tuple[str, str]], ngram: Tuple[int, int] = (1, 3), accept_pos: List[str] = ['NOUN', 'PROPN', 'VERB'], accept_entities: List[str] = ['law', 'location', 'organization', 'person', 'time'])[source]

Generate ngrams.

Parameters
  • result_pos (List[Tuple[str, str]]) – result from POS recognition.

  • result_entities (List[Tuple[str, str]]) – result of Entities recognition.

  • ngram (Tuple[int, int]) – ngram sizes.

  • accept_pos (List[str]) – accepted POS elements.

  • accept_entities (List[str]) – accepted entities elements.

Returns

result

Return type

list

malaya.generator.sentence_ngram(sentence: str, ngram: Tuple[int, int] = (1, 3))[source]

Generate ngrams for a text.

Parameters
  • sentence (str) –

  • ngram (tuple) – ngram sizes.

Returns

result

Return type

list

malaya.generator.shortform(word: str, augment_vowel: bool = True, augment_consonant: bool = True, prob_delete_vowel: float = 0.5, **kwargs)[source]

Augment a formal word into social-media form: purposely introduce typos, purposely delete some vowels, and purposely replace some subwords with slang subwords.

Parameters
  • word (str) –

  • augment_vowel (bool, (default=True)) – if True, will augment vowels for each sample generated.

  • augment_consonant (bool, (default=True)) – if True, will augment consonants for each sample generated.

  • prob_delete_vowel (float, (default=0.5)) – probability to delete a vowel.

Returns

result

Return type

list

malaya.generator.babble(string: str, model, generate_length: int = 30, leed_out_len: int = 1, temperature: float = 1.0, top_k: int = 100, burnin: int = 15, batch_size: int = 5)[source]

Use pretrained transformer models to generate a string given a prefix string. https://github.com/nyu-dl/bert-gen, https://arxiv.org/abs/1902.04094

Parameters
  • string (str) –

  • model (object) – transformer interface object. Right now only BERT and ALBERT are supported.

  • generate_length (int, optional (default=30)) – length of sentence to generate.

  • leed_out_len (int, optional (default=1)) – length of extra masks for each iteration.

  • temperature (float, optional (default=1.0)) – logits * temperature.

  • top_k (int, optional (default=100)) – k for top-k sampling.

  • burnin (int, optional (default=15)) – for the first burnin steps, sample from the entire next word distribution, instead of top_k.

  • batch_size (int, optional (default=5)) – generate sentences size of batch_size.

Returns

result

Return type

List[str]

malaya.generator.available_gpt2()[source]

List available gpt2 generator models.

malaya.generator.gpt2(model: str = '345M', generate_length: int = 256, temperature: float = 1.0, top_k: int = 40, **kwargs)[source]

Load GPT2 model to generate a string given a prefix string.

Parameters
  • model (str, optional (default='345M')) –

    Model architecture supported. Allowed values:

    • '117M' - GPT2 117M parameters.

    • '345M' - GPT2 345M parameters.

  • generate_length (int, optional (default=256)) – length of sentence to generate.

  • temperature (float, optional (default=1.0)) – temperature value; should be between 0 and 1.

  • top_k (int, optional (default=40)) – k for top-k sampling selection.

Returns

result

Return type

malaya.transformers.gpt2.Model class
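
A minimal sketch; the generate method name is an assumption for malaya.transformers.gpt2.Model:

import malaya

model = malaya.generator.gpt2(model='117M')
model.generate('ceritanya sebegini, aku bangun pagi')  # assumed method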

malaya.generator.available_transformer()[source]

List available transformer models.

malaya.generator.transformer(model: str = 't5', quantized: bool = False, **kwargs)[source]

Load Transformer model to generate a string given an isu penting (important issue).

Parameters
  • model (str, optional (default='t5')) –

    Model architecture supported. Allowed values:

    • 't5' - T5 BASE parameters.

    • 'small-t5' - T5 SMALL parameters.

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if t5 in model, will return malaya.model.t5.Generator.

Return type

model

malaya.keyword_extraction

malaya.keyword_extraction.rake(string: str, model=None, vectorizer=None, top_k: int = 5, atleast: int = 1, stopwords=<function get_stopwords>, **kwargs)[source]

Extract keywords using Rake algorithm.

Parameters
  • string (str) –

  • model (Object, optional (default=None)) – Transformer model or any model that has an attention method.

  • vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or malaya.text.vectorizer.SkipGramCountVectorizer. If None, will generate ngrams automatically based on stopwords.

  • top_k (int, optional (default=5)) – return top-k results.

  • atleast (int, optional (default=1)) – minimum count of appearances in the string to accept as a candidate.

  • stopwords (List[str], (default=malaya.text.function.get_stopwords)) – A callable that returns a List[str], a List[str], or a Tuple[str]; used by the automatic ngram generator.

Returns

result

Return type

Tuple[float, str]
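
A minimal sketch returning the top-k (score, keyword) tuples:

import malaya

string = 'Perdana Menteri berkata kerajaan akan terus fokus kepada pemulihan ekonomi negara'
malaya.keyword_extraction.rake(string, top_k=3)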

malaya.keyword_extraction.textrank(string: str, model=None, vectorizer=None, top_k: int = 5, atleast: int = 1, stopwords=<function get_stopwords>, **kwargs)[source]

Extract keywords using Textrank algorithm.

Parameters
  • string (str) –

  • model (Object, optional (default=None)) – model that has a fit_transform or vectorize method.

  • vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or malaya.text.vectorizer.SkipGramCountVectorizer. If None, will generate ngrams automatically based on stopwords.

  • top_k (int, optional (default=5)) – return top-k results.

  • atleast (int, optional (default=1)) – minimum count of appearances in the string to accept as a candidate.

  • stopwords (List[str], (default=malaya.text.function.get_stopwords)) – A callable that returns a List[str], a List[str], or a Tuple[str].

Returns

result

Return type

Tuple[float, str]

malaya.keyword_extraction.attention(string: str, model, vectorizer=None, top_k: int = 5, atleast: int = 1, stopwords=<function get_stopwords>, **kwargs)[source]

Extract keywords using Attention mechanism.

Parameters
  • string (str) –

  • model (Object) – Transformer model or any model that has an attention method.

  • vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or malaya.text.vectorizer.SkipGramCountVectorizer. If None, will generate ngrams automatically based on stopwords.

  • top_k (int, optional (default=5)) – return top-k results.

  • atleast (int, optional (default=1)) – minimum count of appearances in the string to accept as a candidate.

  • stopwords (List[str], (default=malaya.text.function.get_stopwords)) – A callable that returns a List[str], a List[str], or a Tuple[str].

Returns

result

Return type

Tuple[float, str]

malaya.keyword_extraction.similarity(string: str, model, vectorizer=None, top_k: int = 5, atleast: int = 1, stopwords=<function get_stopwords>, **kwargs)[source]

Extract keywords using sentence embedding vs keyword embedding similarity.

Parameters
  • string (str) –

  • model (Object) – Transformer model or any model that has a vectorize method.

  • vectorizer (Object, optional (default=None)) – Prefer sklearn.feature_extraction.text.CountVectorizer or malaya.text.vectorizer.SkipGramCountVectorizer. If None, will generate ngrams automatically based on stopwords.

  • top_k (int, optional (default=5)) – return top-k results.

  • atleast (int, optional (default=1)) – minimum count of appearances in the string to accept as a candidate.

  • stopwords (List[str], (default=malaya.text.function.get_stopwords)) – A callable that returns a List[str], a List[str], or a Tuple[str].

Returns

result

Return type

Tuple[float, str]

malaya.keyword_extraction.available_transformer()[source]

List available transformer keyword similarity models.

malaya.keyword_extraction.transformer(model: str = 'bert', quantized: bool = False, **kwargs)[source]

Load Transformer keyword similarity model.

Parameters
  • model (str, optional (default='bert')) –

    Model architecture supported. Allowed values:

    • 'bert' - Google BERT BASE parameters.

    • 'tiny-bert' - Google BERT TINY parameters.

    • 'xlnet' - Google XLNET BASE parameters.

    • 'alxlnet' - Malaya ALXLNET BASE parameters.

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if bert in model, will return malaya.model.bert.KeyphraseBERT.

  • if xlnet in model, will return malaya.model.xlnet.KeyphraseXLNET.

Return type

model

malaya.knowledge_graph

malaya.knowledge_graph.parse_from_dependency(tagging: List[Tuple[str, str]], indexing: List[Tuple[str, str]], subjects: List[List[str]] = [['flat', 'subj', 'nsubj', 'csubj']], relations: List[List[str]] = [['acl', 'xcomp', 'ccomp', 'obj', 'conj', 'advcl'], ['obj']], objects: List[List[str]] = [['obj', 'compound', 'flat', 'nmod', 'obl']], get_networkx: bool = True)[source]

Generate knowledge graphs from dependency parsing; we suggest using dependency parsing v1.

Parameters
  • tagging (List[Tuple[str, str]]) – tagging result from dependency model.

  • indexing (List[Tuple[str, str]]) – indexing result from dependency model.

  • subjects (List[List[str]], optional) – List of dependency labels for subjects.

  • relations (List[List[str]], optional) – List of dependency labels for relations.

  • objects (List[List[str]], optional) – List of dependency labels for objects.

  • get_networkx (bool, optional (default=True)) – If True, will generate networkx.MultiDiGraph.

Returns

result

Return type

Dict[result, G]
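
A minimal sketch; the unpacking of model.predict into (d_object, tagging, indexing) is an assumption based on the dependency models above, so verify it against malaya.dependency:

import malaya

model = malaya.dependency.transformer(version='v1')
string = 'Dr Mahathir menasihati mereka supaya berhenti berehat'
d_object, tagging, indexing = model.predict(string)  # assumed return shape
r = malaya.knowledge_graph.parse_from_dependency(tagging, indexing)
# r['G'] is a networkx.MultiDiGraph when get_networkx=True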

malaya.knowledge_graph.available_transformer()[source]

List available transformer models.

malaya.knowledge_graph.transformer(model: str = 'base', quantized: bool = False, **kwargs)[source]

Load transformer to generate knowledge graphs in triplet format from texts, MS text -> EN triplet format.

Parameters
  • model (str, optional (default='base')) –

    Model architecture supported. Allowed values:

    • 'base' - Transformer BASE parameters.

    • 'large' - Transformer LARGE parameters.

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model is not necessarily faster; it depends on the machine.

Returns

result

Return type

malaya.model.tf.KnowledgeGraph class

malaya.language_detection

malaya.language_detection.fasttext(quantized: bool = True, **kwargs)[source]

Load Fasttext language detection model. Original size is 353MB, Quantized size 31.1MB.

Parameters

quantized (bool, optional (default=True)) – if True, load quantized fasttext model. Else, load original fasttext model.

Returns

result

Return type

malaya.model.ml.LanguageDetection class
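
A minimal sketch; predict is an assumption based on malaya.model.ml.LanguageDetection:

import malaya

model = malaya.language_detection.fasttext()
model.predict(['suka makan ayam dan daging'])  # assumed method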

malaya.language_detection.deep_model(quantized: bool = False, **kwargs)[source]

Load deep learning language detection model. Original size is 51.2MB, Quantized size 12.8MB.

Parameters

quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model is not necessarily faster; it depends on the machine.

Returns

result

Return type

malaya.model.tf.DeepLang class

malaya.lexicon

malaya.lexicon.random_walk(lexicon: Dict[str, List[str]], wordvector, pool_size: int = 10, top_n: int = 20, similarity_power: float = 10.0, beta: float = 0.9, arccos: bool = True, normalization: bool = True, soft: bool = False, silent: bool = False)[source]

Induce a lexicon using the random walk technique used in the paper https://arxiv.org/pdf/1606.02820.pdf

Parameters
  • lexicon (Dict[str, List[str]]) – curated lexicon from expert domain, {'label1': [str], 'label2': [str]}.

  • wordvector (object) – wordvector interface object.

  • pool_size (int, optional (default=10)) – pick top pool_size from each lexicon.

  • top_n (int, optional (default=20)) – top_n for each vector will be multiplied by similarity_power.

  • similarity_power (float, optional (default=10.0)) – extra score for top_n; a lower value induces less bias but has a higher chance of an unbalanced outcome.

  • beta (float, optional (default=0.9)) – penalty score; towards 1.0 means less penalty. 0 < beta < 1.

  • arccos (bool, optional (default=True)) – covariance distribution for embedded.dot(embedded.T). If False, covariance + 1.

  • normalization (bool, optional (default=True)) – normalize word vectors using L2 norm. L2 is good to penalize skewed vectors.

  • soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with its nearest match by Jaro-Winkler ratio. If False, it will throw an exception if a word is not in the dictionary.

  • silent (bool, optional (default=False)) – if True, will not print any logs.

Returns

result

Return type

tuple(labels[argmax(scores), axis = 1], scores, labels)

malaya.lexicon.propagate_probabilistic(lexicon: Dict[str, List[str]], wordvector, pool_size: int = 10, top_n: int = 20, similarity_power: float = 10.0, arccos: bool = True, normalization: bool = True, soft: bool = False, silent: bool = False)[source]

Learns polarity scores via standard label propagation from lexicon sets.

Parameters
  • lexicon (Dict[str, List[str]]) – curated lexicon from expert domain, {'label1': [str], 'label2': [str]}.

  • wordvector (object) – wordvector interface object.

  • pool_size (int, optional (default=10)) – pick top pool_size from each lexicon.

  • top_n (int, optional (default=20)) – top_n for each vector will be multiplied by similarity_power.

  • similarity_power (float, optional (default=10.0)) – extra score for top_n; a lower value induces less bias but has a higher chance of an unbalanced outcome.

  • arccos (bool, optional (default=True)) – covariance distribution for embedded.dot(embedded.T). If False, covariance + 1.

  • normalization (bool, optional (default=True)) – normalize word vectors using L2 norm. L2 is good to penalize skewed vectors.

  • soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with its nearest match by Jaro-Winkler ratio. If False, it will throw an exception if a word is not in the dictionary.

  • silent (bool, optional (default=False)) – if True, will not print any logs.

Returns

result

Return type

tuple(labels[argmax(scores), axis = 1], scores, labels)

malaya.lexicon.propagate_graph(lexicon: Dict[str, List[str]], wordvector, pool_size: int = 10, top_n: int = 20, similarity_power: float = 10.0, normalization: bool = True, soft: bool = False, silent: bool = False)[source]

Graph propagation method adapted from Velikovich, Leonid, et al. "The viability of web-derived polarity lexicons." http://www.aclweb.org/anthology/N10-1119

Parameters
  • lexicon (Dict[str, List[str]]) – curated lexicon from expert domain, {'label1': [str], 'label2': [str]}.

  • wordvector (object) – wordvector interface object.

  • pool_size (int, optional (default=10)) – pick top pool_size from each lexicon.

  • top_n (int, optional (default=20)) – top_n for each vector will be multiplied by similarity_power.

  • similarity_power (float, optional (default=10.0)) – extra score for top_n; a lower value induces less bias but has a higher chance of an unbalanced outcome.

  • normalization (bool, optional (default=True)) – normalize word vectors using L2 norm. L2 is good to penalize skewed vectors.

  • soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with its nearest match by Jaro-Winkler ratio. If False, it will throw an exception if a word is not in the dictionary.

  • silent (bool, optional (default=False)) – if True, will not print any logs.

Returns

result

Return type

tuple(labels[argmax(scores), axis = 1], scores, labels)

malaya.normalize

malaya.normalize.normalizer(speller=None, **kwargs)[source]

Load a Normalizer using any spelling correction model.

Parameters

speller (spelling correction object, optional (default = None)) –

Returns

result

Return type

malaya.normalize.Normalizer class

class malaya.normalize.Normalizer[source]
normalize(string: str, check_english: bool = True, normalize_text: bool = True, normalize_entity: bool = True, normalize_url: bool = False, normalize_email: bool = False, normalize_year: bool = True, normalize_telephone: bool = True, logging: bool = False)[source]

Normalize a string.

Parameters
  • string (str) –

  • check_english (bool, (default=True)) – check a word in the English dictionary.

  • normalize_text (bool, (default=True)) – if True, will try to replace shortforms using an internal corpus.

  • normalize_entity (bool, (default=True)) – normalize entities; only affects date, datetime, time and money pattern strings.

  • normalize_url (bool, (default=False)) – if True, replace :// with empty and . with dot. https://huseinhouse.com -> https huseinhouse dot com.

  • normalize_email (bool, (default=False)) – if True, replace @ with di, . with dot. husein.zol05@gmail.com -> husein dot zol kosong lima di gmail dot com.

  • normalize_year (bool, (default=True)) – if True, tahun 1987 -> tahun sembilan belas lapan puluh tujuh. if True, 1970-an -> sembilan belas tujuh puluh an. if False, tahun 1987 -> tahun seribu sembilan ratus lapan puluh tujuh.

  • normalize_telephone (bool, (default=True)) – if True, no 012-1234567 -> no kosong satu dua, satu dua tiga empat lima enam tujuh

  • logging (bool, (default=False)) – if True, will log index and token queue using logging.warn.

Returns

result – normalized string

Return type

str
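
A minimal sketch pairing the Normalizer with a spelling corrector:

import malaya

corrector = malaya.spell.probability()
normalizer = malaya.normalize.normalizer(corrector)
normalizer.normalize('xjdi ke, y u xsuke makan kt situ')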

malaya.nsfw

malaya.nsfw.lexicon(**kwargs)[source]

Load Lexicon NSFW model.

Returns

result

Return type

malaya.text.lexicon.nsfw.Lexicon class

malaya.nsfw.multinomial(**kwargs)[source]

Load multinomial NSFW model.

Returns

result

Return type

malaya.model.ml.BAYES class

malaya.num2word

malaya.num2word.to_cardinal(number)[source]

Translate from number input to cardinal text representation

Parameters

number (real number) –

Returns

result – cardinal representation

Return type

str

malaya.num2word.to_ordinal(number)[source]

Translate from number input to ordinal text representation

Parameters

number (real number) –

Returns

result – ordinal representation

Return type

str

malaya.num2word.to_ordinal_num(number)[source]

Translate from number input to ordinal numbering text representation

Parameters

number (int) –

Returns

result – ordinal numbering representation

Return type

str

malaya.num2word.to_currency(value)[source]

Translate from number input to cardinal currency text representation

Parameters

value (int) –

Returns

result – cardinal currency representation

Return type

str

malaya.num2word.to_year(value)[source]

Translate from number input to cardinal year text representation

Parameters

value (int) –

Returns

result – cardinal year representation

Return type

str
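
A minimal sketch; the outputs in the comments are illustrative:

import malaya

malaya.num2word.to_cardinal(123)    # 'seratus dua puluh tiga'
malaya.num2word.to_ordinal(11)      # 'kesebelas'
malaya.num2word.to_ordinal_num(11)  # ordinal numbering form, e.g. 'ke-11'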

malaya.paraphrase

malaya.paraphrase.available_transformer()[source]

List available transformer models.

malaya.paraphrase.transformer(model: str = 't2t', quantized: bool = False, **kwargs)[source]

Load Malaya transformer encoder-decoder model to generate a paraphrase given a string.

Parameters
  • model (str, optional (default='t2t')) –

    Model architecture supported. Allowed values:

    • 't2t' - Malaya Transformer BASE parameters.

    • 'small-t2t' - Malaya Transformer SMALL parameters.

    • 't5' - T5 BASE parameters.

    • 'small-t5' - T5 SMALL parameters.

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if t2t in model, will return malaya.model.tf.Paraphrase.

  • if t5 in model, will return malaya.model.t5.Paraphrase.

Return type

model

malaya.pos

malaya.pos.describe()[source]

Describe Part-Of-Speech supported.

malaya.pos.available_transformer()[source]

List available transformer Part-Of-Speech Tagging models.

malaya.pos.naive(string: str)[source]

Recognize POS in a string using Regex.

Parameters

string (str) –

Returns

string

Return type

List[Tuple[str, str]]

malaya.pos.transformer(model: str = 'xlnet', quantized: bool = False, **kwargs)[source]

Load Transformer POS Tagging model, transfer learning Transformer + CRF.

Parameters
  • model (str, optional (default='xlnet')) –

    Model architecture supported. Allowed values:

    • 'bert' - Google BERT BASE parameters.

    • 'tiny-bert' - Google BERT TINY parameters.

    • 'albert' - Google ALBERT BASE parameters.

    • 'tiny-albert' - Google ALBERT TINY parameters.

    • 'xlnet' - Google XLNET BASE parameters.

    • 'alxlnet' - Malaya ALXLNET BASE parameters.

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if bert in model, will return malaya.model.bert.TaggingBERT.

  • if xlnet in model, will return malaya.model.xlnet.TaggingXLNET.

Return type

model

malaya.preprocessing

malaya.preprocessing.unpack_english_contractions(text)[source]

Replace English contractions in text str with their unshortened forms. N.B. the "'d" and "'s" forms are ambiguous (had/would, is/has/possessive), so they are left as-is. Important note: the function is taken from textacy (https://github.com/chartbeat-labs/textacy).

malaya.preprocessing.preprocessing(normalize: List[str] = ['url', 'email', 'percent', 'money', 'phone', 'user', 'time', 'date', 'number'], annotate: List[str] = ['allcaps', 'elongated', 'repeated', 'emphasis', 'censored', 'hashtag'], lowercase: bool = True, fix_unidecode: bool = True, expand_english_contractions: bool = True, translate_english_to_bm: bool = True, speller=None, segmenter=None, stemmer=None, **kwargs)[source]

Load Preprocessing class.

Parameters
  • normalize (list) – normalizing tokens, can check all supported normalizing at malaya.preprocessing.get_normalize().

  • annotate (list) – annotate tokens <open></open>; only accepts ['hashtag', 'allcaps', 'elongated', 'repeated', 'emphasis', 'censored'].

  • lowercase (bool) –

  • fix_unidecode (bool) –

  • expand_english_contractions (bool) – expand English contractions.

  • translate_english_to_bm (bool) – translate English words to Bahasa Malaysia words.

  • speller (object) – spelling correction object; must have a correct method.

  • segmenter (object) – segmentation object; must have a segment method. If provided, it will expand hashtags, e.g., #mondayblues -> monday blues.

  • stemmer (object) – stemmer object; must have a stem method. If provided, it will stem or lemmatize the string.

Returns

result

Return type

malaya.preprocessing.Preprocessing class
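
A minimal sketch; the process method on the returned Preprocessing object is an assumption, so check the class reference:

import malaya

p = malaya.preprocessing.preprocessing()
p.process('CANT WAIT for the new season of #mahathirmohamad')  # assumed method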

class malaya.preprocessing.Tokenizer[source]
tokenize(text)[source]

Tokenize string.

Parameters

text (str) –

Returns

result

Return type

List[str]

class malaya.preprocessing.Preprocessing[source]

malaya.qa

malaya.qa.available_transformer_squad()[source]

List available Transformer Span models.

malaya.qa.transformer_squad(model: str = 'xlnet', quantized: bool = False, **kwargs)[source]

Load Transformer Span model trained on SQUAD V2 dataset.

Parameters
  • model (str, optional (default='xlnet')) –

    Model architecture supported. Allowed values:

    • 'bert' - Google BERT BASE parameters.

    • 'tiny-bert' - Google BERT TINY parameters.

    • 'albert' - Google ALBERT BASE parameters.

    • 'tiny-albert' - Google ALBERT TINY parameters.

    • 'xlnet' - Google XLNET BASE parameters.

    • 'alxlnet' - Malaya ALXLNET BASE parameters.

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model is not necessarily faster; it depends on the machine.

Returns

result

Return type

malaya.model.tf.SQUAD class

malaya.relevancy

malaya.relevancy.available_transformer()[source]

List available transformer relevancy analysis models.

malaya.relevancy.transformer(model: str = 'xlnet', quantized: bool = False, **kwargs)[source]

Load Transformer relevancy model.

Parameters
  • model (str, optional (default='xlnet')) –

    Model architecture supported. Allowed values:

    • 'bert' - Google BERT BASE parameters.

    • 'tiny-bert' - Google BERT TINY parameters.

    • 'albert' - Google ALBERT BASE parameters.

    • 'tiny-albert' - Google ALBERT TINY parameters.

    • 'xlnet' - Google XLNET BASE parameters.

    • 'alxlnet' - Malaya ALXLNET BASE parameters.

    • 'bigbird' - Google BigBird BASE parameters.

    • 'tiny-bigbird' - Malaya BigBird BASE parameters.

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if bert in model, will return malaya.model.bert.MulticlassBERT.

  • if xlnet in model, will return malaya.model.xlnet.MulticlassXLNET.

  • if bigbird in model, will return malaya.model.xlnet.MulticlassBigBird.

Return type

model

malaya.segmentation

malaya.segmentation.viterbi(max_split_length: int = 20, **kwargs)[source]

Load Segmenter class using the Viterbi algorithm.

Parameters
  • max_split_length (int, (default=20)) – maximum length of words in a sentence to segment.

  • validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.

Returns

result

Return type

malaya.segmentation.Segmenter class

malaya.segmentation.available_transformer()[source]

List available transformer models.

malaya.segmentation.transformer(model: str = 'small', quantized: bool = False, **kwargs)[source]

Load transformer encoder-decoder model to segment strings.

Parameters
  • model (str, optional (default='small')) –

    Model architecture supported. Allowed values:

    • 'small' - Transformer SMALL parameters.

    • 'base' - Transformer BASE parameters.

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model is not necessarily faster; it depends on the machine.

Returns

result

Return type

malaya.model.tf.Segmentation class

class malaya.segmentation.Segmenter[source]
segment(strings: List[str])[source]

Segment strings, e.g., 'sayasygkan negarasaya' -> 'saya sygkan negara saya'.

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]
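
A minimal sketch using the Viterbi segmenter, grounded in the example above:

import malaya

segmenter = malaya.segmentation.viterbi()
segmenter.segment(['sayasygkan negarasaya'])
# -> ['saya sygkan negara saya']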

malaya.sentiment

malaya.sentiment.available_transformer()[source]

List available transformer sentiment analysis models.

malaya.sentiment.multinomial(**kwargs)[source]

Load multinomial sentiment model.

Returns

result

Return type

malaya.model.ml.Bayes class

malaya.sentiment.transformer(model: str = 'bert', quantized: bool = False, **kwargs)[source]

Load Transformer sentiment model.

Parameters
  • model (str, optional (default='bert')) –

    Model architecture supported. Allowed values:

    • 'bert' - Google BERT BASE parameters.

    • 'tiny-bert' - Google BERT TINY parameters.

    • 'albert' - Google ALBERT BASE parameters.

    • 'tiny-albert' - Google ALBERT TINY parameters.

    • 'xlnet' - Google XLNET BASE parameters.

    • 'alxlnet' - Malaya ALXLNET BASE parameters.

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if bert in model, will return malaya.model.bert.BinaryBERT.

  • if xlnet in model, will return malaya.model.xlnet.BinaryXLNET.

Return type

model

malaya.spell

class malaya.spell.Probability(corpus, sp_tokenizer=None)[source]

The SpellCorrector extends the functionality of Peter Norvig's spell-corrector (http://norvig.com/spell-correct.html) and improves it using algorithms from Normalization of noisy texts in Malaysian online reviews (https://www.researchgate.net/publication/287050449_Normalization_of_noisy_texts_in_Malaysian_online_reviews), with added custom vowel augmentation.

P(word)[source]

Probability of word.

correct(word: str, **kwargs)[source]

Most probable spelling correction for word.

Parameters

word (str) –

Returns

result

Return type

str

correct_text(text: str)[source]

Correct all the words within a text, returning the corrected text.

Parameters

text (str) –

Returns

result

Return type

str

correct_match(match)[source]

Spell-correct word in match, and preserve proper upper/lower/title case.

correct_word(word: str)[source]

Spell-correct word in match, and preserve proper upper/lower/title case.

class malaya.spell.Symspell(model, verbosity, corpus, k=10)[source]

The SymspellCorrector extends the functionality of symspeller (https://github.com/mammothb/symspellpy) and improves it using algorithms from Normalization of noisy texts in Malaysian online reviews (https://www.researchgate.net/publication/287050449_Normalization_of_noisy_texts_in_Malaysian_online_reviews), with added custom vowel augmentation.

edit_step(word)[source]

Generate candidates given a word.

Parameters

word (str) –

Returns

result

Return type

{candidate1, candidate2}

edit_candidates(word)[source]

Generate candidates given a word.

Parameters

word (str) –

Returns

result

Return type

{candidate1, candidate2}

correct(word: str, **kwargs)[source]

Most probable spelling correction for word.

Parameters

word (str) –

Returns

result

Return type

str

correct_text(text: str)[source]

Correct all the words within a text, returning the corrected text.

Parameters

text (str) –

Returns

result

Return type

str

correct_match(match)[source]

Spell-correct word in match, and preserve proper upper/lower/title case.

malaya.spell.probability(sentence_piece: bool = False, **kwargs)[source]

Train a Probability Spell Corrector.

Parameters

sentence_piece (bool, optional (default=False)) – if True, reduce possible augmentation states using sentence piece.

Returns

result

Return type

malaya.spell.Probability class
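
A minimal sketch using the documented correct and correct_text methods:

import malaya

corrector = malaya.spell.probability()
corrector.correct('sy')                     # most probable single-word correction
corrector.correct_text('sy suke mkn ayam')  # correct every word in the text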

malaya.spell.symspell(max_edit_distance_dictionary: int = 2, prefix_length: int = 7, term_index: int = 0, count_index: int = 1, top_k: int = 10, **kwargs)[source]

Train a symspell Spell Corrector.

Returns

result

Return type

malaya.spell.Symspell class

malaya.spell.transformer(model, sentence_piece: bool = False, **kwargs)[source]

Load a Transformer Spell Corrector. Right now only BERT and ALBERT are supported.

Parameters

sentence_piece (bool, optional (default=False)) – if True, reduce possible augmentation states using sentence piece.

Returns

result

Return type

malaya.spell.Transformer class

class malaya.spell.Transformer[source]
correct(word: str, string: str, index: int = -1, batch_size: int = 20)[source]

Correct a word within a text, returning the corrected word.

correct_text(text: str, batch_size: int = 20)[source]

Correct all the words within a text, returning the corrected text.

correct_word(word: str, string: str, batch_size: int = 20)[source]

Spell-correct word in match, and preserve proper upper/lower/title case.

malaya.stack

malaya.stack.voting_stack(models, text: str)[source]

Stacking for POS, Entities and Dependency models.

Parameters
  • models (list) – list of models.

  • text (str) – string to predict.

Returns

result

Return type

list

malaya.stack.predict_stack(models, strings: List[str], aggregate: Callable = <function gmean>, **kwargs)[source]

Stacking for predictive models.

Parameters
  • models (List[Callable]) – list of models.

  • strings (List[str]) –

  • aggregate (Callable, optional (default=scipy.stats.mstats.gmean)) – Aggregate function.

Returns

result

Return type

dict
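
A minimal sketch stacking two sentiment models; any Malaya predictive models with compatible outputs should work:

import malaya

bert = malaya.sentiment.transformer(model='bert')
alxlnet = malaya.sentiment.transformer(model='alxlnet')
malaya.stack.predict_stack([bert, alxlnet], ['kerajaan sangat sayangkan rakyatnya'])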

malaya.stem

malaya.stem.naive()[source]

Load a naive stemming model that uses startswith and endswith regex patterns.

Returns

result

Return type

malaya.stem.Naive class

malaya.stem.sastrawi()[source]

Load stemming model using Sastrawi; this also includes lemmatization.

Returns

result

Return type

malaya.stem.Sastrawi class

malaya.stem.deep_model(quantized: bool = False, **kwargs)[source]

Load LSTM + Bahdanau Attention stemming model; this also includes lemmatization. Original size 41.6MB, quantized size 10.6MB.

Parameters

quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model is not necessarily faster; it depends on the machine.

Returns

result

Return type

malaya.stem.DeepStemmer class

class malaya.stem.DeepStemmer[source]
stem(string: str, beam_search: bool = False)[source]

Stem a string; this also includes lemmatization.

Parameters
  • string (str) –

  • beam_search (bool, optional (default=False)) – if True, use beam search decoder, else use greedy decoder.

Returns

result

Return type

str
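
A minimal sketch using the documented stem method:

import malaya

stemmer = malaya.stem.deep_model()
stemmer.stem('saya sangat sukakan awak')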

class malaya.stem.Sastrawi[source]
class malaya.stem.Naive[source]

malaya.subjectivity

malaya.subjectivity.available_transformer()[source]

List available transformer subjectivity analysis models.

malaya.subjectivity.multinomial(**kwargs)[source]

Load multinomial subjectivity model.

Parameters

validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.

Returns

result

Return type

malaya.model.ml.Bayes class

malaya.subjectivity.transformer(model: str = 'bert', quantized: bool = False, **kwargs)[source]

Load Transformer subjectivity model.

Parameters
  • model (str, optional (default='bert')) –

    Model architecture supported. Allowed values:

    • 'bert' - Google BERT BASE parameters.

    • 'tiny-bert' - Google BERT TINY parameters.

    • 'albert' - Google ALBERT BASE parameters.

    • 'tiny-albert' - Google ALBERT TINY parameters.

    • 'xlnet' - Google XLNET BASE parameters.

    • 'alxlnet' - Malaya ALXLNET BASE parameters.

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if bert in model, will return malaya.model.bert.BinaryBERT.

  • if xlnet in model, will return malaya.model.xlnet.BinaryXLNET.

Return type

model

malaya.tatabahasa

malaya.tatabahasa.describe()[source]

Describe supported kesalahan tatabahasa (grammatical errors). Full description at https://tatabahasabm.tripod.com/tata/salahtata.htm

malaya.tatabahasa.available_transformer()[source]

List available transformer models.

malaya.tatabahasa.transformer(model: str = 'base', quantized: bool = False, **kwargs)[source]

Load Malaya transformer encoder-decoder + tagging model to correct kesalahan tatabahasa (grammatical errors) in a text.

Parameters
  • model (str, optional (default='base')) –

    Model architecture supported. Allowed values:

    • 'small' - Malaya Transformer Tag SMALL parameters.

    • 'base' - Malaya Transformer Tag BASE parameters.

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model is not necessarily faster; it depends on the machine.

Returns

result

Return type

malaya.model.tf.Tatabahasa class

malaya.summarization.abstractive

malaya.summarization.abstractive.available_transformer()[source]

List available transformer models.

malaya.summarization.abstractive.transformer(model: str = 't2t', quantized: bool = False, **kwargs)[source]

Load Malaya transformer encoder-decoder model to generate a summary given a string.

Parameters
  • model (str, optional (default='t2t')) –

    Model architecture supported. Allowed values:

    • 't2t' - Malaya Transformer BASE parameters.

    • 'small-t2t' - Malaya Transformer SMALL parameters.

    • 't2t-distill' - Distilled Malaya Transformer BASE parameters.

    • 't5' - T5 BASE parameters.

    • 'small-t5' - T5 SMALL parameters.

    • 'bigbird' - BigBird + Pegasus BASE parameters.

    • 'small-bigbird' - BigBird + Pegasus SMALL parameters.

    • 'pegasus' - Pegasus BASE parameters.

    • 'small-pegasus' - Pegasus SMALL parameters.

  • quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if t2t in model, will return malaya.model.tf.Summarization.

  • if t5 in model, will return malaya.model.t5.Summarization.

  • if bigbird in model, will return malaya.model.bigbird.Summarization.

  • if pegasus in model, will return malaya.model.pegasus.Summarization.

Return type

model

malaya.summarization.extractive

malaya.summarization.extractive.encoder(vectorizer)[source]

Encoder interface for summarization.

Parameters

vectorizer (object) – encoder interface object, e.g., BERT, XLNET, ALBERT, ALXLNET; must have a vectorize method.

Returns

result

Return type

malaya.model.extractive_summarization.Encoder

malaya.summarization.extractive.doc2vec(wordvector)[source]

Doc2Vec interface for summarization.

Parameters

wordvector (object) – malaya.wordvector.WordVector object; must have a get_vector_by_name method.

Returns

result

Return type

malaya.model.extractive_summarization.Doc2Vec

malaya.summarization.extractive.sklearn(model, vectorizer)[source]

sklearn interface for summarization.

Parameters
  • model (object) –

    Should have fit_transform method. Commonly:

    • sklearn.decomposition.TruncatedSVD - LSA algorithm.

    • sklearn.decomposition.LatentDirichletAllocation - LDA algorithm.

  • vectorizer (object) –

    Should have fit_transform method. Commonly:

    • sklearn.feature_extraction.text.TfidfVectorizer - TFIDF algorithm.

    • sklearn.feature_extraction.text.CountVectorizer - Bag-of-Word algorithm.

    • malaya.text.vectorizer.SkipGramCountVectorizer - Skip Gram Bag-of-Word algorithm.

    • malaya.text.vectorizer.SkipGramTfidfVectorizer - Skip Gram TFIDF algorithm.

Returns

result

Return type

malaya.model.extractive_summarization.SKLearn
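
A minimal LSA-style sketch; passing constructed sklearn instances is an assumption based on the fit_transform requirement above:

   import malaya
   from sklearn.decomposition import TruncatedSVD
   from sklearn.feature_extraction.text import TfidfVectorizer

   corpus = [
       'Ayat pertama tentang ekonomi.',
       'Ayat kedua tentang politik.',
       'Ayat ketiga tentang sukan.',
   ]

   model = malaya.summarization.extractive.sklearn(
       model=TruncatedSVD(n_components=10),
       vectorizer=TfidfVectorizer(),
   )

   # sentence_level is documented under
   # malaya.model.extractive_summarization.SKLearn below
   result = model.sentence_level(corpus, top_k=2)
   print(result['summary'])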

malaya.similarity

malaya.similarity.doc2vec_wordvector(wordvector)[source]

Doc2vec interface for text similarity using Word Vector.

Parameters

wordvector (object) – malaya.wordvector.WordVector object. Should have a get_vector_by_name method.

Returns

result

Return type

malaya.similarity.Doc2VecSimilarity

malaya.similarity.doc2vec_vectorizer(vectorizer)[source]

Doc2vec interface for text similarity using Encoder model.

Parameters

vectorizer (object) – encoder interface object, e.g. BERT, XLNET. Should have a vectorize method.

Returns

result

Return type

malaya.similarity.VectorizerSimilarity

malaya.similarity.available_transformer()[source]

List available transformer similarity models.

malaya.similarity.transformer(model: str = 'bert', quantized: bool = False, **kwargs)[source]

Load Transformer similarity model.

Parameters
  • model (str, optional (default='bert')) –

    Model architecture supported. Allowed values:

    • 'bert' - Google BERT BASE parameters.

    • 'tiny-bert' - Google BERT TINY parameters.

    • 'albert' - Google ALBERT BASE parameters.

    • 'tiny-albert' - Google ALBERT TINY parameters.

    • 'xlnet' - Google XLNET BASE parameters.

    • 'alxlnet' - Malaya ALXLNET BASE parameters.

  • quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if bert in model, will return malaya.model.bert.SiameseBERT.

  • if xlnet in model, will return malaya.model.xlnet.SiameseXLNET.

Return type

model
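
A minimal sketch; the example sentences are illustrative and the exact probability depends on the model:

   import malaya

   model = malaya.similarity.transformer(model='tiny-bert')

   # probability that each left/right pair shares the same meaning, via
   # malaya.model.bert.SiameseBERT.predict_proba documented below
   print(model.predict_proba(
       ['Pemuda mogok lapar desak kerajaan'],
       ['Kerajaan didesak oleh pemuda yang mogok lapar'],
   ))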

class malaya.similarity.VectorizerSimilarity[source]
predict_proba(left_strings: List[str], right_strings: List[str], similarity: str = 'cosine')[source]

calculate similarity for two different batches of texts.

Parameters
  • left_strings (list of str) –

  • right_strings (list of str) –

  • similarity (str, optional (default='cosine')) –

    similarity supported. Allowed values:

    • 'cosine' - cosine similarity.

    • 'euclidean' - euclidean similarity.

    • 'manhattan' - manhattan similarity.

Returns

result

Return type

List[float]

heatmap(strings: List[str], similarity: str = 'cosine', visualize: bool = True, annotate: bool = True, figsize: Tuple[int, int] = (7, 7))[source]

plot a heatmap based on the similarity output.

Parameters
  • strings (list of str) – list of strings.

  • similarity (str, optional (default='cosine')) –

    similarity supported. Allowed values:

    • 'cosine' - cosine similarity.

    • 'euclidean' - euclidean similarity.

    • 'manhattan' - manhattan similarity.

  • visualize (bool) – if True, it will render plt.show, else return data.

  • figsize (tuple, (default=(7, 7))) – figure size for plot.

Returns

result – list of results

Return type

list

class malaya.similarity.Doc2VecSimilarity[source]
predict_proba(left_strings: List[str], right_strings: List[str], aggregation: Callable = <function mean>, similarity: str = 'cosine', soft: bool = False)[source]

calculate similarity for two different batches of texts.

Parameters
  • left_strings (list of str) –

  • right_strings (list of str) –

  • aggregation (Callable, optional (default=numpy.mean)) –

  • similarity (str, optional (default='cosine')) –

    similarity supported. Allowed values:

    • 'cosine' - cosine similarity.

    • 'euclidean' - euclidean similarity.

    • 'manhattan' - manhattan similarity.

  • soft (bool, optional (default=False)) – if True, a word not inside the word vector will be replaced with the nearest word; if False, it will be skipped.

Returns

result

Return type

List[float]

heatmap(strings: List[str], aggregation: Callable = <function mean>, similarity: str = 'cosine', soft: bool = False, visualize: bool = True, annotate: bool = True, figsize: Tuple[int, int] = (7, 7))[source]

plot a heatmap based on the similarity output.

Parameters
  • strings (list of str) – list of strings

  • aggregation (Callable, optional (default=numpy.mean)) –

  • similarity (str, optional (default='cosine')) –

    similarity supported. Allowed values:

    • 'cosine' - cosine similarity.

    • 'euclidean' - euclidean similarity.

    • 'manhattan' - manhattan similarity.

  • soft (bool, optional (default=False)) – if True, a word not inside the word vector will be replaced with the nearest word; if False, it will be skipped.

  • visualize (bool) – if True, it will render plt.show, else return data.

  • figsize (tuple, (default=(7, 7))) – figure size for plot.

Returns

result – list of results.

Return type

list

malaya.topic_model

malaya.topic_model.available_vectorizer()[source]

List available vectorizers for topic modeling.

malaya.topic_model.sklearn(corpus: List[str], model, vectorizer, n_topics: int, cleaning=<function simple_textcleaning>, stopwords=<function get_stopwords>, **kwargs)[source]

Train an sklearn model to do topic modelling on the given corpus / list of strings.

Parameters
  • corpus (list) –

  • model (object) –

    Should have fit_transform method. Commonly:

    • sklearn.decomposition.TruncatedSVD - LSA algorithm.

    • sklearn.decomposition.LatentDirichletAllocation - LDA algorithm.

    • sklearn.decomposition.NMF - NMF algorithm.

  • vectorizer (object) –

    Should have fit_transform method. Commonly:

    • sklearn.feature_extraction.text.TfidfVectorizer - TFIDF algorithm.

    • sklearn.feature_extraction.text.CountVectorizer - Bag-of-Word algorithm.

    • malaya.text.vectorizer.SkipGramCountVectorizer - Skip Gram Bag-of-Word algorithm.

    • malaya.text.vectorizer.SkipGramTfidfVectorizer - Skip Gram TFIDF algorithm.

  • n_topics (int) – size of decomposition column.

  • cleaning (function, (default=malaya.text.function.simple_textcleaning)) – function to clean the corpus.

  • stopwords (List[str], (default=malaya.text.function.get_stopwords)) – a callable that returns a List[str], or a List[str], or a Tuple[str].

Returns

result

Return type

malaya.topic_model.Topic class
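
A minimal LDA sketch; passing the decomposition class (rather than an instance) and letting n_topics size it is an assumption, so adjust to your installed version:

   import malaya
   from sklearn.decomposition import LatentDirichletAllocation
   from sklearn.feature_extraction.text import CountVectorizer

   corpus = [
       'kerajaan umumkan bajet baharu untuk rakyat',
       'pasukan bola sepak negara menang perlawanan semalam',
       'pelabur saham optimis dengan prospek ekonomi',
   ]

   # model passed as a class here is an assumption; n_topics controls size
   lda = malaya.topic_model.sklearn(
       corpus,
       model=LatentDirichletAllocation,
       vectorizer=CountVectorizer(ngram_range=(1, 3)),
       n_topics=2,
   )

   # top_topics is documented under malaya.topic_model.Topic below
   print(lda.top_topics(2, top_n=5, return_df=True))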

malaya.topic_model.lda2vec(corpus: List[str], vectorizer, n_topics: int = 10, cleaning=<function simple_textcleaning>, stopwords=<function get_stopwords>, window_size: int = 2, embedding_size: int = 128, epoch: int = 10, switch_loss: int = 1000, **kwargs)[source]

Train an LDA2Vec model to do topic modelling on the given corpus / list of strings.

Parameters
  • corpus (list) –

  • vectorizer (object) –

    Should have fit_transform method. Commonly:

    • sklearn.feature_extraction.text.TfidfVectorizer - TFIDF algorithm.

    • sklearn.feature_extraction.text.CountVectorizer - Bag-of-Word algorithm.

    • malaya.text.vectorizer.SkipGramCountVectorizer - Skip Gram Bag-of-Word algorithm.

    • malaya.text.vectorizer.SkipGramTfidfVectorizer - Skip Gram TFIDF algorithm.

  • n_topics (int, (default=10)) – size of decomposition column.

  • cleaning (function, (default=malaya.text.function.simple_textcleaning)) – function to clean the corpus.

  • stopwords (List[str], (default=malaya.text.function.get_stopwords)) – a callable that returns a List[str], or a List[str], or a Tuple[str].

  • embedding_size (int, (default=128)) – embedding size of lda2vec tensors.

  • epoch (int, (default=10)) – number of training iterations.

  • switch_loss (int, (default=1000)) – baseline to switch from document-based loss to document + word-based loss.

Returns

result

Return type

malaya.topic_model.DeepTopic class

malaya.topic_model.attention(corpus: List[str], n_topics: int, vectorizer, cleaning=<function simple_textcleaning>, stopwords=<function get_stopwords>, ngram: Tuple[int, int] = (1, 3), batch_size: int = 10)[source]

Use attention from a transformer model to do topic modelling on the given corpus / list of strings.

Parameters
  • corpus (list) –

  • n_topics (int) – size of decomposition column.

  • vectorizer (object) –

  • cleaning (function, (default=malaya.text.function.simple_textcleaning)) – function to clean the corpus.

  • stopwords (List[str], (default=malaya.text.function.get_stopwords)) – a callable that returns a List[str], or a List[str], or a Tuple[str].

  • ngram (tuple, (default=(1,3))) – n-grams size to train a corpus.

  • batch_size (int, (default=10)) – size of strings for each vectorization and attention.

Returns

result

Return type

malaya.topic_model.AttentionTopic class

class malaya.topic_model.AttentionTopic[source]
top_topics(len_topic: int, top_n: int = 10, return_df: bool = True)[source]

Print important topics based on decomposition.

Parameters
  • len_topic (int) – size of topics.

  • top_n (int, optional (default=10)) – top n of each topic.

  • return_df (bool, optional (default=True)) – return as pandas.DataFrame, else JSON.

get_topics(len_topic: int)[source]

Return important topics based on decomposition.

Parameters

len_topic (int) – size of topics.

Returns

result

Return type

List[str]

class malaya.topic_model.DeepTopic[source]
visualize_topics(notebook_mode: bool = False, mds: str = 'pcoa')[source]

Visualize important topics based on decomposition.

Parameters

mds (str, optional (default='pcoa')) –

2D Decomposition. Allowed values:

  • 'pcoa' - Dimension reduction via Jensen-Shannon Divergence & Principal Coordinate Analysis (aka Classical Multidimensional Scaling)

  • 'mmds' - Dimension reduction via Multidimensional scaling

  • 'tsne' - Dimension reduction via t-distributed stochastic neighbor embedding

top_topics(len_topic: int, top_n: int = 10, return_df: bool = True)[source]

Print important topics based on decomposition.

Parameters
  • len_topic (int) – size of topics.

  • top_n (int, optional (default=10)) – top n of each topic.

  • return_df (bool, optional (default=True)) – return as pandas.DataFrame, else JSON.

get_topics(len_topic: int)[source]

Return important topics based on decomposition.

Parameters

len_topic (int) – size of topics.

Returns

result

Return type

List[str]

get_sentences(len_sentence: int, k: int = 0)[source]

Return important sentences related to selected column based on decomposition.

Parameters
  • len_sentence (int) –

  • k (int, (default=0)) – index of decomposition matrix.

Returns

result

Return type

List[str]

class malaya.topic_model.Topic[source]
visualize_topics(notebook_mode: bool = False, mds: str = 'pcoa')[source]

Visualize important topics based on decomposition.

Parameters

mds (str, optional (default='pcoa')) –

2D Decomposition. Allowed values:

  • 'pcoa' - Dimension reduction via Jensen-Shannon Divergence & Principal Coordinate Analysis (aka Classical Multidimensional Scaling)

  • 'mmds' - Dimension reduction via Multidimensional scaling

  • 'tsne' - Dimension reduction via t-distributed stochastic neighbor embedding

top_topics(len_topic: int, top_n: int = 10, return_df: bool = True)[source]

Print important topics based on decomposition.

Parameters
  • len_topic (int) – size of topics.

  • top_n (int, optional (default=10)) – top n of each topic.

  • return_df (bool, optional (default=True)) – return as pandas.DataFrame, else JSON.

get_topics(len_topic: int)[source]

Return important topics based on decomposition.

Parameters

len_topic (int) – size of topics.

Returns

result

Return type

List[str]

get_sentences(len_sentence: int, k: int = 0)[source]

Return important sentences related to selected column based on decomposition.

Parameters
  • len_sentence (int) –

  • k (int, (default=0)) – index of decomposition matrix.

Returns

result

Return type

List[str]

malaya.toxicity

malaya.toxicity.available_transformer()[source]

List available transformer toxicity analysis models.

malaya.toxicity.multinomial(**kwargs)[source]

Load multinomial toxicity model.

Returns

result

Return type

malaya.model.ml.MultilabelBayes class

malaya.toxicity.transformer(model: str = 'xlnet', quantized: bool = False, **kwargs)[source]

Load Transformer toxicity model.

Parameters
  • model (str, optional (default='xlnet')) –

    Model architecture supported. Allowed values:

    • 'bert' - Google BERT BASE parameters.

    • 'tiny-bert' - Google BERT TINY parameters.

    • 'albert' - Google ALBERT BASE parameters.

    • 'tiny-albert' - Google ALBERT TINY parameters.

    • 'xlnet' - Google XLNET BASE parameters.

    • 'alxlnet' - Malaya ALXLNET BASE parameters.

  • quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if bert in model, will return malaya.model.bert.SigmoidBERT.

  • if xlnet in model, will return malaya.model.xlnet.SigmoidXLNET.

Return type

model
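
A minimal sketch; the example string is illustrative:

   import malaya

   model = malaya.toxicity.transformer(model='tiny-albert')

   # multilabel probabilities, one dict per input string, via
   # malaya.model.bert.SigmoidBERT.predict_proba documented below
   print(model.predict_proba(['bodoh betul lah komen macam ni']))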

malaya.transformer

malaya.transformer.available_transformer()[source]

List available transformer models.

malaya.transformer.load(model: str = 'electra', pool_mode: str = 'last', **kwargs)[source]

Load transformer model.

Parameters
  • model (str, optional (default='electra')) –

    Model architecture supported. Allowed values:

    • 'bert' - Google BERT BASE parameters.

    • 'tiny-bert' - Google BERT TINY parameters.

    • 'albert' - Google ALBERT BASE parameters.

    • 'tiny-albert' - Google ALBERT TINY parameters.

    • 'xlnet' - Google XLNET BASE parameters.

    • 'alxlnet' - Malaya ALXLNET BASE parameters.

    • 'electra' - Google ELECTRA BASE parameters.

    • 'small-electra' - Google ELECTRA SMALL parameters.

  • pool_mode (str, optional (default='last')) –

    Model logits architecture supported. Only usable if model in [‘xlnet’, ‘alxlnet’]. Allowed values:

    • 'last' - last of the sequence.

    • 'first' - first of the sequence.

    • 'mean' - mean of the sequence.

    • 'attn' - attention of the sequence.

Returns

result – List of model classes:

  • if bert in model, will return malaya.transformers.bert.Model.

  • if xlnet in model, will return malaya.transformers.xlnet.Model.

  • if albert in model, will return malaya.transformers.albert.Model.

  • if electra in model, will return malaya.transformers.electra.Model.

Return type

model
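
A minimal sketch of using the loaded encoder to embed strings:

   import malaya

   electra = malaya.transformer.load(model='electra')

   # one contextual vector per string, via
   # malaya.transformers.electra.Model.vectorize documented below
   vectors = electra.vectorize(['saya suka makan nasi ayam'])
   print(vectors.shape)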

malaya.translation.en_ms

malaya.translation.en_ms.available_transformer()[source]

List available transformer models.

malaya.translation.en_ms.transformer(model: str = 'base', quantized: bool = False, **kwargs)[source]

Load transformer encoder-decoder model to translate EN-to-MS.

Parameters
  • model (str, optional (default='base')) –

    Model architecture supported. Allowed values:

    • 'small' - Transformer SMALL parameters.

    • 'base' - Transformer BASE parameters.

    • 'large' - Transformer LARGE parameters.

    • 'bigbird' - BigBird BASE parameters.

    • 'small-bigbird' - BigBird SMALL parameters.

  • quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if bigbird in model, return malaya.model.bigbird.Translation.

  • else, return malaya.model.tf.Translation.

Return type

model
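
A minimal sketch; the translation itself will depend on the model chosen:

   import malaya

   model = malaya.translation.en_ms.transformer(model='base')

   # greedy_decoder is documented under malaya.model.tf.Translation below
   print(model.greedy_decoder(['I like to eat fried rice.']))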

malaya.translation.ms_en

malaya.translation.ms_en.available_transformer()[source]

List available transformer models.

malaya.translation.ms_en.transformer(model: str = 'base', quantized: bool = False, **kwargs)[source]

Load Transformer encoder-decoder model to translate MS-to-EN.

Parameters
  • model (str, optional (default='base')) –

    Model architecture supported. Allowed values:

    • 'small' - Transformer SMALL parameters.

    • 'base' - Transformer BASE parameters.

    • 'large' - Transformer LARGE parameters.

    • 'bigbird' - BigBird BASE parameters.

    • 'small-bigbird' - BigBird SMALL parameters.

  • quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if bigbird in model, return malaya.model.bigbird.Translation.

  • else, return malaya.model.tf.Translation.

Return type

model

malaya.true_case

malaya.true_case.available_transformer()[source]

List available transformer models.

malaya.true_case.transformer(model: str = 'base', quantized: bool = False, **kwargs)[source]

Load transformer encoder-decoder model to true case strings.

Parameters
  • model (str, optional (default='base')) –

    Model architecture supported. Allowed values:

    • 'small' - Transformer SMALL parameters.

    • 'base' - Transformer BASE parameters.

  • quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result

Return type

malaya.model.tf.TrueCase class
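
A minimal sketch reusing the example documented under malaya.model.tf.TrueCase below:

   import malaya

   model = malaya.true_case.transformer(model='small')

   # expected: ['Saya nak makan di US, makanan di sana sedap.']
   print(model.greedy_decoder(['saya nak makan di us makanan di sana sedap']))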

malaya.word2num

malaya.word2num.word2num(string)[source]

Translate a string to a number, e.g. 'kesepuluh' -> 10.

Parameters

string (str) –

Returns

result

Return type

int / float
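
A minimal sketch of the documented example:

   import malaya

   # 'kesepuluh' (tenth) -> 10
   print(malaya.word2num.word2num('kesepuluh'))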

malaya.wordvector

malaya.wordvector.available_wordvector()[source]

List available word vector models.

malaya.wordvector.load(model: str = 'wikipedia', **kwargs)[source]

Load pretrained word vectors, returning the vocabulary and embedding matrix for malaya.wordvector.WordVector.

Parameters

model (str, optional (default='wikipedia')) –

Model architecture supported. Allowed values:

  • 'wikipedia' - pretrained on Malay wikipedia word2vec size 256.

  • 'socialmedia' - pretrained on cleaned Malay twitter and Malay instagram size 256.

  • 'news' - pretrained on cleaned Malay news size 256.

  • 'combine' - pretrained on cleaned Malay news + Malay social media + Malay wikipedia size 256.

Returns

  • vocabulary (indices dictionary for vector.)

  • vector (np.array, 2D.)
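
A minimal sketch; the WordVector constructor argument order (embedding matrix, then vocabulary) is an assumption, so check it against your installed version:

   import malaya

   # returns the vocabulary dict and the 2D embedding matrix
   vocab, embed = malaya.wordvector.load(model='wikipedia')

   # constructor argument order here is an assumption
   wv = malaya.wordvector.WordVector(embed, vocab)

   # five nearest neighbours, via n_closest documented below
   print(wv.n_closest('najib', num_closest=5))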

class malaya.wordvector.WordVector[source]
get_vector_by_name(word: str, soft: bool = False, topn_soft: int = 5)[source]

get vector based on string.

Parameters
  • word (str) –

  • soft (bool, (default=False)) – if True, a word not in the dictionary will be replaced with the nearest word by JaroWinkler ratio; if False, it will throw an exception if a word is not in the dictionary.

  • topn_soft (int, (default=5)) – if the word is not found in the dictionary, returns topn_soft similar words using JaroWinkler ratio.

Returns

vector

Return type

np.array, 1D

tree_plot(labels, figsize: Tuple[int, int] = (7, 7), annotate: bool = True)[source]

plot a tree plot based on output from calculator / n_closest / analogy.

Parameters
  • labels (list) – output from calculator / n_closest / analogy.

  • figsize (tuple, (default=(7, 7))) – figure size for plot.

Returns

  • embed (np.array, 2D.)

  • labelled (labels for X / Y axis.)

scatter_plot(labels, centre: str = None, figsize: Tuple[int, int] = (7, 7), plus_minus: int = 25, handoff: float = 5e-05)[source]

plot a scatter plot based on output from calculator / n_closest / analogy.

Parameters
  • labels (list) – output from calculator / n_closest / analogy

  • centre (str, (default=None)) – centre label; if a str, it will be annotated in red.

  • figsize (tuple, (default=(7, 7))) – figure size for plot.

Returns

tsne

Return type

np.array, 2D.

batch_calculator(equations: List[str], num_closest: int = 5, return_similarity: bool = False)[source]

batch calculator parser for word2vec using tensorflow.

Parameters
  • equations (list of str) – e.g. ['(mahathir + najib) - rosmah']

  • num_closest (int, (default=5)) – number of words closest to the result.

Returns

word_list

Return type

list of nearest words

calculator(equation: str, num_closest: int = 5, metric: str = 'cosine', return_similarity: bool = True)[source]

calculator parser for word2vec.

Parameters
  • equation (str) – e.g. '(mahathir + najib) - rosmah'

  • num_closest (int, (default=5)) – number of words closest to the result.

  • metric (str, (default='cosine')) – vector distance algorithm.

  • return_similarity (bool, (default=True)) – if True, will return a value between 0 and 1 representing the distance.

Returns

word_list

Return type

list of nearest words

batch_n_closest(words: List[str], num_closest: int = 5, return_similarity: bool = False, soft: bool = True)[source]

find nearest words based on a batch of words using Tensorflow.

Parameters
  • words (list) – e.g. ['najib', 'anwar']

  • num_closest (int, (default=5)) – number of words closest to the result.

  • return_similarity (bool, (default=False)) – if True, will return a value between 0 and 1 representing the distance.

  • soft (bool, (default=True)) – if True, a word not in the dictionary will be replaced with the nearest word by JaroWinkler ratio; if False, it will throw an exception if a word is not in the dictionary.

Returns

word_list

Return type

list of nearest words

n_closest(word: str, num_closest: int = 5, metric: str = 'cosine', return_similarity: bool = True)[source]

find nearest words based on a word.

Parameters
  • word (str) – e.g. 'najib'

  • num_closest (int, (default=5)) – number of words closest to the result.

  • metric (str, (default='cosine')) – vector distance algorithm.

  • return_similarity (bool, (default=True)) – if True, will return a value between 0 and 1 representing the distance.

Returns

word_list

Return type

list of nearest words

analogy(a: str, b: str, c: str, num: int = 1, metric: str = 'cosine')[source]

analogy calculation, vb - va + vc.

Parameters
  • a (str) –

  • b (str) –

  • c (str) –

  • num (int, (default=1)) –

  • metric (str, (default='cosine')) – vector distance algorithm.

Returns

word_list

Return type

list of nearest words.

project_2d(start: int, end: int)[source]

project word2vec into 2d dimension.

Parameters
  • start (int) –

  • end (int) –

Returns

  • embed_2d (TSNE decomposition)

  • word_list (words in between start and end.)

network(word: str, num_closest: int = 8, depth: int = 4, min_distance: float = 0.5, iteration: int = 300, figsize: Tuple[int, int] = (15, 15), node_color: str = '#72bbd0', node_factor: int = 50)[source]

plot a social network based on the given word.

Parameters
  • word (str) – centre of social network.

  • num_closest (int, (default=8)) – number of words closest to the node.

  • depth (int, (default=4)) – depth of the social network. Deeper is more expensive to calculate, O(num_closest ** depth).

  • min_distance (float, (default=0.5)) – minimum distance among nodes. Increase the value to increase the distance among nodes.

  • iteration (int, (default=300)) – number of loops to train the social network to fit min_distance.

  • figsize (tuple, (default=(15, 15))) – figure size for plot.

  • node_color (str, (default='#72bbd0')) – color for nodes.

  • node_factor (int, (default=50)) – size factor for depth nodes. Increasing this value increases node sizes based on depth.

Returns

G

Return type

networkx graph object

malaya.zero_shot.classification

malaya.zero_shot.classification.available_transformer()[source]

List available transformer zero-shot models.

malaya.zero_shot.classification.transformer(model: str = 'bert', quantized: bool = False, **kwargs)[source]

Load Transformer zero-shot model.

Parameters
  • model (str, optional (default='bert')) –

    Model architecture supported. Allowed values:

    • 'bert' - Google BERT BASE parameters.

    • 'tiny-bert' - Google BERT TINY parameters.

    • 'albert' - Google ALBERT BASE parameters.

    • 'tiny-albert' - Google ALBERT TINY parameters.

    • 'xlnet' - Google XLNET BASE parameters.

    • 'alxlnet' - Malaya ALXLNET BASE parameters.

  • quantized (bool, optional (default=False)) – if True, will load an 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.

Returns

result – List of model classes:

  • if bert in model, will return malaya.model.bert.ZeroshotBERT.

  • if xlnet in model, will return malaya.model.xlnet.ZeroshotXLNET.

Return type

model
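
A minimal sketch; the strings and candidate labels are illustrative:

   import malaya

   model = malaya.zero_shot.classification.transformer(model='tiny-bert')

   # probability of each label per string, via
   # malaya.model.bert.ZeroshotBERT.predict_proba documented below
   print(model.predict_proba(
       ['kerajaan umumkan bantuan kewangan baharu'],
       labels=['politik', 'sukan', 'ekonomi'],
   ))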

malaya.model.bert

class malaya.model.bert.BinaryBERT[source]
vectorize(strings: List[str], method: str = 'first')[source]

vectorize list of strings.

Parameters
  • strings (List[str]) –

  • method (str, optional (default='first')) –

    Vectorization layer supported. Allowed values:

    • 'last' - vector from last sequence.

    • 'first' - vector from first sequence.

    • 'mean' - average vectors from all sequences.

    • 'word' - average vectors based on tokens.

Returns

result

Return type

np.array

predict(strings: List[str], add_neutral: bool = True)[source]

classify list of strings.

Parameters
  • strings (List[str]) –

  • add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.

Returns

result

Return type

List[str]

predict_proba(strings: List[str], add_neutral: bool = True)[source]

classify list of strings and return probability.

Parameters
  • strings (List[str]) –

  • add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.

Returns

result

Return type

List[dict[str, float]]

predict_words(string: str, method: str = 'last', visualization: bool = True)[source]

classify words.

Parameters
  • string (str) –

  • method (str, optional (default='last')) –

    Attention layer supported. Allowed values:

    • 'last' - attention from last layer.

    • 'first' - attention from first layer.

    • 'mean' - average attentions from all layers.

  • visualization (bool, optional (default=True)) – If True, it will open the visualization dashboard.

Returns

result

Return type

dict

class malaya.model.bert.MulticlassBERT[source]
vectorize(strings: List[str], method: str = 'first')[source]

vectorize list of strings.

Parameters
  • strings (List[str]) –

  • method (str, optional (default='first')) –

    Vectorization layer supported. Allowed values:

    • 'last' - vector from last sequence.

    • 'first' - vector from first sequence.

    • 'mean' - average vectors from all sequences.

    • 'word' - average vectors based on tokens.

Returns

result

Return type

np.array

predict(strings: List[str])[source]

classify list of strings.

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

predict_proba(strings: List[str])[source]

classify list of strings and return probability.

Parameters

strings (List[str]) –

Returns

result

Return type

List[dict[str, float]]

predict_words(string: str, method: str = 'last', visualization: bool = True)[source]

classify words.

Parameters
  • string (str) –

  • method (str, optional (default='last')) –

    Attention layer supported. Allowed values:

    • 'last' - attention from last layer.

    • 'first' - attention from first layer.

    • 'mean' - average attentions from all layers.

  • visualization (bool, optional (default=True)) – If True, it will open the visualization dashboard.

Returns

result

Return type

dict

class malaya.model.bert.SigmoidBERT[source]
vectorize(strings: List[str], method: str = 'first')[source]

vectorize list of strings.

Parameters
  • strings (List[str]) –

  • method (str, optional (default='first')) –

    Vectorization layer supported. Allowed values:

    • 'last' - vector from last sequence.

    • 'first' - vector from first sequence.

    • 'mean' - average vectors from all sequences.

    • 'word' - average vectors based on tokens.

Returns

result

Return type

np.array

predict(strings: List[str])[source]

classify list of strings.

Parameters

strings (List[str]) –

Returns

result

Return type

List[List[str]]

predict_proba(strings: List[str])[source]

classify list of strings and return probability.

Parameters

strings (List[str]) –

Returns

result

Return type

List[dict[str, float]]

predict_words(string: str, method: str = 'last', visualization: bool = True)[source]

classify words.

Parameters
  • string (str) –

  • method (str, optional (default='last')) –

    Attention layer supported. Allowed values:

    • 'last' - attention from last layer.

    • 'first' - attention from first layer.

    • 'mean' - average attentions from all layers.

  • visualization (bool, optional (default=True)) – If True, it will open the visualization dashboard.

Returns

result

Return type

dict

class malaya.model.bert.SiameseBERT[source]
vectorize(strings: List[str])[source]

Vectorize list of strings.

Parameters

strings (List[str]) –

Returns

result

Return type

np.array

predict_proba(strings_left: List[str], strings_right: List[str])[source]

calculate similarity for two different batches of texts.

Parameters
  • strings_left (List[str]) –

  • strings_right (List[str]) –

Returns

result

Return type

List[float]

heatmap(strings: List[str], visualize: bool = True, annotate: bool = True, figsize: Tuple[int, int] = (7, 7))[source]

plot a heatmap based on output from similarity.

Parameters
  • strings (list of str) – list of strings.

  • visualize (bool) – if True, it will render plt.show, else return data.

  • figsize (tuple, (default=(7, 7))) – figure size for plot.

Returns

result – list of results

Return type

list

class malaya.model.bert.TaggingBERT[source]
vectorize(string: str)[source]

vectorize a string.

Parameters

string (str) –

Returns

result

Return type

np.array

analyze(string: str)[source]

Analyze a string.

Parameters

string (str) –

Returns

result

Return type

{‘words’: List[str], ‘tags’: [{‘text’: ‘text’, ‘type’: ‘location’, ‘score’: 1.0, ‘beginOffset’: 0, ‘endOffset’: 1}]}

predict(string: str)[source]

Tag a string.

Parameters

string (str) –

Returns

result

Return type

Tuple[str, str]

class malaya.model.bert.DependencyBERT[source]
vectorize(string: str)[source]

vectorize a string.

Parameters

string (str) –

Returns

result

Return type

np.array

predict(string: str)[source]

Tag a string.

Parameters

string (str) –

Returns

result

Return type

Tuple

class malaya.model.bert.ZeroshotBERT[source]
vectorize(strings: List[str], labels: List[str], method: str = 'first')[source]

vectorize a string.

Parameters
  • strings (List[str]) –

  • labels (List[str]) –

  • method (str, optional (default='first')) –

    Vectorization layer supported. Allowed values:

    • 'last' - vector from last sequence.

    • 'first' - vector from first sequence.

    • 'mean' - average vectors from all sequences.

    • 'word' - average vectors based on tokens.

Returns

result

Return type

np.array

predict_proba(strings: List[str], labels: List[str])[source]

classify list of strings and return probability.

Parameters
  • strings (List[str]) –

  • labels (List[str]) –

Returns

result

Return type

List[float]

malaya.model.bigbird

class malaya.model.bigbird.MulticlassBigBird[source]
vectorize(strings: List[str], method: str = 'first')[source]

vectorize list of strings.

Parameters
  • strings (List[str]) –

  • method (str, optional (default='first')) –

    Vectorization layer supported. Allowed values:

    • 'last' - vector from last sequence.

    • 'first' - vector from first sequence.

    • 'mean' - average vectors from all sequences.

    • 'word' - average vectors based on tokens.

Returns

result

Return type

np.array

predict(strings: List[str])[source]

classify list of strings.

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

predict_proba(strings: List[str])[source]

classify list of strings and return probability.

Parameters

strings (List[str]) –

Returns

result

Return type

List[dict[str, float]]

class malaya.model.bigbird.Translation[source]
greedy_decoder(strings: List[str])[source]

translate list of strings.

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

class malaya.model.bigbird.Summarization[source]
greedy_decoder(strings: List[str], temperature=0.3, postprocess: bool = True, **kwargs)[source]

Summarize strings using greedy decoder.

Parameters
  • strings (List[str]) –

  • temperature (float, (default=0.3)) – logits * -log(random.uniform) * temperature.

  • postprocess (bool, optional (default=True)) – if True, will filter generated sentences using ROUGE score and remove international news publishers.

Returns

result

Return type

List[str]

nucleus_decoder(strings: List[str], top_p: float = 0.7, temperature: float = 0.3, postprocess: bool = True, **kwargs)[source]

Summarize strings using nucleus decoder.

Parameters
  • strings (List[str]) –

  • top_p (float, (default=0.7)) – cumulative distribution and cut off as soon as the CDF exceeds top_p.

  • temperature (float, (default=0.3)) – logits * -log(random.uniform) * temperature.

  • postprocess (bool, optional (default=True)) – if True, will filter generated sentences using ROUGE score and remove international news publishers.

Returns

result

Return type

List[str]

malaya.model.extractive_summarization

class malaya.model.extractive_summarization.SKLearn[source]
word_level(corpus, isi_penting: str = None, window_size: int = 10, important_words: int = 10, **kwargs)[source]

Summarize list of strings / string on word level.

Parameters
  • corpus (str / List[str]) –

  • isi_penting (str, optional (default=None)) – if not None, the summary will prioritize content related to isi_penting (important points).

  • window_size (int, (default=10)) – window size for each word.

  • important_words (int, (default=10)) – number of important words.

Returns

dict

Return type

{‘top-words’, ‘cluster-top-words’, ‘score’}

sentence_level(corpus, isi_penting: str = None, top_k: int = 3, important_words: int = 10, **kwargs)[source]

Summarize list of strings / string on sentence level.

Parameters
  • corpus (str / List[str]) –

  • isi_penting (str, optional (default=None)) – if not None, the summary will prioritize content related to isi_penting.

  • top_k (int, (default=3)) – number of summarized strings.

  • important_words (int, (default=10)) – number of important words.

Returns

dict

Return type

{‘summary’, ‘top-words’, ‘cluster-top-words’, ‘score’}

class malaya.model.extractive_summarization.Doc2Vec[source]
word_level(corpus, isi_penting: str = None, window_size: int = 10, aggregation=<function mean>, soft: bool = False, **kwargs)[source]

Summarize list of strings / string on word level.

Parameters
  • corpus (str / List[str]) –

  • isi_penting (str, optional (default=None)) – if not None, the summary will prioritize content related to isi_penting.

  • window_size (int, (default=10)) – window size for each word.

  • aggregation (Callable, optional (default=numpy.mean)) – Aggregation method for Doc2Vec.

  • soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the nearest word by JaroWinkler ratio; if False, it will return an embedding filled with zeros.

Returns

dict

Return type

{‘score’}

sentence_level(corpus, isi_penting: str = None, top_k: int = 3, aggregation=<function mean>, soft: bool = False, **kwargs)[source]

Summarize list of strings / string on sentence level.

Parameters
  • corpus (str / List[str]) –

  • isi_penting (str, optional (default=None)) – if not None, the summary will prioritize content related to isi_penting.

  • top_k (int, (default=3)) – number of summarized strings.

  • aggregation (Callable, optional (default=numpy.mean)) – Aggregation method for Doc2Vec.

  • soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the nearest word by JaroWinkler ratio; if False, it will return an embedding filled with zeros.

Returns

dict

Return type

{‘summary’, ‘score’}

class malaya.model.extractive_summarization.Encoder[source]

malaya.model.ml

class malaya.model.ml.MulticlassBayes[source]
predict(strings: List[str])[source]

classify list of strings.

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

predict_proba(strings: List[str])[source]

classify list of strings and return probability.

Parameters

strings (List[str]) –

Returns

result

Return type

List[dict[str, float]]

class malaya.model.ml.BinaryBayes[source]
predict(strings: List[str], add_neutral: bool = True)[source]

classify list of strings.

Parameters
  • strings (List[str]) –

  • add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.

Returns

result

Return type

List[str]

predict_proba(strings: List[str], add_neutral: bool = True)[source]

classify list of strings and return probability.

Parameters
  • strings (List[str]) –

  • add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.

Returns

result

Return type

List[dict[str, float]]

class malaya.model.ml.MultilabelBayes[source]
predict(strings: List[str])[source]

classify list of strings.

Parameters

strings (List[str]) –

Returns

result

Return type

List[List[str]]

predict_proba(strings: List[str])[source]

classify list of strings and return probability.

Parameters

strings (list) –

Returns

result

Return type

List[dict[str, float]]

malaya.model.pegasus

class malaya.model.pegasus.Summarization[source]
greedy_decoder(strings: List[str], temperature=0.3, postprocess: bool = True, **kwargs)[source]

Summarize strings using greedy decoder.

Parameters
  • strings (List[str]) –

  • temperature (float, (default=0.3)) – logits * -log(random.uniform) * temperature.

  • postprocess (bool, optional (default=True)) – if True, will filter generated sentences using ROUGE score and remove international news publishers.

Returns

result

Return type

List[str]

nucleus_decoder(strings: List[str], top_p: float = 0.7, temperature: float = 0.3, postprocess: bool = True, **kwargs)[source]

Summarize strings using nucleus decoder.

Parameters
  • strings (List[str]) –

  • top_p (float, (default=0.7)) – cumulative distribution and cut off as soon as the CDF exceeds top_p.

  • temperature (float, (default=0.3)) – logits * -log(random.uniform) * temperature.

  • postprocess (bool, optional (default=True)) – if True, will filter generated sentences using ROUGE score and remove international news publishers.

Returns

result

Return type

List[str]

malaya.model.t5

class malaya.model.t5.Summarization[source]
greedy_decoder(strings: List[str], mode: str = 'ringkasan', postprocess: bool = True, **kwargs)[source]

Summarize strings. Decoder is greedy decoder with beam width size 1, alpha 0.5.

Parameters
  • strings (List[str]) –

  • mode (str) –

    mode for summarization. Allowed values:

    • 'ringkasan' - summarization for long sentence, eg, news summarization.

    • 'tajuk' - title summarization for long sentence, eg, news title.

  • postprocess (bool, optional (default=True)) – if True, will filter generated sentences using ROUGE score and remove international news publishers.

Returns

result

Return type

List[str]

class malaya.model.t5.Generator[source]
greedy_decoder(strings: List[str])[source]

generate a long text given an isi penting (list of important points). Decoder is greedy decoder with beam width size 1, alpha 0.5.

Parameters

strings (List[str]) –

Returns

result

Return type

str

class malaya.model.t5.Paraphrase[source]
greedy_decoder(strings: List[str], split_fullstop: bool = True)[source]

paraphrase strings. Decoder is greedy decoder with beam width size 1, alpha 0.5.

Parameters
  • strings (List[str]) –

  • split_fullstop (bool, (default=True)) – if True, will generate a paraphrase for each string split by full stop.

Returns

result

Return type

List[str]

malaya.model.tf

class malaya.model.tf.DeepLang[source]
predict(strings: List[str])[source]

classify list of strings.

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

predict_proba(strings: List[str])[source]

classify list of strings and return probability.

Parameters

strings (List[str]) –

Returns

result

Return type

List[dict[str, float]]

class malaya.model.tf.Translation[source]
greedy_decoder(strings: List[str])[source]

translate list of strings.

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

beam_decoder(strings: List[str])[source]

translate list of strings using beam decoder, beam width size 3, alpha 0.5.

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

class malaya.model.tf.Constituency[source]
vectorize(string: str)[source]

vectorize a string.

Parameters

string (str) –

Returns

result

Return type

np.array

parse_nltk_tree(string: str)[source]

Parse a string into an NLTK Tree; to make it useful, make sure you have tkinter installed.

Parameters

string (str) –

Returns

result

Return type

nltk.Tree object

parse_tree(string)[source]

Parse a string into string treebank format.

Parameters

string (str) –

Returns

result

Return type

malaya.text.trees.InternalTreebankNode class

class malaya.model.tf.TrueCase[source]
greedy_decoder(strings: List[str])[source]

True case strings using greedy decoder. Example, “saya nak makan di us makanan di sana sedap” -> “Saya nak makan di US, makanan di sana sedap.”

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

beam_decoder(strings: List[str])[source]

True case strings using beam decoder, beam width size 3, alpha 0.5. Example, “saya nak makan di us makanan di sana sedap” -> “Saya nak makan di US, makanan di sana sedap.”

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

class malaya.model.tf.Segmentation[source]
greedy_decoder(strings: List[str])[source]

Segment strings using greedy decoder. Example, “sayasygkan negarasaya” -> “saya sygkan negara saya”

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

beam_decoder(strings: List[str])[source]

Segment strings using beam decoder, beam width size 3, alpha 0.5. Example, “sayasygkan negarasaya” -> “saya sygkan negara saya”

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

class malaya.model.tf.Paraphrase[source]
greedy_decoder(strings: List[str], **kwargs)[source]

Paraphrase strings using greedy decoder.

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

beam_decoder(strings: List[str], **kwargs)[source]

Paraphrase strings using beam decoder, beam width size 3, alpha 0.5.

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

nucleus_decoder(strings: List[str], top_p: float = 0.7, **kwargs)[source]

Paraphrase strings using nucleus sampling.

Parameters
  • strings (List[str]) –

  • top_p (float, (default=0.7)) – cumulative distribution and cut off as soon as the CDF exceeds top_p.

Returns

result

Return type

List[str]

class malaya.model.tf.Tatabahasa[source]
greedy_decoder(strings: List[str])[source]

Fix kesalahan tatabahasa (grammatical errors).

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

class malaya.model.tf.SQUAD[source]
predict(paragraph_text: str, question_texts: List[str], doc_stride: int = 128, max_query_length: int = 64, max_answer_length: int = 64, n_best_size: int = 20)[source]

Predict Span from questions given a paragraph.

Parameters
  • paragraph_text (str) –

  • question_texts (List[str]) – list of questions; results depend heavily on the casing of the questions.

  • doc_stride (int, optional (default=128)) – striding size to split a paragraph into multiple texts.

  • max_query_length (int, optional (default=64)) – maximum length of question tokens.

  • max_answer_length (int, optional (default=64)) – maximum length of answer tokens.

Returns

result

Return type

List[{‘text’: ‘text’, ‘start’: 0, ‘end’: 1}]

vectorize(strings: List[str], method: str = 'first')[source]

vectorize list of strings.

Parameters
  • strings (List[str]) –

  • method (str, optional (default='first')) –

    Vectorization layer supported. Allowed values:

    • 'last' - vector from last sequence.

    • 'first' - vector from first sequence.

    • 'mean' - average vectors from all sequences.

    • 'word' - average vectors based on tokens.

Returns

result

Return type

np.array
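
A minimal sketch; qa_model stands for a hypothetical malaya.model.tf.SQUAD instance obtained from the relevant QA loader, which is not documented in this section:

   # qa_model: hypothetical malaya.model.tf.SQUAD instance from a QA loader
   paragraph = (
       'Tunku Abdul Rahman merupakan Perdana Menteri pertama Malaysia. '
       'Beliau dilahirkan di Alor Setar, Kedah.'
   )
   answers = qa_model.predict(
       paragraph,
       ['Siapakah Perdana Menteri pertama Malaysia?'],
       doc_stride=128,
       max_answer_length=64,
   )
   # each answer is a dict with 'text', 'start', 'end' keys
   print(answers)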

class malaya.model.tf.KnowledgeGraph[source]
greedy_decoder(strings: List[str], get_networkx: bool = True)[source]

Generate triples knowledge graph using greedy decoder. Example, “Joseph Enanga juga bermain untuk Union Douala.” -> “Joseph Enanga member of sports team Union Douala”

Parameters
  • strings (List[str]) –

  • get_networkx (bool, optional (default=True)) – If True, will generate networkx.MultiDiGraph.

Returns

result

Return type

List[Dict]

beam_decoder(strings: List[str], get_networkx: bool = True)[source]

Generate triples knowledge graph using beam decoder. Example, “Joseph Enanga juga bermain untuk Union Douala.” -> “Joseph Enanga member of sports team Union Douala”

Parameters
  • strings (List[str]) –

  • get_networkx (bool, optional (default=True)) – If True, will generate networkx.MultiDiGraph.

Returns

result

Return type

List[Dict]

malaya.model.xlnet

class malaya.model.xlnet.BinaryXLNET[source]
vectorize(strings: List[str], method: str = 'first')[source]

vectorize list of strings.

Parameters
  • strings (List[str]) –

  • method (str, optional (default='first')) –

    Vectorization layer supported. Allowed values:

    • 'last' - vector from last sequence.

    • 'first' - vector from first sequence.

    • 'mean' - average vectors from all sequences.

    • 'word' - average vectors based on tokens.

Returns

result

Return type

np.array

predict(strings: List[str], add_neutral: bool = True)[source]

classify list of strings.

Parameters
  • strings (List[str]) –

  • add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.

Returns

result

Return type

List[str]

predict_proba(strings: List[str], add_neutral: bool = True)[source]

classify list of strings and return probability.

Parameters
  • strings (List[str]) –

  • add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.

Returns

result

Return type

List[dict[str, float]]

predict_words(string: str, method: str = 'last', visualization: bool = True)[source]

classify words.

Parameters
  • string (str) –

  • method (str, optional (default='last')) –

    Attention layer supported. Allowed values:

    • 'last' - attention from last layer.

    • 'first' - attention from first layer.

    • 'mean' - average attentions from all layers.

  • visualization (bool, optional (default=True)) – If True, it will open the visualization dashboard.

Returns

result

Return type

dict

class malaya.model.xlnet.MulticlassXLNET[source]
vectorize(strings: List[str], method: str = 'first')[source]

vectorize list of strings.

Parameters
  • strings (List[str]) –

  • method (str, optional (default='first')) –

    Vectorization layer supported. Allowed values:

    • 'last' - vector from last sequence.

    • 'first' - vector from first sequence.

    • 'mean' - average vectors from all sequences.

    • 'word' - average vectors based on tokens.

Returns

result

Return type

np.array

predict(strings: List[str])[source]

classify list of strings.

Parameters

strings (List[str]) –

Returns

result

Return type

List[str]

predict_proba(strings: List[str])[source]

classify list of strings and return probability.

Parameters

strings (List[str]) –

Returns

result

Return type

List[dict[str, float]]

predict_words(string: str, method: str = 'last', visualization: bool = True)[source]

classify words.

Parameters
  • string (str) –

  • method (str, optional (default='last')) –

    Attention layer supported. Allowed values:

    • 'last' - attention from last layer.

    • 'first' - attention from first layer.

    • 'mean' - average attentions from all layers.

  • visualization (bool, optional (default=True)) – If True, it will open the visualization dashboard.

Returns

result

Return type

dict

class malaya.model.xlnet.SigmoidXLNET[source]
vectorize(strings: List[str], method: str = 'first')[source]

vectorize list of strings.

Parameters
  • strings (List[str]) –

  • method (str, optional (default='first')) –

    Vectorization layer supported. Allowed values:

    • 'last' - vector from last sequence.

    • 'first' - vector from first sequence.

    • 'mean' - average vectors from all sequences.

    • 'word' - average vectors based on tokens.

Returns

result

Return type

np.array

predict(strings: List[str])[source]

classify list of strings.

Parameters

strings (List[str]) –

Returns

result

Return type

List[List[str]]

predict_proba(strings: List[str])[source]

classify list of strings and return probability.

Parameters

strings (List[str]) –

Returns

result

Return type

List[dict[str, float]]

predict_words(string: str, method: str = 'last', visualization: bool = True)[source]

classify words.

Parameters
  • string (str) –

  • method (str, optional (default='last')) –

    Attention layer supported. Allowed values:

    • 'last' - attention from last layer.

    • 'first' - attention from first layer.

    • 'mean' - average attentions from all layers.

  • visualization (bool, optional (default=True)) – If True, it will open the visualization dashboard.

Returns

result

Return type

dict

class malaya.model.xlnet.SiameseXLNET[source]
vectorize(strings: List[str])[source]

Vectorize list of strings.

Parameters

strings (List[str]) –

Returns

result

Return type

np.array

predict_proba(strings_left: List[str], strings_right: List[str])[source]

calculate similarity for two different batches of texts.

Parameters
  • strings_left (List[str]) –

  • strings_right (List[str]) –

Returns

result

Return type

List[float]

heatmap(strings: List[str], visualize: bool = True, annotate: bool = True, figsize: Tuple[int, int] = (7, 7))[source]

plot a heatmap based on output from similarity.

Parameters
  • strings (list of str) – list of strings.

  • visualize (bool) – if True, it will render plt.show, else return data.

  • figsize (tuple, (default=(7, 7))) – figure size for plot.

Returns

result – list of results

Return type

list

class malaya.model.xlnet.TaggingXLNET[source]
vectorize(string: str)[source]

vectorize a string.

Parameters

string (str) –

Returns

result

Return type

np.array

analyze(string: str)[source]

Analyze a string.

Parameters

string (str) –

Returns

result

Return type

{‘words’: List[str], ‘tags’: [{‘text’: ‘text’, ‘type’: ‘location’, ‘score’: 1.0, ‘beginOffset’: 0, ‘endOffset’: 1}]}

predict(string: str)[source]

Tag a string.

Parameters

string (str) –

Returns

result

Return type

Tuple[str, str]

class malaya.model.xlnet.DependencyXLNET[source]
vectorize(string: str)[source]

vectorize a string.

Parameters

string (str) –

Returns

result

Return type

np.array

predict(string: str)[source]

Tag a string.

Parameters

string (str) –

Returns

result

Return type

Tuple

class malaya.model.xlnet.ZeroshotXLNET[source]
vectorize(strings: List[str], labels: List[str], method: str = 'first')[source]

vectorize a string.

Parameters
  • strings (List[str]) –

  • labels (List[str]) –

  • method (str, optional (default='first')) –

    Vectorization layer supported. Allowed values:

    • 'last' - vector from last sequence.

    • 'first' - vector from first sequence.

    • 'mean' - average vectors from all sequences.

    • 'word' - average vectors based on tokens.

Returns

result

Return type

np.array

predict_proba(strings: List[str], labels: List[str])[source]

classify list of strings and return probability.

Parameters
  • strings (List[str]) –

  • labels (List[str]) –

Returns

result

Return type

List[float]

malaya.transformers.albert

malaya.transformers.albert.load(model: str = 'albert', **kwargs)[source]

Load albert model.

Parameters

model (str, optional (default='albert')) –

Model architecture supported. Allowed values:

  • 'albert' - base albert-bahasa released by Malaya.

  • 'tiny-albert' - tiny albert-bahasa released by Malaya.

Returns

result

Return type

malaya.transformers.albert.Model class

class malaya.transformers.albert.Model[source]
vectorize(strings: List[str])[source]

Vectorize string inputs.

Parameters

strings (List[str]) –

Returns

result

Return type

np.array

attention(strings: List[str], method: str = 'last', **kwargs)[source]

Get attention for string inputs.

Parameters
  • strings (List[str]) –

  • method (str, optional (default='last')) –

    Attention layer supported. Allowed values:

    • 'last' - attention from last layer.

    • 'first' - attention from first layer.

    • 'mean' - average attentions from all layers.

Returns

result

Return type

List[List[Tuple[str, float]]]

visualize_attention(string: str)[source]

Visualize attention.

Parameters

string (str) –

malaya.transformers.alxlnet

malaya.transformers.alxlnet.load(model: str = 'alxlnet', pool_mode: str = 'last', **kwargs)[source]

Load alxlnet model.

Parameters
  • model (str, optional (default='alxlnet')) –

    Model architecture supported. Allowed values:

    • 'alxlnet' - XLNET architecture from google + Malaya.

  • pool_mode (str, optional (default='last')) –

    Model logits architecture supported. Allowed values:

    • 'last' - last of the sequence.

    • 'first' - first of the sequence.

    • 'mean' - mean of the sequence.

    • 'attn' - attention of the sequence.

Returns

result

Return type

malaya.transformers.alxlnet.Model class

class malaya.transformers.alxlnet.Model[source]
vectorize(strings: List[str])[source]

Vectorize string inputs.

Parameters

strings (List[str]) –

Returns

result

Return type

np.array

attention(strings: List[str], method: str = 'last', **kwargs)[source]

Get attention for string inputs.

Parameters
  • strings (List[str]) –

  • method (str, optional (default='last')) –

    Attention layer supported. Allowed values:

    • 'last' - attention from last layer.

    • 'first' - attention from first layer.

    • 'mean' - average attentions from all layers.

Returns

result

Return type

List[List[Tuple[str, float]]]

visualize_attention(string: str)[source]

Visualize attention.

Parameters

string (str) –

malaya.transformers.bert

malaya.transformers.bert.load(model: str = 'bert', **kwargs)[source]

Load bert model.

Parameters

model (str, optional (default='bert')) –

Model architecture supported. Allowed values:

  • 'bert' - base bert-bahasa released by Malaya.

  • 'tiny-bert' - tiny bert-bahasa released by Malaya.

Returns

result

Return type

malaya.transformers.bert.Model class

class malaya.transformers.bert.Model[source]
vectorize(strings: List[str])[source]

Vectorize string inputs.

Parameters

strings (List[str]) –

Returns

result

Return type

np.array

attention(strings: List[str], method: str = 'last', **kwargs)[source]

Get attention for string inputs.

Parameters
  • strings (List[str]) –

  • method (str, optional (default='last')) –

    Attention layer supported. Allowed values:

    • 'last' - attention from last layer.

    • 'first' - attention from first layer.

    • 'mean' - average attentions from all layers.

Returns

result

Return type

List[List[Tuple[str, float]]]

visualize_attention(string: str)[source]

Visualize attention.

Parameters

string (str) –

malaya.transformers.electra

malaya.transformers.electra.load(model: str = 'electra', **kwargs)[source]

Load electra model.

Parameters

model (str, optional (default='electra')) –

Model architecture supported. Allowed values:

  • 'electra' - base electra-bahasa released by Malaya.

  • 'small-electra' - small electra-bahasa released by Malaya.

Returns

result

Return type

malaya.transformers.electra.Model class

class malaya.transformers.electra.Model[source]
vectorize(strings: List[str])[source]

Vectorize string inputs.

Parameters

strings (List[str]) –

Returns

result

Return type

np.array

attention(strings: List[str], method: str = 'last', **kwargs)[source]

Get attention for string inputs.

Parameters
  • strings (List[str]) –

  • method (str, optional (default='last')) –

    Attention layer supported. Allowed values:

    • 'last' - attention from last layer.

    • 'first' - attention from first layer.

    • 'mean' - average attentions from all layers.

Returns

result

Return type

List[List[Tuple[str, float]]]

visualize_attention(string: str)[source]

Visualize attention.

Parameters

string (str) –

malaya.transformers.gpt2

malaya.transformers.gpt2.load(model='345M', generate_length=100, temperature=1.0, top_k=40, **kwargs)[source]

Load gpt2 model.

Parameters
  • model (str, optional (default='345M')) –

    Model architecture supported. Allowed values:

    • '117M' - GPT2 117M parameters.

    • '345M' - GPT2 345M parameters.

  • generate_length (int, optional (default=100)) – length of sentence to generate.

  • temperature (float, optional (default=1.0)) – temperature value; should be between 0 and 1.

  • top_k (int, optional (default=40)) – k for top-k sampling.

Returns

result

Return type

malaya.transformers.gpt2.Model class

class malaya.transformers.gpt2.Model[source]
generate(string: str)[source]

generate a text given an initial string.

Parameters

string (str) –

Returns

result

Return type

str
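
A minimal sketch; the prompt is illustrative and the continuation is sampled, so outputs vary:

   import malaya

   # 117M is the smaller checkpoint; generate_length caps the output length
   model = malaya.transformers.gpt2.load(model='117M', generate_length=100)

   print(model.generate('ceritanya sebegini, aku bangun pagi'))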

malaya.transformers.xlnet

malaya.transformers.xlnet.load(model: str = 'xlnet', pool_mode: str = 'last', **kwargs)[source]

Load xlnet model.

Parameters
  • model (str, optional (default='xlnet')) –

    Model architecture supported. Allowed values:

    • 'xlnet' - XLNET architecture from google.

  • pool_mode (str, optional (default='last')) –

    Model logits architecture supported. Allowed values:

    • 'last' - last of the sequence.

    • 'first' - first of the sequence.

    • 'mean' - mean of the sequence.

    • 'attn' - attention of the sequence.

Returns

result

Return type

malaya.transformers.xlnet.Model class

class malaya.transformers.xlnet.Model[source]
vectorize(strings: List[str])[source]

Vectorize string inputs.

Parameters

strings (List[str]) –

Returns

result

Return type

np.array

attention(strings: List[str], method: str = 'last', **kwargs)[source]

Get attention for string inputs.

Parameters
  • strings (List[str]) –

  • method (str, optional (default='last')) –

    Attention layer supported. Allowed values:

    • 'last' - attention from last layer.

    • 'first' - attention from first layer.

    • 'mean' - average attentions from all layers.

Returns

result

Return type

List[List[Tuple[str, float]]]

visualize_attention(string: str)[source]

Visualize attention.

Parameters

string (str) –