API Reference

malaya

malaya.print_cache(location=None)[source]

Print cached data. If location is None, the entire cache folder is printed.

malaya.clear_all_cache()[source]

Remove cached data. This will delete the entire cache folder.

malaya.clear_cache(location)[source]

Remove selected cached data. Run malaya.print_cache() to get available paths.
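The cache helpers above simply walk a directory tree. A minimal sketch of how such a printer might work, assuming the cache is a plain folder of files (`root` here is a throwaway temp directory standing in for Malaya's real cache folder):

```python
import os
import tempfile

def print_cache(root, location=None):
    """Walk a cache folder and print relative paths of its contents.
    A sketch of the behaviour documented for malaya.print_cache, not
    the actual implementation."""
    base = os.path.join(root, location) if location else root
    entries = []
    for dirpath, _, filenames in os.walk(base):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            entries.append(rel)
            print(rel)
    return entries

# Demo against a temp directory standing in for the cache.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'dictionary'))
open(os.path.join(root, 'dictionary', 'malay-text.txt'), 'w').close()
found = print_cache(root)
```

The returned paths are exactly what you would pass to malaya.clear_cache(location).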

malaya.load_malay_dictionary()[source]

Load the 20k-word Pustaka dictionary.

Returns:list
Return type:list of strings
malaya.load_200k_malay_dictionary()[source]

Load the 200k-word dictionary.

Returns:list
Return type:list of strings
malaya.describe_pos_malaya()[source]

Describe Malaya Part-Of-Speech supported (deprecated, use describe_pos() instead)

malaya.describe_pos()[source]

Describe Part-Of-Speech supported

malaya.describe_entities_malaya()[source]

Describe Malaya Entities supported (deprecated, use describe_entities() instead)

malaya.describe_entities()[source]

Describe Entities supported

malaya.describe_dependency()[source]

Describe Dependency supported

malaya.bert

malaya.bert.available_bert_model()[source]

List available bert models.

malaya.bert.bert(model='base', validate=True)[source]

Load bert model.

Parameters:
  • model (str, optional (default='base')) –

    Model architecture supported. Allowed values:

    • 'multilanguage' - bert multilanguage released by Google.
    • 'base' - base bert-bahasa released by Malaya.
    • 'small' - small bert-bahasa released by Malaya.
  • validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:

BERT_MODEL

Return type:

malaya.bert._Model class

class malaya.bert._Model[source]
vectorize(strings)[source]

Vectorize string inputs using bert attention.

Parameters:strings (str / list of str) –
Returns:array
Return type:vectorized strings
attention(strings, method='last', **kwargs)[source]

Get attention weights for string inputs from BERT attention.

Parameters:
  • strings (str / list of str) –
  • method (str, optional (default='last')) –

    Attention layer supported. Allowed values:

    • 'last' - attention from last layer.
    • 'first' - attention from first layer.
    • 'mean' - average attentions from all layers.
Returns:

array

Return type:

attention
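The `method` parameter selects which layer's attention is returned. A toy sketch of that selection logic, using plain lists of per-token weights as a hypothetical stand-in for real BERT attention tensors:

```python
def pool_attention(layer_attentions, method='last'):
    """Select attention across layers, mirroring the 'last' /
    'first' / 'mean' options of _Model.attention. Each element of
    layer_attentions is one layer's per-token attention weights."""
    if method == 'last':
        return layer_attentions[-1]
    if method == 'first':
        return layer_attentions[0]
    if method == 'mean':
        n = len(layer_attentions)
        # Average each token position across all layers.
        return [sum(col) / n for col in zip(*layer_attentions)]
    raise ValueError("method must be 'last', 'first' or 'mean'")

# Two layers, two tokens.
layers = [[0.2, 0.8], [0.6, 0.4]]
```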

malaya.cluster

malaya.cluster.cluster_words(list_words)[source]

cluster similar words based on structure, e.g. ['mahathir mohamad', 'mahathir'] -> ['mahathir mohamad']

Parameters:list_words (list of str) –
Returns:list
Return type:list of clustered words
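A simplified sketch of the documented behaviour (drop any phrase that is a substring of a longer phrase in the list); this is an illustration of the contract, not Malaya's actual implementation:

```python
def cluster_words(list_words):
    """Keep only phrases that are not contained in a longer phrase,
    e.g. ['mahathir mohamad', 'mahathir'] -> ['mahathir mohamad']."""
    results = []
    for word in list_words:
        # Is this phrase a substring of some other, longer phrase?
        longer = [w for w in list_words if w != word and word in w]
        if not longer:
            results.append(word)
    return results
```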
malaya.cluster.cluster_pos(result)[source]

cluster similar POS.

Parameters:result (list) –
Returns:result
Return type:list
malaya.cluster.cluster_tagging(result)[source]

cluster any tagging results, as long as the data passed is in the form [(string, label), (string, label)].

Parameters:result (list) –
Returns:result
Return type:list
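The expected input shape is a list of (string, label) pairs. A sketch of one plausible clustering rule, merging consecutive tokens that share a label (an assumption for illustration; Malaya's grouping logic may differ):

```python
def cluster_tagging(result):
    """Merge consecutive (string, label) pairs that share a label,
    e.g. multi-token entities collapse into one cluster."""
    clusters = []
    for string, label in result:
        if clusters and clusters[-1][1] == label:
            # Same label as the previous cluster: extend it.
            clusters[-1] = (clusters[-1][0] + ' ' + string, label)
        else:
            clusters.append((string, label))
    return clusters
```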
malaya.cluster.cluster_entities(result)[source]

cluster similar Entities.

Parameters:result (list) –
Returns:result
Return type:list
malaya.cluster.cluster_scatter(corpus, titles=None, colors=None, stemming=True, max_df=0.95, min_df=2, ngram=(1, 3), cleaning=<function simple_textcleaning>, vectorizer='bow', stop_words=None, num_clusters=5, clustering=<sklearn clustering class>, decomposition=<sklearn decomposition class>, figsize=(17, 9))[source]

plot scatter plot on similar text clusters.

Parameters:
  • corpus (list) –
  • titles (list) – list of titles; length must be the same as corpus.
  • colors (list) – list of colors; length must be the same as num_clusters.
  • num_clusters (int, (default=5)) – size of unsupervised clusters.
  • stemming (bool, (default=True)) – If True, the Sastrawi stemmer will be applied.
  • max_df (float, (default=0.95)) – maximum document frequency for a word to be selected.
  • min_df (int, (default=2)) – minimum document frequency for a word to be selected.
  • ngram (tuple, (default=(1,3))) – n-gram sizes to train on the corpus.
  • cleaning (function, (default=simple_textcleaning)) – function to clean the corpus.
  • stop_words (list, (default=None)) – list of stop words to remove. If None, default is malaya.texts._text_functions.STOPWORDS.
  • vectorizer (str, (default='bow')) –

    vectorizer technique. Allowed values:

    • 'bow' - Bag of Words.
    • 'tfidf' - Term Frequency-Inverse Document Frequency.
    • 'skip-gram' - Bag of Words with skipping certain n-grams.
Returns:

dictionary – {'X': X, 'Y': Y, 'labels': clusters, 'vector': transformed_text_clean, 'titles': titles}

malaya.cluster.cluster_dendogram(corpus, titles=None, stemming=True, max_df=0.95, min_df=2, ngram=(1, 3), cleaning=<function simple_textcleaning>, vectorizer='bow', stop_words=None, random_samples=0.3, figsize=(17, 9), **kwargs)[source]

plot hierarchical dendrogram with similar texts.

Parameters:
  • corpus (list) –
  • titles (list) – list of titles; length must be the same as corpus.
  • stemming (bool, (default=True)) – If True, the Sastrawi stemmer will be applied.
  • max_df (float, (default=0.95)) – maximum document frequency for a word to be selected.
  • min_df (int, (default=2)) – minimum document frequency for a word to be selected.
  • ngram (tuple, (default=(1,3))) – n-gram sizes to train on the corpus.
  • cleaning (function, (default=simple_textcleaning)) – function to clean the corpus.
  • stop_words (list, (default=None)) – list of stop words to remove. If None, default is malaya.texts._text_functions.STOPWORDS.
  • vectorizer (str, (default='bow')) –

    vectorizer technique. Allowed values:

    • 'bow' - Bag of Words.
    • 'tfidf' - Term Frequency-Inverse Document Frequency.
    • 'skip-gram' - Bag of Words with skipping certain n-grams.
Returns:

dictionary

Return type:

{'linkage_matrix': linkage_matrix, 'titles': titles}

malaya.cluster.cluster_graph(corpus, titles=None, colors=None, threshold=0.3, stemming=True, max_df=0.95, min_df=2, ngram=(1, 3), cleaning=<function simple_textcleaning>, vectorizer='bow', stop_words=None, num_clusters=5, clustering=<sklearn clustering class>, figsize=(17, 9), with_labels=True, **kwargs)[source]

plot undirected graph with similar texts.

Parameters:
  • corpus (list) –
  • titles (list) – list of titles; length must be the same as corpus.
  • colors (list) – list of colors; length must be the same as num_clusters.
  • threshold (float, (default=0.3)) – threshold to assume similarity for the covariance matrix.
  • num_clusters (int, (default=5)) – size of unsupervised clusters.
  • stemming (bool, (default=True)) – If True, the Sastrawi stemmer will be applied.
  • max_df (float, (default=0.95)) – maximum document frequency for a word to be selected.
  • min_df (int, (default=2)) – minimum document frequency for a word to be selected.
  • ngram (tuple, (default=(1,3))) – n-gram sizes to train on the corpus.
  • cleaning (function, (default=simple_textcleaning)) – function to clean the corpus.
  • stop_words (list, (default=None)) – list of stop words to remove. If None, default is malaya.texts._text_functions.STOPWORDS.
  • vectorizer (str, (default='bow')) –

    vectorizer technique. Allowed values:

    • 'bow' - Bag of Words.
    • 'tfidf' - Term Frequency-Inverse Document Frequency.
    • 'skip-gram' - Bag of Words with skipping certain n-grams.
Returns:

dictionary – {'G': G, 'pos': pos, 'node_colors': node_colors, 'node_labels': node_labels}

malaya.cluster.cluster_entity_linking(corpus, entity_model, topic_modeling_model, topic_decomposition=2, topic_length=10, threshold=0.3, fuzzy_ratio=70, accepted_entities=['law', 'location', 'organization', 'person', 'event'], colors=None, max_df=1.0, min_df=1, ngram=(2, 3), stemming=True, cleaning=<function simple_textcleaning>, vectorizer='bow', stop_words=None, figsize=(17, 9), **kwargs)[source]

plot undirected graph for entity and topic relationships.

Parameters:
  • corpus (list or str) –
  • titles (list) – list of titles; length must be the same as corpus.
  • colors (list) – list of colors; length must be the same as num_clusters.
  • threshold (float, (default=0.3)) – threshold to assume similarity for the covariance matrix.
  • topic_decomposition (int, (default=2)) – size of decomposition.
  • topic_length (int, (default=10)) – size of topic models.
  • fuzzy_ratio (int, (default=70)) – ratio threshold for fuzzywuzzy.
  • stemming (bool, (default=True)) – If True, the Sastrawi stemmer will be applied.
  • max_df (float, (default=1.0)) – maximum document frequency for a word to be selected.
  • min_df (int, (default=1)) – minimum document frequency for a word to be selected.
  • ngram (tuple, (default=(2,3))) – n-gram sizes to train on the corpus.
  • cleaning (function, (default=simple_textcleaning)) – function to clean the corpus.
  • stop_words (list, (default=None)) – list of stop words to remove. If None, default is malaya.texts._text_functions.STOPWORDS.
  • vectorizer (str, (default='bow')) –

    vectorizer technique. Allowed values:

    • 'bow' - Bag of Words.
    • 'tfidf' - Term Frequency-Inverse Document Frequency.
    • 'skip-gram' - Bag of Words with skipping certain n-grams.
Returns:

dictionary – {'G': G, 'pos': pos, 'node_colors': node_colors, 'node_labels': node_labels}

malaya.dependency

malaya.dependency.dependency_graph(tagging, indexing)[source]

Return a helper object for dependency parser results. Only accepts tagging and indexing outputs from dependency models.

malaya.dependency.available_deep_model()[source]

List available deep learning dependency models, ['concat', 'bahdanau', 'luong']

malaya.dependency.crf(validate=True)[source]

Load CRF dependency model.

Parameters:validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:DEPENDENCY
Return type:malaya._models._sklearn_model.DEPENDENCY class
malaya.dependency.deep_model(model='bahdanau', validate=True)[source]

Load deep learning dependency model.

Parameters:
  • model (str, optional (default='bahdanau')) –

    Model architecture supported. Allowed values:

    • 'concat' - Concatenating character and word embeddings for BiLSTM.
    • 'bahdanau' - Concatenating character and word embeddings with Bahdanau attention for BiLSTM.
    • 'luong' - Concatenating character and word embeddings with Luong attention for BiLSTM.
  • validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:

DEPENDENCY

Return type:

malaya._models._tensorflow_model.DEPENDENCY class

malaya.emotion

malaya.emotion.available_deep_model()[source]

List available deep learning emotion analysis models.

malaya.emotion.available_bert_model()[source]

List available bert emotion analysis models.

malaya.emotion.deep_model(model='luong', validate=True)[source]

Load deep learning emotion analysis model.

Parameters:
  • model (str, optional (default='luong')) –

    Model architecture supported. Allowed values:

    • 'self-attention' - fastText-like architecture: embedding and logits layers only, with self-attention.
    • 'bahdanau' - LSTM with bahdanau attention architecture.
    • 'luong' - LSTM with luong attention architecture.
  • validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:

SOFTMAX

Return type:

malaya._models._tensorflow_model.SOFTMAX class

malaya.emotion.multinomial(validate=True)[source]

Load multinomial emotion model.

Parameters:validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:BAYES
Return type:malaya._models._sklearn_model.BAYES class
malaya.emotion.xgb(validate=True)[source]

Load XGB emotion model.

Parameters:validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:XGB
Return type:malaya._models._sklearn_model.XGB class
malaya.emotion.bert(model='base', validate=True)[source]

Load BERT emotion model.

Parameters:
  • model (str, optional (default='base')) –

    Model architecture supported. Allowed values:

    • 'multilanguage' - bert multilanguage released by Google, trained on emotion analysis.
    • 'base' - base bert-bahasa released by Malaya, trained on emotion analysis.
    • 'small' - small bert-bahasa released by Malaya, trained on emotion analysis.
  • validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:

MULTICLASS_BERT

Return type:

malaya._models._bert_model.MULTICLASS_BERT class

malaya.entity

malaya.entity.available_deep_model()[source]

List available deep learning entity models, ['concat', 'bahdanau', 'luong']

malaya.entity.available_bert_model()[source]

List available bert entity models, ['multilanguage', 'base', 'small']

malaya.entity.deep_model(model='bahdanau', validate=True)[source]

Load deep learning NER model.

Parameters:
  • model (str, optional (default='bahdanau')) –

    Model architecture supported. Allowed values:

    • 'concat' - Concatenating character and word embeddings for BiLSTM.
    • 'bahdanau' - Concatenating character and word embeddings with Bahdanau attention for BiLSTM.
    • 'luong' - Concatenating character and word embeddings with Luong attention for BiLSTM.
  • validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:

TAGGING

Return type:

malaya._models._tensorflow_model.TAGGING class

malaya.entity.bert(model='base', validate=True)[source]

Load BERT NER model.

Parameters:
  • model (str, optional (default='base')) –

    Model architecture supported. Allowed values:

    • 'multilanguage' - bert multilanguage released by Google, trained on NER.
    • 'base' - base bert-bahasa released by Malaya, trained on NER.
    • 'small' - small bert-bahasa released by Malaya, trained on NER.
  • validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:

TAGGING_BERT

Return type:

malaya._models._tensorflow_model.TAGGING_BERT class

malaya.entity.general_entity(model=None)[source]

Load Regex based general entities tagging along with another supervised entity tagging model.

Parameters:model (object) – model must has predict method. Make sure the predict method returned [(string, label), (string, label)].
Returns:_Entity_regex
Return type:malaya.texts._entity._Entity_regex class
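The model passed to general_entity only has to satisfy a duck-typed contract: a predict method returning [(string, label), ...]. A minimal object meeting that contract (the 'OTHER' tag here is made up for illustration):

```python
class DummyTagger:
    """Smallest object satisfying the contract documented above:
    predict(string) must return a list of (string, label) tuples."""
    def predict(self, string):
        return [(word, 'OTHER') for word in string.split()]

model = DummyTagger()
out = model.predict('makan di kuala lumpur')
```

Any real Malaya entity model (e.g. from malaya.entity.deep_model) already exposes this interface.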

malaya.generator

malaya.generator.ngrams(sequence, n, pad_left=False, pad_right=False, left_pad_symbol=None, right_pad_symbol=None)[source]

generate n-grams.

Parameters:
  • sequence (list of str) – list of tokenized words.
  • n (int) – n-gram size.
Returns:

ngram

Return type:

list
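A plain-Python sketch of the same interface, including the optional padding parameters (an illustration of the documented signature, not Malaya's internals):

```python
def ngrams(sequence, n, pad_left=False, pad_right=False,
           left_pad_symbol=None, right_pad_symbol=None):
    """Generate n-grams from a list of tokenized words."""
    seq = list(sequence)
    if pad_left:
        seq = [left_pad_symbol] * (n - 1) + seq
    if pad_right:
        seq = seq + [right_pad_symbol] * (n - 1)
    # Slide a window of size n over the (possibly padded) sequence.
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]
```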

malaya.generator.pos_entities_ngram(result_pos, result_entities, ngram=(1, 3), accept_pos=['NOUN', 'PROPN', 'VERB'], accept_entities=['law', 'location', 'organization', 'person', 'time'])[source]

generate n-grams from POS and entity recognition results.

Parameters:
  • result_pos (list of tuple) – result from POS recognition.
  • result_entities (list of tuple) – result from entity recognition.
  • ngram (tuple) – n-gram sizes.
  • accept_pos (list of str) – accepted POS elements.
  • accept_entities (list of str) – accepted entity elements.
Returns:

result

Return type:

list

malaya.generator.sentence_ngram(sentence, ngram=(1, 3))[source]

generate n-grams for a text.

Parameters:
  • sentence (str) –
  • ngram (tuple) – ngram sizes.
Returns:

result

Return type:

list

malaya.generator.w2v_augmentation(string, w2v, threshold=0.5, soft=False, random_select=True, augment_counts=1, top_n=5, cleaning_function=<function _simple_textcleaning>)[source]

augment a string using word2vec.

Parameters:
  • string (str) –
  • w2v (object) – word2vec interface object.
  • threshold (float, optional (default=0.5)) – random selection threshold for a word.
  • soft (bool, optional (default=False)) – if True, a word not in the dictionary will be replaced with the nearest fuzzywuzzy-ratio match; if False, an exception is thrown for words not in the dictionary.
  • random_select (bool, (default=True)) – if True, a word is randomly selected from the pool; if False, selection is based on the index.
  • augment_counts (int, (default=1)) – number of augmentations for a string.
  • top_n (int, (default=5)) – number of nearest neighbors returned.
  • cleaning_function (function, (default=malaya.generator._simple_textcleaning)) –
Returns:

result

Return type:

list

malaya.language_detection

malaya.language_detection.label()[source]

Return language labels dictionary.

malaya.language_detection.multinomial(validate=True)[source]

Load multinomial language detection model.

Parameters:validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.

Returns:LANGUAGE_DETECTION
Return type:malaya._models._sklearn_model.LANGUAGE_DETECTION class
malaya.language_detection.sgd(validate=True)[source]

Load SGD language detection model.

Parameters:validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.

Returns:LANGUAGE_DETECTION
Return type:malaya._models._sklearn_model.LANGUAGE_DETECTION class
malaya.language_detection.xgb(validate=True)[source]

Load XGB language detection model.

Parameters:validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.

Returns:LANGUAGE_DETECTION
Return type:malaya._models._sklearn_model.LANGUAGE_DETECTION class
malaya.language_detection.deep_model(validate=True)[source]

Load deep learning language detection model.

Parameters:validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.

Returns:DEEP_LANG
Return type:malaya._models._tensorflow_model.DEEP_LANG class

malaya.normalize

malaya.normalize.spell(speller)[source]

Train a Spelling Normalizer

Parameters:speller (Malaya spelling correction object) –
Returns:_SPELL_NORMALIZE
Return type:malaya.normalizer._SPELL_NORMALIZE class
class malaya.normalize._SPELL_NORMALIZE[source]
normalize(string, check_english=True)[source]

Normalize a string

Parameters:
  • string (str) –
  • check_english (bool, (default=True)) – check whether each word exists in an English dictionary.
Returns:

string

Return type:

normalized string
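The normalizer built by malaya.normalize.spell leans on the supplied speller's correct method. A toy sketch of that interaction, with a dummy speller and a hypothetical word list (the real _SPELL_NORMALIZE also handles English checks and many more rules):

```python
class DummySpeller:
    """Stand-in for a Malaya spelling-correction object: the
    normalizer only relies on a correct() method."""
    def correct(self, word, **kwargs):
        # Hypothetical shortform -> full-form mapping for the demo.
        return {'tmpt': 'tempat', 'mkn': 'makan'}.get(word, word)

def normalize(string, speller):
    """Run each token through the speller; a simplified sketch of
    spelling-based normalization."""
    return ' '.join(speller.correct(word) for word in string.split())
```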

malaya.num2word

malaya.num2word.to_cardinal(number)[source]

Translate from number input to cardinal text representation

Parameters:number (int) –
Returns:string
Return type:cardinal representation
malaya.num2word.to_ordinal(number)[source]

Translate from number input to ordinal text representation

Parameters:number (int) –
Returns:string
Return type:ordinal representation
malaya.num2word.to_ordinal_num(number)[source]

Translate from number input to ordinal numbering text representation

Parameters:number (int) –
Returns:string
Return type:ordinal numbering representation
malaya.num2word.to_currency(value)[source]

Translate from number input to cardinal currency text representation

Parameters:value (int) –
Returns:string
Return type:cardinal currency representation
malaya.num2word.to_year(value)[source]

Translate from number input to cardinal year text representation

Parameters:value (int) –
Returns:string
Return type:cardinal year representation
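to_cardinal maps integers to Malay number words. A toy lookup covering only the base cases 0-10, as a sketch of the idea (the real function composes these into arbitrarily large numbers):

```python
# Malay cardinal names for 0-10; larger numbers are composed from
# these base cases in the real implementation.
CARDINALS = ['kosong', 'satu', 'dua', 'tiga', 'empat', 'lima',
             'enam', 'tujuh', 'lapan', 'sembilan', 'sepuluh']

def to_cardinal(number):
    """Cardinal text for small ints; a toy stand-in for
    malaya.num2word.to_cardinal."""
    if 0 <= number <= 10:
        return CARDINALS[number]
    raise NotImplementedError('sketch only covers 0-10')
```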

malaya.pos

malaya.pos.available_deep_model()[source]

List available deep learning part-of-speech models, ['concat', 'bahdanau', 'luong'].

malaya.pos.available_bert_model()[source]

List available bert part-of-speech models, ['multilanguage', 'base', 'small']

malaya.pos.naive(string)[source]

Recognize POS in a string using Regex.

Parameters:string (str) –
Returns:string
Return type:tokenized string with related POS tags
malaya.pos.deep_model(model='concat', validate=True)[source]

Load deep learning POS Recognition model.

Parameters:
  • model (str, optional (default='concat')) –

    Model architecture supported. Allowed values:

    • 'concat' - Concatenating character and word embeddings for BiLSTM.
    • 'bahdanau' - Concatenating character and word embeddings with Bahdanau attention for BiLSTM.
    • 'luong' - Concatenating character and word embeddings with Luong attention for BiLSTM.
  • validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:

TAGGING

Return type:

malaya._models._tensorflow_model.TAGGING class

malaya.pos.bert(model='base', validate=True)[source]

Load BERT POS model.

Parameters:
  • model (str, optional (default='base')) –

    Model architecture supported. Allowed values:

    • 'multilanguage' - bert multilanguage released by Google, trained on POS.
    • 'base' - base bert-bahasa released by Malaya, trained on POS.
    • 'small' - small bert-bahasa released by Malaya, trained on POS.
  • validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:

TAGGING_BERT

Return type:

malaya._models._tensorflow_model.TAGGING_BERT class

malaya.preprocessing

malaya.preprocessing.unpack_english_contractions(text)[source]

Replace English contractions in a text str with their unshortened forms. N.B. the "'d" and "'s" forms are ambiguous (had/would, is/has/possessive), so they are left as-is. Note: this function is taken from textacy (https://github.com/chartbeat-labs/textacy).
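The idea is a series of regex substitutions. A sketch with a few illustrative patterns (textacy's original covers many more forms, and deliberately leaves "'d" / "'s" alone because they are ambiguous):

```python
import re

# Order matters: irregular forms first, then the generic "n't" rule.
CONTRACTIONS = [
    (re.compile(r"\bcan't\b", re.I), 'can not'),
    (re.compile(r"\bwon't\b", re.I), 'will not'),
    (re.compile(r"(\w+)n't\b", re.I), r'\1 not'),
]

def unpack_english_contractions(text):
    """Expand a few English contractions via regex substitution."""
    for pattern, repl in CONTRACTIONS:
        text = pattern.sub(repl, text)
    return text
```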

malaya.preprocessing.preprocessing(normalize=['url', 'email', 'percent', 'money', 'phone', 'user', 'time', 'date', 'number'], annotate=['allcaps', 'elongated', 'repeated', 'emphasis', 'censored', 'hashtag'], lowercase=True, fix_unidecode=True, expand_hashtags=True, expand_english_contractions=True, translate_english_to_bm=True, remove_postfix=True, maxlen_segmenter=20, validate=True, speller=None)[source]

Load Preprocessing class.

Parameters:
  • normalize (list) – normalizing tokens, can check all supported normalizing at malaya.preprocessing.get_normalize()
  • annotate (list) – annotate tokens inside tags, e.g. <allcaps></allcaps>; only accepts ['hashtag', 'allcaps', 'elongated', 'repeated', 'emphasis', 'censored']
  • lowercase (bool) –
  • fix_unidecode (bool) –
  • expand_hashtags (bool) – expand hashtags using Viterbi algorithm, #mondayblues == monday blues
  • expand_english_contractions (bool) – expand english contractions
  • translate_english_to_bm (bool) – translate english words to bahasa malaysia words
  • remove_postfix (bool) – remove postfix from a word, faster way to get root word
  • speller (object) – spelling correction object, need to have a method correct
  • validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:

_Preprocessing

Return type:

malaya.preprocessing._Preprocessing class

malaya.preprocessing.segmenter(max_split_length=20, validate=True)[source]

Load Segmenter class.

Parameters:
  • max_split_length (int, (default=20)) – max length of words in a sentence to segment
  • validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:

_Segmenter

Return type:

malaya.preprocessing._Segmenter class

malaya.relevancy

malaya.relevancy.available_deep_model()[source]

List available deep learning relevancy analysis models.

malaya.relevancy.available_bert_model()[source]

List available bert relevancy analysis models.

malaya.relevancy.deep_model(model='self-attention', validate=True)[source]

Load deep learning relevancy analysis model.

Parameters:
  • model (str, optional (default='self-attention')) –

    Model architecture supported. Allowed values:

    • 'self-attention' - fastText-like architecture: embedding and logits layers only, with self-attention.
    • 'dilated-cnn' - Stack dilated CNN with self attention.
  • validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:

SOFTMAX

Return type:

malaya._models._tensorflow_model.SOFTMAX class

malaya.relevancy.bert(model='base', validate=True)[source]

Load BERT relevancy model.

Parameters:
  • model (str, optional (default='base')) –

    Model architecture supported. Allowed values:

    • 'multilanguage' - bert multilanguage released by Google, trained on relevancy analysis.
    • 'base' - base bert-bahasa released by Malaya, trained on relevancy analysis.
  • validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:

BERT

Return type:

malaya._models._bert_model.MULTICLASS_BERT class

malaya.sentiment

malaya.sentiment.available_deep_model()[source]

List available deep learning sentiment analysis models.

malaya.sentiment.available_bert_model()[source]

List available bert sentiment analysis models.

malaya.sentiment.deep_model(model='luong', validate=True)[source]

Load deep learning sentiment analysis model.

Parameters:
  • model (str, optional (default='luong')) –

    Model architecture supported. Allowed values:

    • 'self-attention' - fastText-like architecture: embedding and logits layers only, with self-attention.
    • 'bahdanau' - LSTM with bahdanau attention architecture.
    • 'luong' - LSTM with luong attention architecture.
  • validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:

SOFTMAX

Return type:

malaya._models._tensorflow_model.SOFTMAX class

malaya.sentiment.multinomial(validate=True)[source]

Load multinomial sentiment model.

Parameters:validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:BAYES
Return type:malaya._models._sklearn_model.BAYES class
malaya.sentiment.xgb(validate=True)[source]

Load XGB sentiment model.

Parameters:validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:XGB
Return type:malaya._models._sklearn_model.XGB class
malaya.sentiment.bert(model='base', validate=True)[source]

Load BERT sentiment model.

Parameters:
  • model (str, optional (default='base')) –

    Model architecture supported. Allowed values:

    • 'multilanguage' - bert multilanguage released by Google, trained on sentiment analysis.
    • 'base' - base bert-bahasa released by Malaya, trained on sentiment analysis.
    • 'small' - small bert-bahasa released by Malaya, trained on sentiment analysis.
  • validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:

BERT

Return type:

malaya._models._bert_model.BINARY_BERT class

malaya.spell

malaya.spell.probability(validate=True)[source]

Train a Probability Spell Corrector.

Parameters:validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:_SpellCorrector
Return type:malaya.spell._SpellCorrector class
malaya.spell.symspell(validate=True, max_edit_distance_dictionary=2, prefix_length=7, term_index=0, count_index=1, top_k=10)[source]

Train a symspell Spell Corrector.

Parameters:validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:_SpellCorrector
Return type:malaya.spell._SymspellCorrector class
class malaya.spell._SpellCorrector[source]

The SpellCorrector extends the functionality of Peter Norvig's spell corrector (http://norvig.com/spell-correct.html) and improves it using algorithms from "Normalization of noisy texts in Malaysian online reviews" (https://www.researchgate.net/publication/287050449_Normalization_of_noisy_texts_in_Malaysian_online_reviews), with added custom vowel augmentation.

P(word)[source]

Probability of word.

static edit_step(word)[source]

All edits that are one edit away from word.

edits2(word)[source]

All edits that are two edits away from word.

known(words)[source]

The subset of words that appear in the dictionary of WORDS.

edit_candidates(word)[source]

Generate possible spelling corrections for word.

correct(word, **kwargs)[source]

Most probable spelling correction for word.

correct_text(text)[source]

Correct all the words within a text, returning the corrected text.

correct_match(match)[source]

Spell-correct word in match, and preserve proper upper/lower/title case.

correct_word(word)[source]

Spell-correct a single word, preserving proper upper/lower/title case.

static case_of(text)[source]

Return the case-function appropriate for text: upper, lower, title, or just str.
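The core of edit_step is Norvig's classic one-edit-away generator, which the class above builds on. A self-contained sketch of that recipe (the real corrector adds custom vowel augmentation on top):

```python
def edit_step(word):
    """All edits one edit away from `word`: deletes, transposes,
    replaces and inserts over the lowercase alphabet, following
    Norvig's spell-correct recipe."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)
```

edits2 is then simply every edit_step of every edit_step result, and known() filters candidates against the WORDS dictionary.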

malaya.stack

malaya.stack.voting_stack(models, text)[source]

Stacking for POS, Entities and Dependency models.

Parameters:
  • models (list) – list of models.
  • text (str) – string to predict.
Returns:

result

Return type:

list
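Voting stacks take the majority tag per token across several taggers. A sketch of that idea with toy callables standing in for real POS/entity models (real Malaya models expose predict instead of being plain callables):

```python
from collections import Counter

def voting_stack(models, text):
    """Majority-vote the label for each token position across the
    outputs of several taggers."""
    results = [model(text) for model in models]
    output = []
    for position in zip(*results):
        token = position[0][0]
        # Most common label across models wins.
        label = Counter(tag for _, tag in position).most_common(1)[0][0]
        output.append((token, label))
    return output

# Toy taggers that disagree on one token.
m1 = lambda t: [('saya', 'PRON'), ('makan', 'VERB')]
m2 = lambda t: [('saya', 'PRON'), ('makan', 'NOUN')]
m3 = lambda t: [('saya', 'PRON'), ('makan', 'VERB')]
```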

malaya.stack.predict_stack(models, strings, mode='gmean')[source]

Stacking for predictive models.

Parameters:
  • models (list) – list of models.
  • strings (str or list of str) – strings to predict.
  • mode (str, optional (default='gmean')) –

    Aggregation modes supported. Allowed values:

    • 'gmean' - geometric mean.
    • 'hmean' - harmonic mean.
    • 'mean' - mean.
    • 'min' - min.
    • 'max' - max.
    • 'median' - Harrell-Davis median.
Returns:

result

Return type:

dict
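Each mode is a different way of combining per-model probabilities for a label. A sketch of the aggregation step in plain Python (the Harrell-Davis median is approximated here by the ordinary median for brevity):

```python
import math
from statistics import harmonic_mean, mean, median

def combine(probs, mode='gmean'):
    """Combine one label's probabilities from several models,
    matching the modes listed for predict_stack."""
    if mode == 'gmean':
        return math.prod(probs) ** (1 / len(probs))
    if mode == 'hmean':
        return harmonic_mean(probs)
    if mode == 'mean':
        return mean(probs)
    if mode == 'min':
        return min(probs)
    if mode == 'max':
        return max(probs)
    if mode == 'median':
        return median(probs)
    raise ValueError(mode)
```

The geometric mean (default) penalizes disagreement: one model assigning a near-zero probability drags the combined score down sharply.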

malaya.stem

malaya.stem.naive(word)[source]

Stem a string using startswith and endswith.

Parameters:word (str) –
Returns:string
Return type:stemmed string
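A sketch of startswith/endswith affix stripping for Malay, with an illustrative (not Malaya's actual) affix list and length guards so short roots survive:

```python
# Common Malay affixes; illustrative only, the real naive stemmer's
# affix list may differ.
PREFIXES = ('ber', 'ter', 'me', 'di', 'pe')
SUFFIXES = ('kan', 'lah', 'nya', 'an')

def naive(word):
    """Stem a word by stripping at most one known prefix and one
    known suffix, keeping a minimum root length."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix) + 3:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 3:
            word = word[:-len(suffix)]
            break
    return word
```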
malaya.stem.available_deep_model()[source]

List available deep learning stemming models.

malaya.stem.sastrawi(string)[source]

Stem a string using Sastrawi.

Parameters:string (str) –
Returns:string
Return type:stemmed string.
malaya.stem.deep_model(model='bahdanau', validate=True)[source]

Load seq2seq stemmer deep learning model.

Parameters:
  • model (str, optional (default='bahdanau')) – model architecture supported; check malaya.stem.available_deep_model() for the list.
  • validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:DEEP_STEMMER
Return type:malaya.stemmer._DEEP_STEMMER class
class malaya.stem._DEEP_STEMMER[source]
stem(string)[source]

Stem a string.

Parameters:string (str) –
Returns:string
Return type:stemmed string

malaya.subjective

malaya.subjective.available_deep_model()[source]

List available deep learning subjectivity analysis models.

malaya.subjective.available_bert_model()[source]

List available bert subjectivity analysis models.

malaya.subjective.deep_model(model='luong', validate=True)[source]

Load deep learning subjectivity analysis model.

Parameters:
  • model (str, optional (default='luong')) –

    Model architecture supported. Allowed values:

    • 'self-attention' - fastText-like architecture: embedding and logits layers only, with self-attention.
    • 'bahdanau' - LSTM with bahdanau attention architecture.
    • 'luong' - LSTM with luong attention architecture.
  • validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:

SOFTMAX

Return type:

malaya._models._tensorflow_model.SOFTMAX class

malaya.subjective.multinomial(validate=True)[source]

Load multinomial subjectivity model.

Parameters:validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:BAYES
Return type:malaya._models._sklearn_model.BAYES class
malaya.subjective.xgb(validate=True)[source]

Load XGB subjectivity model.

Parameters:validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:XGB
Return type:malaya._models._sklearn_model.XGB class
malaya.subjective.bert(model='base', validate=True)[source]

Load BERT subjectivity model.

Parameters:
  • model (str, optional (default='base')) –

    Model architecture supported. Allowed values:

    • 'multilanguage' - bert multilanguage released by Google, trained on subjectivity analysis.
    • 'base' - base bert-bahasa released by Malaya, trained on subjectivity analysis.
    • 'small' - small bert-bahasa released by Malaya, trained on subjectivity analysis.
  • validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:

BERT

Return type:

malaya._models._tensorflow_model.BINARY_BERT class

malaya.summarize

malaya.summarize.available_skipthought()[source]

List available deep skip-thought models.

malaya.summarize.deep_skipthought(model='lstm')[source]

Load deep learning skipthought model.

Parameters:model (str, optional (default='lstm')) –

Model architecture supported. Allowed values:

  • 'lstm' - LSTM skip-thought deep learning model trained on a news dataset; training on a Wikipedia dataset is planned.
  • 'residual-network' - residual network with Bahdanau Attention skip-thought deep learning model trained on wikipedia dataset.
Returns:_DEEP_SKIPTHOUGHT
Return type:malaya.summarize._DEEP_SKIPTHOUGHT class
malaya.summarize.encoder(vectorizer)[source]

Encoder interface for text similarity.

Parameters:vectorizer (object) – encoder interface object, BERT, skip-thought, XLNET.
Returns:_DOC2VEC_SIMILARITY
Return type:malaya.similarity._DOC2VEC_SIMILARITY
malaya.summarize.lda(corpus, top_k=3, important_words=10, max_df=0.95, min_df=2, ngram=(1, 3), vectorizer='bow', **kwargs)[source]

summarize a list of strings using LDA, scoring using TextRank.

Parameters:
  • corpus (list) –
  • top_k (int, (default=3)) – number of summarized strings.
  • important_words (int, (default=10)) – number of important words.
  • max_df (float, (default=0.95)) – maximum of a word selected based on document frequency.
  • min_df (int, (default=2)) – minimum of a word selected based on document frequency.
  • ngram (tuple, (default=(1,3))) – n-grams size to train a corpus.
  • vectorizer (str, (default='bow')) –

    vectorizer technique. Allowed values:

    • 'bow' - Bag of Word.
    • 'tfidf' - Term frequency inverse Document Frequency.
    • 'skip-gram' - Bag of Word with skipping certain n-grams.
Returns:

dict

Return type:

result
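The TextRank scoring used by `lda` and `lsa` can be pictured as PageRank power iteration over a sentence-similarity graph, then taking the `top_k` highest-scoring sentences. A minimal sketch with a made-up similarity matrix (illustrative only, not Malaya's internal implementation):

```python
import numpy as np

def textrank_scores(sim, d=0.85, iters=50):
    """Power-iteration PageRank over a sentence-similarity matrix."""
    n = sim.shape[0]
    # Normalize rows so each sentence distributes its score to its neighbours.
    norm = sim.sum(axis=1, keepdims=True)
    norm[norm == 0] = 1.0
    M = sim / norm
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = (1 - d) / n + d * M.T.dot(scores)
    return scores

# Toy similarity matrix for 3 "sentences": sentence 0 and 1 are similar.
sim = np.array([[0.0, 0.9, 0.1],
                [0.9, 0.0, 0.1],
                [0.1, 0.1, 0.0]])
scores = textrank_scores(sim)
top_k = scores.argsort()[::-1][:2]  # indices of the top 2 sentences
```

In Malaya the similarity matrix would come from the vectorized sentences; the toy matrix above just makes the ranking visible.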

malaya.summarize.lsa(corpus, top_k=3, important_words=10, max_df=0.95, min_df=2, ngram=(1, 3), vectorizer='bow', **kwargs)[source]

summarize a list of strings using LSA, scoring using TextRank.

Parameters:
  • corpus (list) –
  • top_k (int, (default=3)) – number of summarized strings.
  • important_words (int, (default=10)) – number of important words.
  • max_df (float, (default=0.95)) – maximum of a word selected based on document frequency.
  • min_df (int, (default=2)) – minimum of a word selected based on document frequency.
  • ngram (tuple, (default=(1,3))) – n-grams size to train a corpus.
  • vectorizer (str, (default='bow')) –

    vectorizer technique. Allowed values:

    • 'bow' - Bag of Word.
    • 'tfidf' - Term frequency inverse Document Frequency.
    • 'skip-gram' - Bag of Word with skipping certain n-grams.
Returns:

dict

Return type:

result

malaya.summarize.doc2vec(vectorizer, corpus, top_k=3, aggregation='mean', soft=True)[source]

summarize a list of strings using doc2vec, scoring using TextRank.

Parameters:
  • vectorizer (object) – fast-text or word2vec interface object.
  • corpus (list) –
  • top_k (int, (default=3)) – number of summarized strings.
  • aggregation (str, optional (default='mean')) –

    Aggregation supported. Allowed values:

    • 'mean' - mean.
    • 'min' - min.
    • 'max' - max.
    • 'sum' - sum.
    • 'sqrt' - square root.
  • soft (bool, optional (default=True)) – if True, a word not inside the vectorizer will be replaced with the nearest word; otherwise it is skipped.
Returns:

dictionary

Return type:

result
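The aggregation options can be read as reductions over a `(num_words, dim)` matrix of word vectors. A minimal numpy sketch; the `'sqrt'` reading here (a signed square root of the summed vector) is an assumption:

```python
import numpy as np

def aggregate(word_vectors, method='mean'):
    """Reduce a (num_words, dim) matrix of word vectors to one document vector."""
    v = np.asarray(word_vectors, dtype=float)
    if method == 'mean':
        return v.mean(axis=0)
    if method == 'min':
        return v.min(axis=0)
    if method == 'max':
        return v.max(axis=0)
    if method == 'sum':
        return v.sum(axis=0)
    if method == 'sqrt':
        # Assumed interpretation: signed square root of the summed vector.
        s = v.sum(axis=0)
        return np.sign(s) * np.sqrt(np.abs(s))
    raise ValueError('unknown aggregation: %s' % method)

vectors = [[1.0, 4.0], [3.0, 0.0]]
doc_mean = aggregate(vectors, 'mean')  # [2.0, 2.0]
```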

class malaya.summarize._DEEP_SUMMARIZER[source]
summarize(corpus, top_k=3, important_words=3, **kwargs)[source]

Summarize list of strings / corpus

Parameters:
  • corpus (str, list) –
  • top_k (int, (default=3)) – number of summarized strings.
  • important_words (int, (default=3)) – number of important words.
Returns:

string

Return type:

summarized string

class malaya.summarize._DEEP_SKIPTHOUGHT[source]
vectorize(strings)[source]

Vectorize string inputs using the skip-thought encoder.

Parameters:strings (str / list of str) –
Returns:array
Return type:vectorized strings

malaya.similarity

malaya.similarity.doc2vec(vectorizer)[source]

Doc2vec interface for text similarity.

Parameters:vectorizer (object) – word vector interface object, fast-text, word2vec, elmo.
Returns:_DOC2VEC_SIMILARITY
Return type:malaya.similarity._DOC2VEC_SIMILARITY
malaya.similarity.encoder(vectorizer)[source]

Encoder interface for text similarity.

Parameters:vectorizer (object) – encoder interface object, BERT, skip-thought, XLNET.
Returns:_DOC2VEC_SIMILARITY
Return type:malaya.similarity._DOC2VEC_SIMILARITY
malaya.similarity.available_bert_model()[source]

List available bert models.

malaya.similarity.bert(model='base', validate=True)[source]

Load BERT similarity model.

Parameters:
  • model (str, optional (default='base')) –

    Model architecture supported. Allowed values:

    • 'multilanguage' - bert multilanguage released by Google, trained on cross-entropy similarity.
    • 'base' - base bert-bahasa released by Malaya, trained on cross-entropy similarity.
    • 'small' - small bert-bahasa released by Malaya, trained on cross-entropy similarity.
  • validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:

SIMILARITY_BERT

Return type:

malaya._models._tensorflow_model.SIAMESE_BERT class

class malaya.similarity._VECTORIZER_SIMILARITY[source]
predict(left_string, right_string, similarity='cosine')[source]

calculate similarity for two different texts.

Parameters:
  • left_string (str) –
  • right_string (str) –
  • similarity (str, optional (default='cosine')) –

    similarity supported. Allowed values:

    • 'cosine' - cosine similarity.
    • 'euclidean' - euclidean similarity.
    • 'manhattan' - manhattan similarity.
Returns:

float

Return type:

float

predict_batch(left_strings, right_strings, similarity='cosine')[source]

calculate similarity for two different batch of texts.

Parameters:
  • left_strings (list of str) –
  • right_strings (list of str) –
  • similarity (str, optional (default='cosine')) –

    similarity supported. Allowed values:

    • 'cosine' - cosine similarity.
    • 'euclidean' - euclidean similarity.
    • 'manhattan' - manhattan similarity.
Returns:

list

Return type:

list of float

tree_plot(strings, similarity='cosine', visualize=True, figsize=(7, 7), annotate=True)[source]

plot a tree plot based on output from vectorizer similarity.

Parameters:
  • strings (list of str) – list of strings.
  • similarity (str, optional (default='cosine')) –

    similarity supported. Allowed values:

    • 'cosine' - cosine similarity.
    • 'euclidean' - euclidean similarity.
    • 'manhattan' - manhattan similarity.
  • visualize (bool) – if True, it will render plt.show(); otherwise it returns the data.
  • figsize (tuple, (default=(7, 7))) – figure size for plot.
Returns:

list_dictionaries

Return type:

list of results
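The similarity options accepted by predict, predict_batch and tree_plot can be sketched over plain vectors. Cosine is standard; the mapping of euclidean and manhattan distances into a 0..1 similarity score below is an assumption, not Malaya's exact scaling:

```python
import numpy as np

def similarity(a, b, method='cosine'):
    """Score two vectors by one of the documented similarity options."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    if method == 'cosine':
        return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
    if method == 'euclidean':
        # Assumed mapping of distance into a 0..1 similarity.
        return 1.0 / (1.0 + np.linalg.norm(a - b))
    if method == 'manhattan':
        return 1.0 / (1.0 + np.abs(a - b).sum())
    raise ValueError('unknown similarity: %s' % method)

left, right = [1.0, 0.0], [1.0, 0.0]
score = similarity(left, right, 'cosine')  # 1.0 for identical vectors
```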

class malaya.similarity._DOC2VEC_SIMILARITY[source]
predict(left_string, right_string, aggregation='mean', similarity='cosine', soft=True)[source]

calculate similarity for two different texts.

Parameters:
  • left_string (str) –
  • right_string (str) –
  • aggregation (str, optional (default='mean')) –

    Aggregation supported. Allowed values:

    • 'mean' - mean.
    • 'min' - min.
    • 'max' - max.
    • 'sum' - sum.
    • 'sqrt' - square root.
  • similarity (str, optional (default='cosine')) –

    similarity supported. Allowed values:

    • 'cosine' - cosine similarity.
    • 'euclidean' - euclidean similarity.
    • 'manhattan' - manhattan similarity.
  • soft (bool, optional (default=True)) – if True, a word not inside the word vector will be replaced with the nearest word; otherwise it is skipped.
Returns:

float

Return type:

float

predict_batch(left_strings, right_strings, aggregation='mean', similarity='cosine', soft=True)[source]

calculate similarity for two different batch of texts.

Parameters:
  • left_strings (list of str) –
  • right_strings (list of str) –
  • aggregation (str, optional (default='mean')) –

    Aggregation supported. Allowed values:

    • 'mean' - mean.
    • 'min' - min.
    • 'max' - max.
    • 'sum' - sum.
    • 'sqrt' - square root.
  • similarity (str, optional (default='cosine')) –

    similarity supported. Allowed values:

    • 'cosine' - cosine similarity.
    • 'euclidean' - euclidean similarity.
    • 'manhattan' - manhattan similarity.
  • soft (bool, optional (default=True)) – if True, a word not inside the word vector will be replaced with the nearest word; otherwise it is skipped.
Returns:

list

Return type:

list of float

tree_plot(strings, aggregation='mean', similarity='cosine', soft=True, visualize=True, figsize=(7, 7), annotate=True)[source]

plot a tree plot based on output from doc2vec similarity.

Parameters:
  • strings (list of str) – list of strings
  • aggregation (str, optional (default='mean')) –

    Aggregation supported. Allowed values:

    • 'mean' - mean.
    • 'min' - min.
    • 'max' - max.
    • 'sum' - sum.
    • 'sqrt' - square root.
  • similarity (str, optional (default='cosine')) –

    similarity supported. Allowed values:

    • 'cosine' - cosine similarity.
    • 'euclidean' - euclidean similarity.
    • 'manhattan' - manhattan similarity.
  • soft (bool, optional (default=True)) – if True, a word not inside the word vector will be replaced with the nearest word; otherwise it is skipped.
  • visualize (bool) – if True, it will render plt.show(); otherwise it returns the data.
  • figsize (tuple, (default=(7, 7))) – figure size for plot.
Returns:

list_dictionaries

Return type:

list of results
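The soft=True behaviour, replacing an out-of-vocabulary word with its nearest in-vocabulary neighbour, can be sketched with a string matcher; difflib below stands in for whatever matcher Malaya actually uses, and the vocabulary is a toy:

```python
import difflib

vocabulary = {'makan', 'minum', 'tidur'}  # toy in-vocabulary words

def resolve(word, soft=True):
    """Return the word itself, its closest in-vocabulary match, or None."""
    if word in vocabulary:
        return word
    if not soft:
        return None  # soft=False: the word is simply skipped
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.0)
    return matches[0] if matches else None

replacement = resolve('makanan')       # nearest in-vocabulary word
skipped = resolve('xyz', soft=False)   # None: skipped
```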

malaya.topic_model

malaya.topic_model.lda(corpus, n_topics=10, max_df=0.95, min_df=2, ngram=(1, 3), stemming=<function sastrawi>, vectorizer='bow', cleaning=<function simple_textcleaning>, stop_words=None, **kwargs)[source]

Train a LDA model to do topic modelling based on corpus / list of strings given.

Parameters:
  • corpus (list) –
  • n_topics (int, (default=10)) – size of decomposition column.
  • max_df (float, (default=0.95)) – maximum of a word selected based on document frequency.
  • min_df (int, (default=2)) – minimum of a word selected based on document frequency.
  • ngram (tuple, (default=(1,3))) – n-grams size to train a corpus.
  • stemming (function, (default=sastrawi)) – function to stem the corpus.
  • vectorizer (str, (default='bow')) –

    vectorizer technique. Allowed values:

    • 'bow' - Bag of Word.
    • 'tfidf' - Term frequency inverse Document Frequency.
    • 'skip-gram' - Bag of Word with skipping certain n-grams.
  • cleaning (function, (default=simple_textcleaning)) – function to clean the corpus.
  • stop_words (list, (default=None)) – list of stop words to remove. If None, default is malaya.texts._text_functions.STOPWORDS
Returns:

_TOPIC

Return type:

malaya.topic_model._TOPIC class
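The 'bow' vectorizer with min_df / max_df filtering can be sketched in plain Python; this mirrors the parameters documented above rather than Malaya's internal code, and the corpus is a toy:

```python
from collections import Counter

def bow_matrix(corpus, min_df=2, max_df=0.95):
    """Bag-of-words counts, keeping terms within the document-frequency bounds."""
    tokenized = [doc.lower().split() for doc in corpus]
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    n_docs = len(tokenized)
    vocab = sorted(t for t, c in df.items()
                   if c >= min_df and c / n_docs <= max_df)
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[t] for t in vocab])
    return vocab, rows

corpus = ['kerajaan malaysia', 'kerajaan baru', 'ekonomi malaysia', 'ekonomi baru']
vocab, rows = bow_matrix(corpus, min_df=2, max_df=0.95)
```

The resulting matrix is what the LDA / NMF / LSA decompositions above factorize into topics.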

malaya.topic_model.nmf(corpus, n_topics=10, max_df=0.95, min_df=2, ngram=(1, 3), stemming=<function sastrawi>, vectorizer='bow', cleaning=<function simple_textcleaning>, stop_words=None, **kwargs)[source]

Train a NMF model to do topic modelling based on corpus / list of strings given.

Parameters:
  • corpus (list) –
  • n_topics (int, (default=10)) – size of decomposition column.
  • max_df (float, (default=0.95)) – maximum of a word selected based on document frequency.
  • min_df (int, (default=2)) – minimum of a word selected based on document frequency.
  • ngram (tuple, (default=(1,3))) – n-grams size to train a corpus.
  • stemming (function, (default=sastrawi)) – function to stem the corpus.
  • vectorizer (str, (default='bow')) –

    vectorizer technique. Allowed values:

    • 'bow' - Bag of Word.
    • 'tfidf' - Term frequency inverse Document Frequency.
    • 'skip-gram' - Bag of Word with skipping certain n-grams.
  • cleaning (function, (default=simple_textcleaning)) – function to clean the corpus.
  • stop_words (list, (default=None)) – list of stop words to remove. If None, default is malaya.texts._text_functions.STOPWORDS
Returns:

_TOPIC

Return type:

malaya.topic_model._TOPIC class

malaya.topic_model.lsa(corpus, n_topics, max_df=0.95, min_df=2, ngram=(1, 3), vectorizer='bow', stemming=<function sastrawi>, cleaning=<function simple_textcleaning>, stop_words=None, **kwargs)[source]

Train a LSA model to do topic modelling based on corpus / list of strings given.

Parameters:
  • corpus (list) –
  • n_topics (int) – size of decomposition column.
  • max_df (float, (default=0.95)) – maximum of a word selected based on document frequency.
  • min_df (int, (default=2)) – minimum of a word selected based on document frequency.
  • ngram (tuple, (default=(1,3))) – n-grams size to train a corpus.
  • vectorizer (str, (default='bow')) –

    vectorizer technique. Allowed values:

    • 'bow' - Bag of Word.
    • 'tfidf' - Term frequency inverse Document Frequency.
    • 'skip-gram' - Bag of Word with skipping certain n-grams.
  • stemming (function, (default=sastrawi)) – function to stem the corpus.
  • cleaning (function, (default=simple_textcleaning)) – function to clean the corpus.
  • stop_words (list, (default=None)) – list of stop words to remove. If None, default is malaya.texts._text_functions.STOPWORDS
Returns:

_TOPIC

Return type:

malaya.topic_model._TOPIC class

malaya.topic_model.lda2vec(corpus, n_topics, stemming=<function sastrawi>, max_df=0.95, min_df=2, ngram=(1, 3), cleaning=<function simple_textcleaning>, vectorizer='bow', stop_words=None, window_size=2, embedding_size=128, epoch=10, switch_loss=3, skip=5, **kwargs)[source]

Train a LDA2Vec model to do topic modelling based on corpus / list of strings given.

Parameters:
  • corpus (list) –
  • n_topics (int) – size of decomposition column.
  • stemming (function, (default=sastrawi)) – function to stem the corpus.
  • max_df (float, (default=0.95)) – maximum of a word selected based on document frequency.
  • min_df (int, (default=2)) – minimum of a word selected based on document frequency.
  • ngram (tuple, (default=(1,3))) – n-grams size to train a corpus.
  • cleaning (function, (default=simple_textcleaning)) – function to clean the corpus.
  • stop_words (list, (default=None)) – list of stop words to remove. If None, default is malaya.texts._text_functions.STOPWORDS
  • embedding_size (int, (default=128)) – embedding size of lda2vec tensors.
  • epoch (int, (default=10)) – number of training iterations to run.
  • switch_loss (int, (default=3)) – baseline to switch from document based loss to document + word based loss.
  • vectorizer (str, (default='bow')) –

    vectorizer technique. Allowed values:

    • 'bow' - Bag of Word.
    • 'tfidf' - Term frequency inverse Document Frequency.
    • 'skip-gram' - Bag of Word with skipping certain n-grams.
  • skip (int, (default=5)) – skip value if vectorizer = ‘skip-gram’
Returns:

_DEEP_TOPIC

Return type:

malaya.topic_model._DEEP_TOPIC class

malaya.topic_model.attention(corpus, n_topics, vectorizer, stemming=<function sastrawi>, cleaning=<function simple_textcleaning>, stop_words=None, ngram=(1, 3))[source]

Use attention from vectorizer model to do topic modelling based on corpus / list of strings given.

Parameters:
  • corpus (list) –
  • n_topics (int) – size of decomposition column.
  • vectorizer (object) –
  • stemming (function, (default=sastrawi)) – function to stem the corpus.
  • cleaning (function, (default=simple_textcleaning)) – function to clean the corpus.
  • stop_words (list, (default=None)) – list of stop words to remove. If None, default is malaya.texts._text_functions.STOPWORDS
  • ngram (tuple, (default=(1,3))) – n-grams size to train a corpus.
Returns:

_ATTENTION_TOPIC

Return type:

malaya.topic_model._ATTENTION_TOPIC class

class malaya.topic_model._TOPIC[source]
visualize_topics(notebook_mode=False, mds='pcoa')[source]

Print important topics based on decomposition.

Parameters:mds (str, optional (default='pcoa')) –

2D Decomposition. Allowed values:

  • 'pcoa' - Dimension reduction via Jensen-Shannon Divergence & Principal Coordinate Analysis (aka Classical Multidimensional Scaling)
  • 'mmds' - Dimension reduction via Multidimensional scaling
  • 'tsne' - Dimension reduction via t-distributed stochastic neighbor embedding
top_topics(len_topic, top_n=10, return_df=True)[source]

Print important topics based on decomposition.

Parameters:len_topic (int) –
get_topics(len_topic)[source]

Return important topics based on decomposition.

Parameters:len_topic (int) –
Returns:results
Return type:list of strings
get_sentences(len_sentence, k=0)[source]

Return important sentences related to selected column based on decomposition.

Parameters:
  • len_sentence (int) –
  • k (int, (default=0)) – index of decomposition matrix.
Returns:

results

Return type:

list of strings

class malaya.topic_model._DEEP_TOPIC[source]
visualize_topics(notebook_mode=False, mds='pcoa')[source]

Print important topics based on decomposition.

Parameters:mds (str, optional (default='pcoa')) –

2D Decomposition. Allowed values:

  • 'pcoa' - Dimension reduction via Jensen-Shannon Divergence & Principal Coordinate Analysis (aka Classical Multidimensional Scaling)
  • 'mmds' - Dimension reduction via Multidimensional scaling
  • 'tsne' - Dimension reduction via t-distributed stochastic neighbor embedding
top_topics(len_topic, top_n=10, return_df=True)[source]

Print important topics based on decomposition.

Parameters:len_topic (int) –
get_topics(len_topic)[source]

Return important topics based on decomposition.

Parameters:len_topic (int) –
Returns:results
Return type:list of strings
get_sentences(len_sentence, k=0)[source]

Return important sentences related to selected column based on decomposition.

Parameters:
  • len_sentence (int) –
  • k (int, (default=0)) – index of decomposition matrix.
Returns:

results

Return type:

list of strings

malaya.toxic

malaya.toxic.available_deep_model()[source]

List available deep learning toxicity analysis models.

malaya.toxic.available_bert_model()[source]

List available bert toxicity analysis models.

malaya.toxic.multinomial(validate=True)[source]

Load multinomial toxic model.

Parameters:validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:TOXIC
Return type:malaya._models._sklearn_model.TOXIC class
malaya.toxic.logistic(validate=True)[source]

Load logistic toxic model.

Parameters:validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:TOXIC
Return type:malaya._models._sklearn_model.TOXIC class
malaya.toxic.deep_model(model='luong', validate=True)[source]

Load deep learning toxicity analysis model.

Parameters:
  • model (str, optional (default='luong')) –

    Model architecture supported. Allowed values:

    • 'self-attention' - fast-text architecture, embedding and logits layers only, with self-attention.
    • 'bahdanau' - LSTM with bahdanau attention architecture.
    • 'luong' - LSTM with luong attention architecture.
  • validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:

SIGMOID

Return type:

malaya._models._tensorflow_model.SIGMOID class

malaya.toxic.bert(model='base', validate=True)[source]

Load BERT toxicity model.

Parameters:
  • model (str, optional (default='base')) –

    Model architecture supported. Allowed values:

    • 'multilanguage' - bert multilanguage released by Google, trained on toxicity analysis.
    • 'base' - base bert-bahasa released by Malaya, trained on toxicity analysis.
    • 'small' - small bert-bahasa released by Malaya, trained on toxicity analysis.
  • validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:

SIGMOID_BERT

Return type:

malaya._models._tensorflow_model.SIGMOID_BERT class
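SIGMOID / SIGMOID_BERT models are multi-label: each toxicity class gets an independent probability, unlike a softmax distribution that must sum to 1. A numpy sketch of the difference (the logits and class names are illustrative, not real model output):

```python
import numpy as np

logits = np.array([2.0, -1.0, 0.5])  # illustrative per-class logits

# Multi-label (sigmoid): probabilities are independent and need not sum to 1.
sigmoid = 1.0 / (1.0 + np.exp(-logits))

# Single-label (softmax): probabilities form one distribution summing to 1.
softmax = np.exp(logits) / np.exp(logits).sum()

labels = ['toxic', 'severe_toxic', 'insult']  # illustrative class names
multi_label = {label: p for label, p in zip(labels, sigmoid)}
```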

malaya.wordvector

malaya.wordvector.load_wiki()[source]

Return malaya pretrained wikipedia word2vec size 256.

Returns:dictionary
Return type:dictionary of dictionary, reverse dictionary and vectors
malaya.wordvector.load_news(size=256)[source]

Return malaya pretrained news word2vec.

Parameters:size (int, (default=256)) –
Returns:dictionary
Return type:dictionary of dictionary, reverse dictionary and vectors
malaya.wordvector.load(embed_matrix, dictionary)[source]

Return malaya.wordvector._wordvector object.

Parameters:
  • embed_matrix (numpy array) –
  • dictionary (dictionary) –
Returns:

_wordvector

Return type:

malaya.wordvector._wordvector object
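The dictionary / reverse-dictionary / embed-matrix triple returned by load_wiki and load_news, and consumed by load, can be pictured with toy data (the words and vectors below are made up):

```python
import numpy as np

# Toy stand-ins for the structures load_wiki() / load_news() return.
dictionary = {'saya': 0, 'makan': 1, 'nasi': 2}
reverse_dictionary = {i: w for w, i in dictionary.items()}
embed_matrix = np.random.RandomState(0).rand(len(dictionary), 4)

def get_vector_by_name(word):
    """Row lookup, analogous to _wordvector.get_vector_by_name."""
    return embed_matrix[dictionary[word]]

vec = get_vector_by_name('makan')
```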

class malaya.wordvector._wordvector[source]
get_vector_by_name(word)[source]

get vector based on string.

Parameters:word (str) –
Returns:vector
Return type:numpy
tree_plot(labels, figsize=(7, 7), annotate=True)[source]

plot a tree plot based on output from calculator / n_closest / analogy.

Parameters:
  • labels (list) – output from calculator / n_closest / analogy.
  • figsize (tuple, (default=(7, 7))) – figure size for plot.
Returns:

results

Return type:

[embed, labelled]

scatter_plot(labels, centre=None, figsize=(7, 7), plus_minus=25, handoff=5e-05)[source]

plot a scatter plot based on output from calculator / n_closest / analogy.

Parameters:
  • labels (list) – output from calculator / n_closest / analogy
  • centre (str, (default=None)) – centre label, if a str, it will annotate in a red color.
  • figsize (tuple, (default=(7, 7))) – figure size for plot.
Returns:

list_dictionaries

Return type:

list of results

batch_calculator(equations, num_closest=5, return_similarity=False)[source]

batch calculator parser for word2vec using tensorflow.

Parameters:
  • equations (list of str) – Eg, ‘[(mahathir + najib) - rosmah]’
  • num_closest (int, (default=5)) – number of words closest to the result.
Returns:

word_list

Return type:

list of nearest words

calculator(equation, num_closest=5, metric='cosine', return_similarity=True)[source]

calculator parser for word2vec.

Parameters:
  • equation (str) – Eg, ‘(mahathir + najib) - rosmah’
  • num_closest (int, (default=5)) – number of words closest to the result.
  • metric (str, (default='cosine')) – vector distance algorithm.
  • return_similarity (bool, (default=True)) – if True, will return a value between 0 and 1 representing the distance.
Returns:

word_list

Return type:

list of nearest words
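The calculator and n_closest style of lookup can be sketched over toy vectors: evaluate the vector arithmetic, then rank the vocabulary by cosine similarity. Everything below (words, vectors) is illustrative:

```python
import numpy as np

words = ['raja', 'lelaki', 'perempuan', 'permaisuri']
embed = np.array([[1.0, 1.0],
                  [1.0, 0.0],
                  [0.0, 0.0],
                  [0.0, 1.0]])
index = {w: i for i, w in enumerate(words)}

def nearest(vector, exclude=(), num_closest=1):
    """Rank vocabulary words by cosine similarity to `vector`."""
    sims = embed.dot(vector) / (
        np.linalg.norm(embed, axis=1) * np.linalg.norm(vector) + 1e-9)
    order = sims.argsort()[::-1]
    return [words[i] for i in order if words[i] not in exclude][:num_closest]

# (raja - lelaki) + perempuan, the classic analogy pattern vb - va + vc.
v = embed[index['raja']] - embed[index['lelaki']] + embed[index['perempuan']]
result = nearest(v, exclude={'raja', 'lelaki', 'perempuan'})
```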

batch_n_closest(words, num_closest=5, return_similarity=False, soft=True)[source]

find nearest words based on a batch of words using Tensorflow.

Parameters:
  • words (list) – Eg, [‘najib’,’anwar’]
  • num_closest (int, (default=5)) – number of words closest to the result.
  • return_similarity (bool, (default=False)) – if True, will return a value between 0 and 1 representing the distance.
  • soft (bool, (default=True)) – if True, a word not in the dictionary will be replaced with the nearest word by Jaro-Winkler ratio; if False, an exception is thrown when a word is not in the dictionary.
Returns:

word_list

Return type:

list of nearest words

n_closest(word, num_closest=5, metric='cosine', return_similarity=True)[source]

find nearest words based on a word.

Parameters:
  • word (str) – Eg, ‘najib’
  • num_closest (int, (default=5)) – number of words closest to the result.
  • metric (str, (default='cosine')) – vector distance algorithm.
  • return_similarity (bool, (default=True)) – if True, will return a value between 0 and 1 representing the distance.
Returns:

word_list

Return type:

list of nearest words

analogy(a, b, c, num=1, metric='cosine')[source]

analogy calculation, vb - va + vc.

Parameters:
  • a (str) –
  • b (str) –
  • c (str) –
  • num (int, (default=1)) –
  • metric (str, (default='cosine')) – vector distance algorithm.
Returns:

word_list

Return type:

list of nearest words

project_2d(start, end)[source]

project word2vec into 2d dimension.

Parameters:
  • start (int) –
  • end (int) –
Returns:

tsne decomposition

Return type:

numpy

network(word, num_closest=8, depth=4, min_distance=0.5, iteration=300, figsize=(15, 15), node_color='#72bbd0', node_factor=50)[source]

plot a social network based on the word given.

Parameters:
  • word (str) – centre of social network.
  • num_closest (int, (default=8)) – number of words closest to the node.
  • depth (int, (default=4)) – depth of the social network. Deeper networks are more expensive to calculate, O(num_closest ** depth).
  • min_distance (float, (default=0.5)) – minimum distance among nodes. Increase the value to increase the distance among nodes.
  • iteration (int, (default=300)) – number of loops to train the social network to fit min_distance.
  • figsize (tuple, (default=(15, 15))) – figure size for plot.
  • node_color (str, (default='#72bbd0')) – color for nodes.
  • node_factor (int, (default=50)) – size factor for depth nodes. Increasing this value increases node sizes based on depth.
Returns:

g

Return type:

networkx graph object

malaya.xlnet

malaya.xlnet.available_xlnet_model()[source]

List available xlnet models.

malaya.xlnet.xlnet(model='base', pool_mode='last', validate=True)[source]

Load xlnet model.

Parameters:
  • model (str, optional (default='base')) –

    Model architecture supported. Allowed values:

    • 'base' - base xlnet-bahasa released by Malaya.
    • 'small' - small xlnet-bahasa released by Malaya.
  • pool_mode (str, optional (default='last')) –

    Model logits architecture supported. Allowed values:

    • 'last' - last of the sequence.
    • 'first' - first of the sequence.
    • 'mean' - mean of the sequence.
    • 'attn' - attention of the sequence.
  • validate (bool, optional (default=True)) – if True, malaya will check model availability and download if not available.
Returns:

XLNET_MODEL

Return type:

malaya.xlnet._Model class
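The pool_mode options collapse a `(seq_len, dim)` sequence of token vectors into one vector; a numpy sketch of 'last', 'first' and 'mean' ('attn' uses learned weights in the real model, so the version here is only a gesture):

```python
import numpy as np

def pool(sequence, mode='last'):
    """Collapse a (seq_len, dim) array of token vectors into one vector."""
    seq = np.asarray(sequence, dtype=float)
    if mode == 'last':
        return seq[-1]
    if mode == 'first':
        return seq[0]
    if mode == 'mean':
        return seq.mean(axis=0)
    if mode == 'attn':
        # Illustrative only: real attention pooling uses learned weights.
        weights = np.exp(seq.sum(axis=1))
        weights /= weights.sum()
        return weights.dot(seq)
    raise ValueError('unknown pool_mode: %s' % mode)

tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
pooled = pool(tokens, 'mean')  # [1.0, 1.0]
```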

class malaya.xlnet._Model[source]
vectorize(strings)[source]

Vectorize string inputs using xlnet attention.

Parameters:strings (str / list of str) –
Returns:array
Return type:vectorized strings
attention(strings, method='last', **kwargs)[source]

Get attention string inputs from xlnet attention.

Parameters:
  • strings (str / list of str) –
  • method (str, optional (default='last')) –

    Attention layer supported. Allowed values:

    • 'last' - attention from last layer.
    • 'first' - attention from first layer.
    • 'mean' - average attentions from all layers.
Returns:

array

Return type:

attention

malaya._models._tensorflow_model

class malaya._models._tensorflow_model._SPARSE_SOFTMAX_MODEL[source]
class malaya._models._tensorflow_model.DEPENDENCY[source]
print_transitions_tag(top_k=10)[source]

Print important top-k transitions for tagging dependency.

Parameters:top_k (int) –
print_transitions_index(top_k=10)[source]

Print important top-k transitions for indexing dependency.

Parameters:top_k (int) –
print_features(top_k=10)[source]

Print important top-k features.

Parameters:top_k (int) –
predict(string)[source]

Tag a string.

Parameters:string (str) –
Returns:string
Return type:tagged string
class malaya._models._tensorflow_model.TAGGING[source]
print_transitions(top_k=10)[source]

Print important top-k transitions.

Parameters:top_k (int) –
print_features(top_k=10)[source]

Print important top-k features.

Parameters:top_k (int) –
analyze(string)[source]

Analyze a string.

Parameters:string (str) –
Returns:string
Return type:analyzed string
predict(string)[source]

Tag a string.

Parameters:string (str) –
Returns:string
Return type:tagged string
class malaya._models._tensorflow_model.BINARY_SOFTMAX[source]
predict(string, get_proba=False, add_neutral=True)[source]

classify a string.

Parameters:
  • string (str) –
  • get_proba (bool, optional (default=False)) – If True, it will return probability of classes.
  • add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.
Returns:

dictionary

Return type:

results

predict_words(string, visualization=True)[source]

classify words.

Parameters:
  • string (str) –
  • visualization (bool, optional (default=True)) – If True, it will open the visualization dashboard.
Returns:

dictionary

Return type:

results

predict_batch(strings, get_proba=False, add_neutral=True)[source]

classify list of strings.

Parameters:
  • strings (list) –
  • get_proba (bool, optional (default=False)) – If True, it will return probability of classes.
  • add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.
Returns:

list_dictionaries

Return type:

list of results

class malaya._models._tensorflow_model.MULTICLASS_SOFTMAX[source]
predict(string, get_proba=False)[source]

classify a string.

Parameters:
  • string (str) –
  • get_proba (bool, optional (default=False)) – If True, it will return probability of classes.
Returns:

dictionary

Return type:

results

predict_words(string, visualization=True)[source]

classify words.

Parameters:
  • string (str) –
  • visualization (bool, optional (default=True)) – If True, it will open the visualization dashboard.
Returns:

dictionary

Return type:

results

predict_batch(strings, get_proba=False)[source]

classify list of strings.

Parameters:
  • strings (list) –
  • get_proba (bool, optional (default=False)) – If True, it will return probability of classes.
Returns:

list_dictionaries

Return type:

list of results

class malaya._models._tensorflow_model.SIGMOID[source]
predict(string, get_proba=False)[source]

classify a string.

Parameters:
  • string (str) –
  • get_proba (bool, optional (default=False)) – If True, it will return probability of classes.
Returns:dictionary
Return type:results
predict_words(string, visualization=True)[source]

classify words.

Parameters:
  • string (str) –
  • visualization (bool, optional (default=True)) – If True, it will open the visualization dashboard.
Returns:

dictionary

Return type:

results

predict_batch(strings, get_proba=False)[source]

classify list of strings.

Parameters:
  • strings (list) –
  • get_proba (bool, optional (default=False)) – If True, it will return probability of classes.
Returns:list_dictionaries
Return type:list of results

malaya._models._bert_model

class malaya._models._bert_model.BINARY_BERT[source]
predict(string, get_proba=False, add_neutral=True)[source]

classify a string.

Parameters:
  • string (str) –
  • get_proba (bool, optional (default=False)) – If True, it will return probability of classes.
  • add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.
Returns:

dictionary

Return type:

results

predict_batch(strings, get_proba=False, add_neutral=True)[source]

classify list of strings.

Parameters:
  • strings (list) –
  • get_proba (bool, optional (default=False)) – If True, it will return probability of classes.
  • add_neutral (bool, optional (default=True)) – if True, it will add neutral probability.
Returns:

list_dictionaries

Return type:

list of results

predict_words(string, method='last', visualization=True)[source]

classify words.

Parameters:
  • string (str) –
  • method (str, optional (default='last')) –

    Attention layer supported. Allowed values:

    • 'last' - attention from last layer.
    • 'first' - attention from first layer.
    • 'mean' - average attentions from all layers.
  • visualization (bool, optional (default=True)) – If True, it will open the visualization dashboard.
Returns:

dictionary

Return type:

results

class malaya._models._bert_model.MULTICLASS_BERT[source]
predict(string, get_proba=False)[source]

classify a string.

Parameters:
  • string (str) –
  • get_proba (bool, optional (default=False)) – If True, it will return probability of classes.
Returns:

dictionary

Return type:

results

predict_batch(strings, get_proba=False)[source]

classify list of strings.

Parameters:
  • strings (list) –
  • get_proba (bool, optional (default=False)) – If True, it will return probability of classes.
Returns:

list_dictionaries

Return type:

list of results

predict_words(string, method='last', visualization=True)[source]

Classify words in a string.

Parameters:
  • string (str) –
  • method (str, optional (default='last')) –

    Attention layer supported. Allowed values:

    • 'last' - attention from the last layer.
    • 'first' - attention from the first layer.
    • 'mean' - average of attentions from all layers.
  • visualization (bool, optional (default=True)) – if True, opens the visualization dashboard.
Returns:

dictionary

Return type:

results

class malaya._models._bert_model.SIGMOID_BERT[source]
predict(string, get_proba=False)[source]

Classify a string.

Parameters:
  • string (str) –
  • get_proba (bool, optional (default=False)) – if True, returns the probability of each class.
Returns:

dictionary

Return type:

results

predict_batch(strings, get_proba=False)[source]

Classify a list of strings.

Parameters:
  • strings (list) –
  • get_proba (bool, optional (default=False)) – if True, returns the probability of each class.
Returns:

list_dictionaries

Return type:

list of results

predict_words(string, method='last', visualization=True)[source]

Classify words in a string.

Parameters:
  • string (str) –
  • method (str, optional (default='last')) –

    Attention layer supported. Allowed values:

    • 'last' - attention from the last layer.
    • 'first' - attention from the first layer.
    • 'mean' - average of attentions from all layers.
  • visualization (bool, optional (default=True)) – if True, opens the visualization dashboard.
Returns:

dictionary

Return type:

results

class malaya._models._bert_model.SIAMESE_BERT[source]
predict(string_left, string_right)[source]

Calculate the similarity between two texts.

Parameters:
  • string_left (str) –
  • string_right (str) –
Returns:

float

Return type:

float

predict_batch(strings_left, strings_right)[source]

Calculate pairwise similarity between two batches of texts.

Parameters:
  • strings_left (list of str) –
  • strings_right (list of str) –
Returns:

list

Return type:

list of float
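
The pairing contract is worth making explicit: element i of the returned list is the similarity of strings_left[i] and strings_right[i]. The sketch below illustrates that contract over pre-computed vectors; the cosine metric is an assumption for illustration, not necessarily what SIAMESE_BERT computes internally.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_similarity(vectors_left, vectors_right):
    """Mirror the strings_left/strings_right pairing: one float per pair."""
    return [cosine_similarity(u, v) for u, v in zip(vectors_left, vectors_right)]

left = [[1.0, 0.0], [1.0, 1.0]]
right = [[1.0, 0.0], [1.0, 0.0]]
print(pairwise_similarity(left, right))  # [1.0, 0.7071...]
```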

class malaya._models._bert_model.TAGGING_BERT[source]
analyze(string)[source]

Analyze a string.

Parameters:string (str) –
Returns:string
Return type:analyzed string
predict(string)[source]

Tag a string.

Parameters:string (str) –
Returns:string
Return type:tagged string
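
A tagged result is usually consumed token by token. The sketch below shows a plausible per-token shape and a common post-processing step (grouping consecutive tokens that share a tag); the (word, tag) pairing and the tag names are assumptions for illustration, not Malaya's documented format.

```python
# Hypothetical output shape: one (word, tag) pair per token.
tagged = [
    ("Kuala", "location"),
    ("Lumpur", "location"),
    ("ialah", "OTHER"),
    ("ibu", "OTHER"),
    ("negara", "OTHER"),
]

def group_entities(pairs):
    """Merge consecutive tokens that carry the same tag into phrases."""
    groups, current_words, current_tag = [], [], None
    for word, tag in pairs:
        if tag != current_tag and current_words:
            groups.append((" ".join(current_words), current_tag))
            current_words = []
        current_words.append(word)
        current_tag = tag
    if current_words:
        groups.append((" ".join(current_words), current_tag))
    return groups

print(group_entities(tagged))
# [('Kuala Lumpur', 'location'), ('ialah ibu negara', 'OTHER')]
```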