Segmentation#

This tutorial is available as an IPython notebook at Malaya/example/segmentation.

This module was trained on both standard and local (including social media) language structures, so it is safe to use for both.

This interface is deprecated; use the HuggingFace interface instead.

[1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''
[2]:
import logging

logging.basicConfig(level=logging.INFO)
[3]:
%%time

import malaya
CPU times: user 3.17 s, sys: 3.44 s, total: 6.61 s
Wall time: 2.25 s
/home/husein/dev/malaya/malaya/tokenizer.py:208: FutureWarning: Possible nested set at position 3372
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/husein/dev/malaya/malaya/tokenizer.py:208: FutureWarning: Possible nested set at position 3890
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))

A common problem with social media text is missing spaces between words. Text segmentation can restore them:

  1. huseinsukamakan ayam,dia sgtrisaukan -> husein suka makan ayam, dia sgt risaukan.

  2. drmahathir sangat menekankan budaya budakzamansekarang -> dr mahathir sangat menekankan budaya budak zaman sekarang.

  3. ceritatunnajibrazak -> cerita tun najib razak.

  4. TunM sukakan -> Tun M sukakan.

Segmentation only:

  1. Solves spacing errors.

  2. Does not correct grammar.
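Because segmentation should only change spacing, one quick sanity check is to compare the input and output with all whitespace removed. The helper below is a minimal sketch and not part of Malaya; the name `only_spacing_changed` is hypothetical.

```python
def only_spacing_changed(original: str, segmented: str) -> bool:
    """Return True if both strings are identical once all whitespace is removed."""
    def squash(text: str) -> str:
        # str.split() with no arguments splits on any run of whitespace.
        return "".join(text.split())

    return squash(original) == squash(segmented)


# Pairs taken from the examples in this tutorial.
print(only_spacing_changed("ceritatunnajibrazak", "cerita tun najib razak"))  # True
print(only_spacing_changed("TunM sukakan", "Tun M sukakan"))                  # True
# An output that repeats part of the input fails the check.
print(only_spacing_changed("ceritatunnajibrazak", "cerita tun najib razak cerita"))  # False
```

A check like this can flag outputs where a model added, dropped, or repeated characters instead of only inserting spaces.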

[4]:
import warnings
warnings.filterwarnings('default')
[5]:
string1 = 'huseinsukamakan ayam,dia sgtrisaukan'
string2 = 'drmahathir sangat menekankan budaya budakzamansekarang'
string3 = 'ceritatunnajibrazak'
string4 = 'TunM sukakan'
string_hard = 'IPOH-AhliDewanUndangan Negeri(ADUN) HuluKinta, MuhamadArafat Varisai Mahamadmenafikanmesejtularmendakwa beliau akan melompatparti menyokong UMNO membentuk kerajaannegeridiPerak.BeliauyangjugaKetua Penerangan Parti Keadilan Rakyat(PKR)Perak dalam satumesejringkaskepadaSinar Harian menjelaskan perkara itutidakbenarsama sekali.'
string_socialmedia = 'aqxsukalah apeyg tejadidekat mamattu'

Viterbi algorithm#

The Viterbi algorithm is commonly used to solve this problem, so we also provide a Viterbi segmenter that uses n-grams from Bahasa papers and Wikipedia.

def viterbi(max_split_length: int = 20, **kwargs):
    """
    Load Segmenter class using viterbi algorithm.

    Parameters
    ----------
    max_split_length: int, (default=20)
        max length of words in a sentence to segment
    validate: bool, optional (default=True)
        if True, malaya will check model availability and download if not available.

    Returns
    -------
    result : malaya.segmentation.SEGMENTER class
    """
[4]:
viterbi = malaya.segmentation.viterbi()
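To give an intuition for how a Viterbi-style segmenter works, here is a minimal sketch using unigram costs and dynamic programming. It is not Malaya's implementation: the word counts are made up for illustration, and `viterbi_segment`, `max_word_length`, and `unknown_cost` are hypothetical names.

```python
import math
from typing import Dict, List


def viterbi_segment(
    text: str,
    word_cost: Dict[str, float],
    max_word_length: int = 20,
    unknown_cost: float = 10.0,
) -> List[str]:
    """Split `text` into the word sequence with the lowest total cost."""
    n = len(text)
    best = [0.0] + [math.inf] * n  # best[i]: lowest cost to segment text[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last word in text[:i]
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_length), end):
            piece = text[start:end]
            # Unknown pieces pay a per-character penalty (an assumption of this sketch).
            cost = word_cost.get(piece, unknown_cost * len(piece))
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    # Walk the back-pointers from the end to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Toy counts, invented for this example; cost is the negative log probability.
counts = {"cerita": 50, "tun": 30, "najib": 20, "razak": 20, "tu": 40, "n": 5}
total = sum(counts.values())
word_cost = {word: -math.log(count / total) for word, count in counts.items()}

print(" ".join(viterbi_segment("ceritatunnajibrazak", word_cost)))  # cerita tun najib razak
```

In this toy setup the result depends entirely on the vocabulary: if `tun` is removed from `counts`, the cheapest path becomes `cerita tu n najib razak`, which shows how a count-based segmenter can over-split words it has not seen.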

Segmentize#

def segment(self, strings: List[str]):
    """
    Segment strings.
    Example, "sayasygkan negarasaya" -> "saya sygkan negara saya"

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: List[str]
    """
[5]:
%%time

viterbi.segment([string1, string2, string3, string4])
CPU times: user 109 ms, sys: 1.04 ms, total: 110 ms
Wall time: 110 ms
[5]:
['husein suka makan ayam,dia sgt risau kan',
 'dr mahathir sangat mene kan kan budaya budak zaman sekarang',
 'cerita tu n najib razak',
 'Tun M suka kan']
[6]:
%%time

viterbi.segment([string_hard, string_socialmedia])
CPU times: user 8.45 ms, sys: 157 µs, total: 8.6 ms
Wall time: 8.69 ms
[6]:
['IPOH - Ahli Dewan Undangan Negeri(ADUN) Hulu Kinta, Muhamad Arafat Varisai Mahamadmenafikanmesejtularmendakwa belia u akan me lompat part i me nyo ko ng UMNO mem bentuk kerajaannegeridi Perak. Beliauyangjuga Ketua Penerangan Parti Keadilan Rakyat(PKR) Perak dalam satumesejringkaskepada Sinar Harian men jel ask an perkara it u tidak benar sama sekali.',
 'aq x suka lah ape yg te jadi dekat mama ttu']

List available Transformer model#

[6]:
malaya.segmentation.available_transformer()
/home/husein/dev/malaya/malaya/segmentation.py:221: DeprecationWarning: `malaya.segmentation.available_transformer` is deprecated, use `malaya.segmentation.available_huggingface` instead
  warnings.warn(
INFO:malaya.segmentation:tested on random generated dataset at https://f000.backblazeb2.com/file/malay-dataset/segmentation/test-set-segmentation.json
[6]:
                     Size (MB)  Quantized Size (MB)       WER  Suggested length
small                    42.70                13.10  0.208520             256.0
base                    234.00                63.80  0.177624             256.0
super-tiny-t5            81.80                27.10  0.032980             256.0
super-super-tiny-t5      39.60                12.00  0.037882             256.0
3x-super-tiny-t5         18.30                 4.46  0.059895             256.0
3x-super-tiny-t5-4k       5.03                 2.99  0.134560             256.0
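The WER column is a word error rate, where lower is better. The page does not say exactly how it was computed, so the sketch below assumes the standard definition: word-level edit distance divided by the number of reference words. The function name `word_error_rate` is hypothetical.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# The Viterbi output for string3 earlier in this tutorial versus the intended segmentation.
print(word_error_rate("cerita tun najib razak", "cerita tu n najib razak"))  # 0.5
print(word_error_rate("cerita tun najib razak", "cerita tun najib razak"))   # 0.0
```

Under this definition, splitting one word into two counts as one substitution plus one insertion, so segmentation errors are penalized fairly heavily on short strings.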

Load Transformer model#

def transformer(model: str = 'small', quantized: bool = False, **kwargs):
    """
    Load transformer encoder-decoder model for segmentation.

    Parameters
    ----------
    model: str, optional (default='small')
        Check available models at `malaya.segmentation.available_transformer()`.
    quantized: bool, optional (default=False)
        if True, will load 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result: malaya.model.tf.Segmentation class
    """
[7]:
model = malaya.segmentation.transformer(model = 'small')
/home/husein/dev/malaya/malaya/segmentation.py:246: DeprecationWarning: `malaya.segmentation.transformer` is deprecated, use `malaya.segmentation.huggingface` instead
  warnings.warn(
[ ]:
quantized_model = malaya.segmentation.transformer(model = 'small', quantized = True)
[22]:
model_base = malaya.segmentation.transformer(model = 'base')
quantized_model_base = malaya.segmentation.transformer(model = 'base', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
[11]:
super_super_tiny = malaya.segmentation.transformer(model = 'super-super-tiny-t5')
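Since a quantized model is not guaranteed to be faster, it is worth measuring on your own machine. The helper below is a small sketch using only the standard library; `time_call` is a hypothetical name, and `fake_decoder` is a stand-in so the example runs without Malaya installed.

```python
import time
from typing import Any, Callable, List, Tuple


def time_call(fn: Callable[..., Any], *args: Any, repeats: int = 3) -> Tuple[Any, float]:
    """Call fn(*args) `repeats` times; return the last result and the best wall time in seconds."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        best = min(best, time.perf_counter() - start)
    return result, best


def fake_decoder(strings: List[str]) -> List[str]:
    """Stand-in for a model's decoder method; it only upper-cases the input."""
    return [s.upper() for s in strings]


result, seconds = time_call(fake_decoder, ["tunm sukakan"])
print(result, f"{seconds:.6f} s")
```

With the models loaded above, the same helper could be called as `time_call(model.greedy_decoder, [string1])` and `time_call(quantized_model.greedy_decoder, [string1])` to compare both variants on identical input.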

Predict using greedy decoder#

def greedy_decoder(self, strings: List[str]):
    """
    Segment strings using greedy decoder.
    Example, "sayasygkan negarasaya" -> "saya sygkan negara saya"

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: List[str]
    """
[10]:
%%time

model.greedy_decoder([string1, string2, string3, string4])
CPU times: user 1.12 s, sys: 432 ms, total: 1.55 s
Wall time: 959 ms
[10]:
['husein suka makan ayam, dia sgt risaukan',
 'dr mahathir sangat menekankan budaya budak zaman sekarang',
 'cerita tun najib razak',
 'Tun M sukakan']
[11]:
%%time

quantized_model.greedy_decoder([string1, string2, string3, string4])
CPU times: user 1.12 s, sys: 464 ms, total: 1.58 s
Wall time: 888 ms
[11]:
['husein suka makan ayam, dia sgt risaukan',
 'dr mahathir sangat menekankan budaya budak zaman sekarang',
 'cerita tun najib razak',
 'Tun M sukakan']
[12]:
%%time

model_base.greedy_decoder([string1, string2, string3, string4])
CPU times: user 5.58 s, sys: 2.88 s, total: 8.46 s
Wall time: 4.08 s
[12]:
['husein suka makan ayam, dia sgt risaukan',
 'dr mahathir sangat menekankan budaya budak zaman sekarang',
 'cerita tun najib razak cerita',
 'Tun M sukakan Tun M sukakan']
[13]:
%%time

quantized_model_base.greedy_decoder([string1, string2, string3, string4])
CPU times: user 5.73 s, sys: 2.96 s, total: 8.69 s
Wall time: 3.81 s
[13]:
['husein suka makan ayam, dia sgt risaukan',
 'dr mahathir sangat menekankan budaya budak zaman sekarang',
 'cerita tun najib razak cerita tun',
 'Tun M sukakan Tun M sukakan']
[13]:
%%time

super_super_tiny.greedy_decoder([string1, string2, string3, string4])
CPU times: user 908 ms, sys: 433 ms, total: 1.34 s
Wall time: 288 ms
[13]:
['husein suka makan ayam, dia sgt risaukan',
 'dr mahathir sangat menekankan budaya budak zaman sekarang',
 'cerita tun najib razak',
 'Tun M sukakan']
[14]:
%%time

model.greedy_decoder([string_hard, string_socialmedia])
CPU times: user 2.52 s, sys: 499 ms, total: 3.02 s
Wall time: 768 ms
[14]:
['IPOH - Ahli Dewan Undangan Negeri (ADUN) Hulu Kinta, Muhamad Arafat Varisai Mahamad menafikan mesej tular mendakwa beliau akan melompat parti menyokong UMNO membentuk kerajaan negeri di Perak. Beliau yang juga Ketua Penerangan Parti Keadilan Rakyat (PKR) Perak dalam satu mesej ringkas kepada Sinar Harian menjelaskan perkara itu tidak benar sama sekali.',
 'aq xsukalah ape yg tejadid dekat mamat tu']
[15]:
%%time

quantized_model.greedy_decoder([string_hard, string_socialmedia])
CPU times: user 2.62 s, sys: 447 ms, total: 3.07 s
Wall time: 756 ms
[15]:
['IPOH - Ahli Dewan Undangan Negeri (ADUN) Hulu Kinta, Muhamad Arafat Varisai Mahamad menafikan mesej tular mendakwa beliau akan melompat parti menyokong UMNO membentuk kerajaan negeri di Perak. Beliau yang juga Ketua Penerangan Parti Keadilan Rakyat (PKR) Perak dalam satu mesej ringkas kepada Sinar Harian menjelaskan perkara itu tidak benar sama sekali.',
 'aq xsukalah ape yg tejadid dekat mamat tu']
[16]:
%%time

model_base.greedy_decoder([string_hard, string_socialmedia])
CPU times: user 17.8 s, sys: 10.2 s, total: 28 s
Wall time: 5.84 s
[16]:
['IPOH - Ahli Dewan Undangan Negeri (ADUN) Hulu Kinta, Muhamad Arafat Varisai Mahamad menafikan mesej tular mendakwa beliau akan melompat parti menyokong UMNO membentuk kerajaan negeri di Perak. Beliau yang juga Ketua Penerangan Parti Keadilan Rakyat (PKR) Perak dalam satu mesej ringkas kepada Sinar Harian menjelaskan perkara itu tidak benar sama sekali.',
 'aq xsukalah ape yg teja di dekat mamat tu aq xsukalah ape yg teja di dekat mamat tu']
[17]:
%%time

quantized_model_base.greedy_decoder([string_hard, string_socialmedia])
CPU times: user 17.6 s, sys: 9.63 s, total: 27.3 s
Wall time: 5.85 s
[17]:
['IPOH - Ahli Dewan Undangan Negeri (ADUN) Hulu Kinta, Muhamad Arafat Varisai Mahamad menafikan mesej tular mendakwa beliau akan melompat parti menyokong UMNO membentuk kerajaan negeri di Perak. Beliau yang juga Ketua Penerangan Parti Keadilan Rakyat (PKR) Perak dalam satu mesej ringkas kepada Sinar Harian menjelaskan perkara itu tidak benar sama sekali.',
 'aq xsukalah ape yg teja di dekat mamat tu aq xsukalah ape yg teja di dekat mamat tu']
[14]:
%%time

super_super_tiny.greedy_decoder([string_hard, string_socialmedia])
CPU times: user 1.34 s, sys: 527 ms, total: 1.87 s
Wall time: 421 ms
[14]:
['IPOH - Ahli Dewan Undangan Negeri (ADUN) Hulu Kinta, Muhamad Arafat Varisai Mahamad menafikan mesej tular mendakwa beliau akan melompat parti menyokong UMNO membentuk kerajaan negeri di Perak. Beliau yang juga Ketua Penerangan Parti Keadilan Rakyat (PKR) Perak dalam satu mesej ringkas kepada Sinar Harian menjelaskan perkara itu tidak benar sama sekali.',
 'aq xsukalah ape yg tejadi dekat mamat tu']

Batching has a drawback: a short string batched together with longer ones may repeat itself in the output, as in the base model results above. To avoid this, pass one string at a time:

[18]:
%%time

quantized_model_base.greedy_decoder([string_socialmedia])
CPU times: user 1.37 s, sys: 532 ms, total: 1.9 s
Wall time: 652 ms
[18]:
['aq xsukalah ape yg teja di dekat mamat tu']
[19]:
%%time

quantized_model_base.greedy_decoder([string3])
CPU times: user 648 ms, sys: 228 ms, total: 876 ms
Wall time: 289 ms
[19]:
['cerita tun najib razak']
[20]:
%%time

quantized_model_base.greedy_decoder([string4])
CPU times: user 495 ms, sys: 202 ms, total: 697 ms
Wall time: 225 ms
[20]:
['Tun M sukakan']
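The one-string-at-a-time workaround can be wrapped in a small helper. This is a sketch, not a Malaya API: `decode_one_by_one` is a hypothetical name, and `stub_decode` stands in for a method such as `greedy_decoder` so the example runs on its own.

```python
from typing import Callable, List


def decode_one_by_one(
    decode: Callable[[List[str]], List[str]],
    strings: List[str],
) -> List[str]:
    """Call `decode` with a single-item batch per string, preserving input order."""
    return [decode([text])[0] for text in strings]


batch_sizes: List[int] = []


def stub_decode(batch: List[str]) -> List[str]:
    """Stand-in decoder: records the batch size and returns the input unchanged."""
    batch_sizes.append(len(batch))
    return list(batch)


print(decode_one_by_one(stub_decode, ["ceritatunnajibrazak", "TunM sukakan"]))
print(batch_sizes)  # [1, 1]
```

With a loaded model, the call would look like `decode_one_by_one(quantized_model_base.greedy_decoder, [string3, string4])`. The trade-off is speed: each string costs a separate forward pass instead of sharing one batch.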

Predict using beam decoder#

def beam_decoder(self, strings: List[str]):
    """
    Segment strings using beam decoder, beam width size 3, alpha 0.5.
    Example, "sayasygkan negarasaya" -> "saya sygkan negara saya"

    Parameters
    ----------
    strings : List[str]

    Returns
    -------
    result: List[str]
    """

T5 models cannot use the beam decoder.

[11]:
%%time

quantized_model.beam_decoder([string_socialmedia])
CPU times: user 1.38 s, sys: 1.87 s, total: 3.25 s
Wall time: 654 ms
[11]:
['aq xsukalah ape yg tejadid dekat mamat tu']
[12]:
%%time

quantized_model_base.beam_decoder([string_socialmedia])
CPU times: user 6.77 s, sys: 3.71 s, total: 10.5 s
Wall time: 2.43 s
[12]:
['aq xsukalah ape yg teja di dekat mamat tu']

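As expected, the beam decoder is much slower than the greedy decoder: in the runs above, quantized_model_base took 2.43 s of wall time with the beam decoder versus 652 ms with the greedy decoder on the same single string.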
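The extra cost comes from keeping several candidate outputs alive at every step instead of one. The toy below is a sketch of generic beam search over a hand-written probability table, not Malaya's decoder. It omits the length penalty (alpha), since all candidates here have the same length, and `TABLE`, `toy_step`, and `beam_search` are hypothetical names.

```python
import math
from typing import Callable, Dict, List, Tuple

Prefix = Tuple[str, ...]

# Invented next-token probabilities for each prefix.
TABLE: Dict[Prefix, Dict[str, float]] = {
    (): {"a": 0.5, "b": 0.4, "c": 0.1},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
    ("c",): {"x": 0.5, "y": 0.5},
}


def toy_step(prefix: Prefix) -> Dict[str, float]:
    """Return log probabilities of the next token given a prefix."""
    return {token: math.log(p) for token, p in TABLE[prefix].items()}


def beam_search(
    step: Callable[[Prefix], Dict[str, float]],
    length: int,
    beam_width: int,
) -> Tuple[Prefix, float, int]:
    """Return the best sequence, its log probability, and the number of step calls."""
    beams: List[Tuple[Prefix, float]] = [((), 0.0)]
    step_calls = 0
    for _ in range(length):
        candidates: List[Tuple[Prefix, float]] = []
        for prefix, score in beams:
            step_calls += 1
            for token, log_prob in step(prefix).items():
                candidates.append((prefix + (token,), score + log_prob))
        # Keep only the `beam_width` highest-scoring candidates.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    best_sequence, best_score = beams[0]
    return best_sequence, best_score, step_calls


# A beam width of 1 is greedy decoding.
print(beam_search(toy_step, length=2, beam_width=1)[::2])  # (('a', 'x'), 2)
print(beam_search(toy_step, length=2, beam_width=3)[::2])  # (('b', 'x'), 4)
```

In this toy, the wider beam finds a more probable sequence (0.36 versus 0.275) but evaluates the step function twice as often. Real decoders batch the beams together, so the slowdown is not a simple multiple, but the work per step still grows with the beam width.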