Segmentation

This tutorial is available as an IPython notebook at Malaya/example/segmentation.

This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.

[1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''
[2]:
%%time

import malaya
CPU times: user 2.69 s, sys: 4 s, total: 6.69 s
Wall time: 1.94 s

A common problem in social media text is missing spaces between words. Text segmentation can restore them, for example:

  1. huseinsukamakan ayam,dia sgtrisaukan -> husein suka makan ayam, dia sgt risaukan.

  2. drmahathir sangat menekankan budaya budakzamansekarang -> dr mahathir sangat menekankan budaya budak zaman sekarang.

  3. ceritatunnajibrazak -> cerita tun najib razak.

  4. TunM sukakan -> Tun M sukakan.

Segmentation only:

  1. Fixes spacing errors.

  2. Does not correct grammar.

[3]:
string1 = 'huseinsukamakan ayam,dia sgtrisaukan'
string2 = 'drmahathir sangat menekankan budaya budakzamansekarang'
string3 = 'ceritatunnajibrazak'
string4 = 'TunM sukakan'
string_hard = 'IPOH-AhliDewanUndangan Negeri(ADUN) HuluKinta, MuhamadArafat Varisai Mahamadmenafikanmesejtularmendakwa beliau akan melompatparti menyokong UMNO membentuk kerajaannegeridiPerak.BeliauyangjugaKetua Penerangan Parti Keadilan Rakyat(PKR)Perak dalam satumesejringkaskepadaSinar Harian menjelaskan perkara itutidakbenarsama sekali.'
string_socialmedia = 'aqxsukalah apeyg tejadidekat mamattu'

List available HuggingFace models

[4]:
malaya.segmentation.available_huggingface
[4]:
{'mesolitica/finetune-segmentation-t5-super-tiny-standard-bahasa-cased': {'Size (MB)': 51,
  'WER': 0.030962535,
  'CER': 0.0041129253,
  'Suggested length': 256},
 'mesolitica/finetune-segmentation-t5-tiny-standard-bahasa-cased': {'Size (MB)': 139,
  'WER': 0.0207876127,
  'CER': 0.002146691161,
  'Suggested length': 256},
 'mesolitica/finetune-segmentation-t5-small-standard-bahasa-cased': {'Size (MB)': 242,
  'WER': 0.0202468274,
  'CER': 0.0024325431,
  'Suggested length': 256}}
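
The entries above can also be compared programmatically. The sketch below copies the three entries into a plain dictionary, so it runs without Malaya installed, and picks the model with the lowest error under a size budget. `pick_model`, `max_size_mb` and `metric` are illustrative names, not part of Malaya.

```python
# Metrics copied from the `malaya.segmentation.available_huggingface` output above.
available = {
    'mesolitica/finetune-segmentation-t5-super-tiny-standard-bahasa-cased': {
        'Size (MB)': 51, 'WER': 0.030962535, 'CER': 0.0041129253, 'Suggested length': 256},
    'mesolitica/finetune-segmentation-t5-tiny-standard-bahasa-cased': {
        'Size (MB)': 139, 'WER': 0.0207876127, 'CER': 0.002146691161, 'Suggested length': 256},
    'mesolitica/finetune-segmentation-t5-small-standard-bahasa-cased': {
        'Size (MB)': 242, 'WER': 0.0202468274, 'CER': 0.0024325431, 'Suggested length': 256},
}


def pick_model(models, max_size_mb, metric='WER'):
    """Hypothetical helper: lowest `metric` among models no larger than `max_size_mb`."""
    candidates = {
        name: meta for name, meta in models.items()
        if meta['Size (MB)'] <= max_size_mb
    }
    if not candidates:
        raise ValueError(f'no model fits within {max_size_mb} MB')
    return min(candidates, key=lambda name: candidates[name][metric])


print(pick_model(available, max_size_mb=150))                # best WER under 150 MB
print(pick_model(available, max_size_mb=300))                # best WER under 300 MB
print(pick_model(available, max_size_mb=300, metric='CER'))  # best CER under 300 MB
```

On these numbers the small model has the lowest WER, while the tiny model has the lowest CER, so the choice depends on which metric matters more for your use case.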
[5]:
print(malaya.segmentation.info)
tested on random generated dataset at https://f000.backblazeb2.com/file/malay-dataset/segmentation/test-set-segmentation.json
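
For readers unfamiliar with the WER and CER columns, the sketch below computes both under their usual definitions: edit distance over words (WER) or characters (CER), divided by the length of the reference. This is an assumption about how the reported numbers were produced, not Malaya's evaluation code; `edit_distance`, `wer` and `cer` are illustrative names.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute cost 1 each)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError('reference must contain at least one word')
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over the reference length."""
    if not reference:
        raise ValueError('reference must not be empty')
    return edit_distance(reference, hypothesis) / len(reference)


# Expected segmentation from the introduction vs. a hypothetical output with one wrong word.
reference = 'husein suka makan ayam, dia sgt risaukan'
hypothesis = 'husein suka makan ayam, dia sgt risikokan'
print(wer(reference, hypothesis))
print(cer(reference, hypothesis))
```

One wrong word out of seven gives a WER of 1/7, while the CER is much smaller because most characters of that word still match.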

Load HuggingFace model

def huggingface(
    model: str = 'mesolitica/finetune-segmentation-t5-tiny-standard-bahasa-cased',
    force_check: bool = True,
    **kwargs,
):
    """
    Load HuggingFace model to segmentation.

    Parameters
    ----------
    model: str, optional (default='mesolitica/finetune-segmentation-t5-tiny-standard-bahasa-cased')
        Check available models at `malaya.segmentation.available_huggingface`.
    force_check: bool, optional (default=True)
        Force check model one of malaya model.
        Set to False if you have your own huggingface model.

    Returns
    -------
    result: malaya.torch_model.huggingface.Generator
    """
[6]:
model = malaya.segmentation.huggingface()

Predict

def generate(self, strings: List[str], **kwargs):
    """
    Generate texts from the input.

    Parameters
    ----------
    strings : List[str]
    **kwargs: vector arguments pass to huggingface `generate` method.
        Read more at https://huggingface.co/docs/transformers/main_classes/text_generation

    Returns
    -------
    result: List[str]
    """
[7]:
%%time

model.generate([string1, string2, string3, string4], max_length=256)
CPU times: user 1.72 s, sys: 9.94 ms, total: 1.73 s
Wall time: 164 ms
[7]:
['husein suka makan ayam, dia sgt risikokan',
 'dr mahathir sangat menekankan budaya budak zaman sekarang',
 'cerita tun najib razak',
 'Tun M sukakan']
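
Because segmentation is meant to fix spacing only, one cheap sanity check is to compare each input and output with all whitespace removed. The sketch below does that for the four inputs and outputs of this cell; `only_spacing_changed` is a hypothetical helper, not part of Malaya.

```python
def only_spacing_changed(source, segmented):
    """True if the two strings are identical once all whitespace is removed."""
    return ''.join(source.split()) == ''.join(segmented.split())


# (input, model output) pairs copied from the cell above.
pairs = [
    ('huseinsukamakan ayam,dia sgtrisaukan',
     'husein suka makan ayam, dia sgt risikokan'),
    ('drmahathir sangat menekankan budaya budakzamansekarang',
     'dr mahathir sangat menekankan budaya budak zaman sekarang'),
    ('ceritatunnajibrazak', 'cerita tun najib razak'),
    ('TunM sukakan', 'Tun M sukakan'),
]

for source, segmented in pairs:
    status = 'ok' if only_spacing_changed(source, segmented) else 'characters changed'
    print(f'{status}: {segmented}')
```

Here the first pair is flagged, because the model rewrote `risaukan` as `risikokan`; the other three differ from their inputs only in spacing.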
[8]:
%%time

model.generate([string1, string2, string3, string4, string_hard, string_socialmedia], max_length=256)
CPU times: user 5.68 s, sys: 11.6 ms, total: 5.69 s
Wall time: 511 ms
[8]:
['husein suka makan ayam, dia sgt risikokan',
 'dr mahathir sangat menekankan budaya budak zaman sekarang',
 'cerita tun najib razak',
 'Tun M sukakan',
 'IPOH - Ahli Dewan Undangan Negeri (ADUN) Hulu Kinta, Muhamad Ararat Varisai Mahamad menafikan mesej tular mendakwa beliau akan melompat parti menyokong UMNO membentuk kerajaan negeri di Perak. Beliau yang juga Ketua Penerangan Parti Keadilan Rakyat (PKR) Perak dalam satu mesej ringkas kepada Sinar Harian menjelaskan perkara itu tidak benar sama sekali.',
 'aq x sukalah ape yg tejadi dekat mamat tu']
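
For longer lists of strings, you may prefer to call `generate` on fixed-size batches rather than on everything at once. The sketch below is a hypothetical wrapper, not part of Malaya; it assumes only what the docstring above states, namely that `generate` takes a list of strings and returns a list of strings. A stand-in function replaces the model so the example runs on its own.

```python
def generate_in_batches(generate_fn, strings, batch_size=4):
    """Call `generate_fn` on consecutive slices of `strings` and concatenate the results."""
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    results = []
    for start in range(0, len(strings), batch_size):
        results.extend(generate_fn(strings[start:start + batch_size]))
    return results


# Stand-in for `model.generate`, used here only so the sketch is self-contained.
def fake_generate(batch):
    return [text.upper() for text in batch]


print(generate_in_batches(fake_generate, ['a', 'b', 'c', 'd', 'e'], batch_size=2))
```

With the model loaded as above, the call would look like `generate_in_batches(lambda batch: model.generate(batch, max_length=256), strings)`. Whether batching improves speed or memory use depends on your hardware and inputs, so it is worth measuring with `%%time`.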

Inference on mixed Malay (MS) and English (EN) text

[9]:
string5 = 'i hate chicken, but i like fish'
string6 = 'hi guys i noticed semalam & harini dah ramai yang dapat cookies ni kan. So harini i nak share some post mortem of our first batch:'
[10]:
%%time

model.generate([string1, string2, string3, string4, string_hard, string_socialmedia,
                string5, string6], max_length=256)
CPU times: user 6.6 s, sys: 6.91 ms, total: 6.61 s
Wall time: 601 ms
[10]:
['husein suka makan ayam, dia sgt risikokan',
 'dr mahathir sangat menekankan budaya budak zaman sekarang',
 'cerita tun najib razak',
 'Tun M sukakan',
 'IPOH - Ahli Dewan Undangan Negeri (ADUN) Hulu Kinta, Muhamad Ararat Varisai Mahamad menafikan mesej tular mendakwa beliau akan melompat parti menyokong UMNO membentuk kerajaan negeri di Perak. Beliau yang juga Ketua Penerangan Parti Keadilan Rakyat (PKR) Perak dalam satu mesej ringkas kepada Sinar Harian menjelaskan perkara itu tidak benar sama sekali.',
 'aq x sukalah ape yg tejadi dekat mamat tu',
 'i hate chicken, but i like fish',
 'hi guys i noticed semalam & harini dah ramai yang dapat cookies ni kan. So harini i nak share some post mortem of our first batch:']