Segmentation HuggingFace

This tutorial is available as an IPython notebook at Malaya/example/segmentation-huggingface.

This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.

[1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''
[2]:
import logging

logging.basicConfig(level=logging.INFO)
[3]:
%%time

import malaya
CPU times: user 3.07 s, sys: 3.53 s, total: 6.6 s
Wall time: 2.28 s
/home/husein/dev/malaya/malaya/tokenizer.py:208: FutureWarning: Possible nested set at position 3372
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/husein/dev/malaya/malaya/tokenizer.py:208: FutureWarning: Possible nested set at position 3890
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))

A common problem with social media text is missing spaces between words. Text segmentation can restore them, for example:

  1. huseinsukamakan ayam,dia sgtrisaukan -> husein suka makan ayam, dia sgt risaukan.

  2. drmahathir sangat menekankan budaya budakzamansekarang -> dr mahathir sangat menekankan budaya budak zaman sekarang.

  3. ceritatunnajibrazak -> cerita tun najib razak.

  4. TunM sukakan -> Tun M sukakan.

Segmentation only:

  1. Fixes spacing errors.

  2. Does not correct grammar.
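The two points above imply that a correct segmentation differs from its input only in whitespace. A small check along those lines can be sketched as follows; `only_spacing_changed` is a hypothetical helper written for this tutorial, not part of Malaya.

```python
def _strip_whitespace(text: str) -> str:
    """Remove every whitespace character from the text."""
    return ''.join(text.split())


def only_spacing_changed(original: str, segmented: str) -> bool:
    """Return True if the two strings are identical once all whitespace is removed."""
    return _strip_whitespace(original) == _strip_whitespace(segmented)


print(only_spacing_changed('ceritatunnajibrazak', 'cerita tun najib razak'))  # True
print(only_spacing_changed('TunM sukakan', 'Tun M sukakan'))  # True
# A candidate that alters a word is flagged, because that is more than a spacing fix.
print(only_spacing_changed('sgtrisaukan', 'sgt risikokan'))  # False
```

Because the model is generative, its output is not guaranteed to pass this check, so the helper can serve as a sanity test on predictions.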

[4]:
string1 = 'huseinsukamakan ayam,dia sgtrisaukan'
string2 = 'drmahathir sangat menekankan budaya budakzamansekarang'
string3 = 'ceritatunnajibrazak'
string4 = 'TunM sukakan'
string_hard = 'IPOH-AhliDewanUndangan Negeri(ADUN) HuluKinta, MuhamadArafat Varisai Mahamadmenafikanmesejtularmendakwa beliau akan melompatparti menyokong UMNO membentuk kerajaannegeridiPerak.BeliauyangjugaKetua Penerangan Parti Keadilan Rakyat(PKR)Perak dalam satumesejringkaskepadaSinar Harian menjelaskan perkara itutidakbenarsama sekali.'
string_socialmedia = 'aqxsukalah apeyg tejadidekat mamattu'

List available HuggingFace models

[5]:
malaya.segmentation.available_huggingface()
INFO:malaya.segmentation:tested on random generated dataset at https://f000.backblazeb2.com/file/malay-dataset/segmentation/test-set-segmentation.json
[5]:
Model                                                                 Size (MB)  WER      Suggested length
mesolitica/finetune-segmentation-t5-super-tiny-standard-bahasa-cased  51.0       0.13456  256.0
mesolitica/finetune-segmentation-t5-tiny-standard-bahasa-cased        139.0      0.13456  256.0
mesolitica/finetune-segmentation-t5-small-standard-bahasa-cased       242.0      0.13456  256.0
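Since the listing reports the same WER for all three models, size is the main thing left to choose on. The sketch below picks a model under a size budget. The rows are copied by hand from the listing above, and `MODELS` and `pick_model` are hypothetical names used only for illustration.

```python
# Rows copied from the listing above: (model name, size in MB, WER).
MODELS = [
    ('mesolitica/finetune-segmentation-t5-super-tiny-standard-bahasa-cased', 51.0, 0.13456),
    ('mesolitica/finetune-segmentation-t5-tiny-standard-bahasa-cased', 139.0, 0.13456),
    ('mesolitica/finetune-segmentation-t5-small-standard-bahasa-cased', 242.0, 0.13456),
]


def pick_model(max_size_mb: float) -> str:
    """Return the lowest-WER model within the size budget, breaking ties by smaller size."""
    candidates = [m for m in MODELS if m[1] <= max_size_mb]
    if not candidates:
        raise ValueError(f'no model fits within {max_size_mb} MB')
    name, _size, _wer = min(candidates, key=lambda m: (m[2], m[1]))
    return name


chosen = pick_model(max_size_mb=150)
print(chosen)
```

The selected name can then be passed as the `model` argument of `malaya.segmentation.huggingface`.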

Load HuggingFace model

def huggingface(model: str = 'mesolitica/finetune-segmentation-t5-tiny-standard-bahasa-cased', **kwargs):
    """
    Load HuggingFace model for segmentation.

    Parameters
    ----------
    model: str, optional (default='mesolitica/finetune-segmentation-t5-tiny-standard-bahasa-cased')
        Check available models at `malaya.segmentation.available_huggingface()`.

    Returns
    -------
    result: malaya.torch_model.huggingface.Generator
    """
[6]:
model = malaya.segmentation.huggingface()

Predict

def generate(self, strings: List[str], **kwargs):
    """
    Generate texts from the input.

    Parameters
    ----------
    strings : List[str]
    **kwargs: keyword arguments passed to the HuggingFace `generate` method.
        Read more at https://huggingface.co/docs/transformers/main_classes/text_generation

    Returns
    -------
    result: List[str]
    """
[7]:
%%time

model.generate([string1, string2, string3, string4], max_length = 256)
CPU times: user 1.43 s, sys: 39.4 ms, total: 1.47 s
Wall time: 126 ms
[7]:
['husein suka makan ayam, dia sgt risikokan',
 'dr mahathir sangat menekankan budaya budak zaman sekarang',
 'cerita tun najib razak',
 'Tun M sukakan']
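The first prediction differs from the expected segmentation listed earlier in one word (`risikokan` instead of `risaukan`). To put a number on such differences, a word error rate (WER) can be computed as the word-level edit distance divided by the number of reference words. The sketch below is a plain-Python illustration under that standard definition. It is not necessarily the procedure used to produce the WER figures reported by `available_huggingface()`, and `word_error_rate` is a hypothetical helper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError('reference must contain at least one word')
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Expected segmentation from the examples earlier (trailing full stop omitted).
expected = 'husein suka makan ayam, dia sgt risaukan'
predicted = 'husein suka makan ayam, dia sgt risikokan'
print(round(word_error_rate(expected, predicted), 4))  # 0.1429
```

Here one of the seven reference words is substituted, giving 1/7.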
[8]:
%%time

model.generate([string1, string2, string3, string4, string_hard, string_socialmedia], max_length = 256)
CPU times: user 6.13 s, sys: 0 ns, total: 6.13 s
Wall time: 550 ms
[8]:
['husein suka makan ayam, dia sgt risikokan',
 'dr mahathir sangat menekankan budaya budak zaman sekarang',
 'cerita tun najib razak',
 'Tun M sukakan',
 'IPOH - Ahli Dewan Undangan Negeri (ADUN) Hulu Kinta, Muhamad Ararat Varisai Mahamad menafikan mesej tular mendakwa beliau akan melompat parti menyokong UMNO membentuk kerajaan negeri di Perak. Beliau yang juga Ketua Penerangan Parti Keadilan Rakyat (PKR) Perak dalam satu mesej ringkas kepada Sinar Harian menjelaskan perkara itu tidak benar sama sekali.',
 'aq x sukalah ape yg tejadi dekat mamat tu']
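Because `generate` takes a list of strings and returns a list of the same length, a long list can be processed in fixed-size slices rather than in one call. The following is a minimal sketch of that idea; `generate_in_batches`, `batch_size`, and `fake_generate` are hypothetical names, and the stand-in function only exists so the example runs without downloading a model.

```python
from typing import Callable, Iterator, List


def generate_in_batches(
    generate: Callable[[List[str]], List[str]],
    strings: List[str],
    batch_size: int = 4,
) -> Iterator[str]:
    """Yield results by calling `generate` on consecutive slices of `strings`."""
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    for start in range(0, len(strings), batch_size):
        yield from generate(strings[start:start + batch_size])


# Stand-in for `model.generate` so the sketch runs without a model.
def fake_generate(batch: List[str]) -> List[str]:
    return [s.upper() for s in batch]


print(list(generate_in_batches(fake_generate, ['a', 'b', 'c'], batch_size=2)))  # ['A', 'B', 'C']
```

With the loaded model, the callable could be `lambda batch: model.generate(batch, max_length = 256)`; a suitable batch size depends on the hardware and on how long the inputs are.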

Able to infer mixed MS and EN

[10]:
string5 = 'ihate chicken, but ilike fish'
string6 = 'Higuys! I noticedsemalam & harini dahramai yangdapat cookiesni kan. So hariniinak sharesome post mortemof our first batch:'
[11]:
%%time

model.generate([string1, string2, string3, string4, string_hard, string_socialmedia,
               string5, string6], max_length = 256)
CPU times: user 6.61 s, sys: 0 ns, total: 6.61 s
Wall time: 617 ms
[11]:
['husein suka makan ayam, dia sgt risikokan',
 'dr mahathir sangat menekankan budaya budak zaman sekarang',
 'cerita tun najib razak',
 'Tun M sukakan',
 'IPOH - Ahli Dewan Undangan Negeri (ADUN) Hulu Kinta, Muhamad Ararat Varisai Mahamad menafikan mesej tular mendakwa beliau akan melompat parti menyokong UMNO membentuk kerajaan negeri di Perak. Beliau yang juga Ketua Penerangan Parti Keadilan Rakyat (PKR) Perak dalam satu mesej ringkas kepada Sinar Harian menjelaskan perkara itu tidak benar sama sekali.',
 'aq x sukalah ape yg tejadi dekat mamat tu',
 'i hate chicken, but i like fish',
 'Hi guys! I noticed semalam & hari ni dah ramai yang dapat cookies ni kan. So hari ni inak share some post mortem of our first batch:']