Segmentation HuggingFace
This tutorial is available as an IPython notebook at Malaya/example/segmentation-huggingface.
This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.
[1]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''
[2]:
import logging
logging.basicConfig(level=logging.INFO)
[3]:
%%time
import malaya
CPU times: user 3.07 s, sys: 3.53 s, total: 6.6 s
Wall time: 2.28 s
/home/husein/dev/malaya/malaya/tokenizer.py:208: FutureWarning: Possible nested set at position 3372
self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/husein/dev/malaya/malaya/tokenizer.py:208: FutureWarning: Possible nested set at position 3890
self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
A common problem in social media text is missing spaces between words. Text segmentation restores them, for example:
huseinsukamakan ayam,dia sgtrisaukan -> husein suka makan ayam, dia sgt risaukan.
drmahathir sangat menekankan budaya budakzamansekarang -> dr mahathir sangat menekankan budaya budak zaman sekarang.
ceritatunnajibrazak -> cerita tun najib razak.
TunM sukakan -> Tun M sukakan.
This module performs segmentation only:
It fixes spacing errors.
It does not correct grammar.
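Because segmentation is meant to change spacing only, one way to sanity-check an output is to compare it with its input after removing all whitespace. The sketch below is not part of Malaya; `strip_whitespace` and `only_spacing_changed` are hypothetical helper names introduced here for illustration. It assumes that any difference other than whitespace means the generative model altered characters, as seems to happen with `risaukan` becoming `risikokan` in the outputs shown later in this tutorial.

```python
def strip_whitespace(text: str) -> str:
    """Remove every whitespace character from the text."""
    return ''.join(text.split())


def only_spacing_changed(original: str, segmented: str) -> bool:
    """Return True when the two strings differ only in whitespace."""
    return strip_whitespace(original) == strip_whitespace(segmented)


# Spacing-only change: the check passes.
print(only_spacing_changed('ceritatunnajibrazak', 'cerita tun najib razak'))

# Characters were altered ('risaukan' -> 'risikokan'): the check fails.
print(only_spacing_changed(
    'huseinsukamakan ayam,dia sgtrisaukan',
    'husein suka makan ayam, dia sgt risikokan',
))
```

A check like this could be used to flag outputs for manual review instead of trusting them blindly.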
[4]:
string1 = 'huseinsukamakan ayam,dia sgtrisaukan'
string2 = 'drmahathir sangat menekankan budaya budakzamansekarang'
string3 = 'ceritatunnajibrazak'
string4 = 'TunM sukakan'
string_hard = 'IPOH-AhliDewanUndangan Negeri(ADUN) HuluKinta, MuhamadArafat Varisai Mahamadmenafikanmesejtularmendakwa beliau akan melompatparti menyokong UMNO membentuk kerajaannegeridiPerak.BeliauyangjugaKetua Penerangan Parti Keadilan Rakyat(PKR)Perak dalam satumesejringkaskepadaSinar Harian menjelaskan perkara itutidakbenarsama sekali.'
string_socialmedia = 'aqxsukalah apeyg tejadidekat mamattu'
List available HuggingFace model#
[5]:
malaya.segmentation.available_huggingface()
INFO:malaya.segmentation:tested on random generated dataset at https://f000.backblazeb2.com/file/malay-dataset/segmentation/test-set-segmentation.json
[5]:
Model | Size (MB) | WER | Suggested length |
---|---|---|---|
mesolitica/finetune-segmentation-t5-super-tiny-standard-bahasa-cased | 51.0 | 0.13456 | 256.0 |
mesolitica/finetune-segmentation-t5-tiny-standard-bahasa-cased | 139.0 | 0.13456 | 256.0 |
mesolitica/finetune-segmentation-t5-small-standard-bahasa-cased | 242.0 | 0.13456 | 256.0 |
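The WER column reports word error rate. The sketch below shows how WER is commonly computed, as the word-level edit distance between a reference and a prediction divided by the number of reference words. It assumes the standard definition; the exact evaluation script behind the table is not shown in this tutorial, and `word_edit_distance` and `wer` are hypothetical helper names.

```python
from typing import List


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance over words (insert, delete, substitute cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(
                prev[j] + 1,         # delete a reference word
                cur[j - 1] + 1,      # insert a hypothesis word
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of the hypothesis against the reference."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError('reference must contain at least one word')
    return word_edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# One missing space merges two words: 1 substitution + 1 deletion over 4 words.
print(wer('husein suka makan ayam', 'husein sukamakan ayam'))  # 0.5
```

Under this definition a single missed split costs two word errors, which is worth keeping in mind when reading the scores above.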
Load HuggingFace model#
def huggingface(model: str = 'mesolitica/finetune-segmentation-t5-tiny-standard-bahasa-cased', **kwargs):
    """
    Load HuggingFace model to segmentation.

    Parameters
    ----------
    model: str, optional (default='mesolitica/finetune-segmentation-t5-tiny-standard-bahasa-cased')
        Check available models at `malaya.segmentation.available_huggingface()`.

    Returns
    -------
    result: malaya.torch_model.huggingface.Generator
    """
[6]:
model = malaya.segmentation.huggingface()
Predict#
def generate(self, strings: List[str], **kwargs):
    """
    Generate texts from the input.

    Parameters
    ----------
    strings : List[str]
    **kwargs: vector arguments pass to huggingface `generate` method.
        Read more at https://huggingface.co/docs/transformers/main_classes/text_generation

    Returns
    -------
    result: List[str]
    """
[7]:
%%time
model.generate([string1, string2, string3, string4], max_length = 256)
CPU times: user 1.43 s, sys: 39.4 ms, total: 1.47 s
Wall time: 126 ms
[7]:
['husein suka makan ayam, dia sgt risikokan',
'dr mahathir sangat menekankan budaya budak zaman sekarang',
'cerita tun najib razak',
'Tun M sukakan']
[8]:
%%time
model.generate([string1, string2, string3, string4, string_hard, string_socialmedia], max_length = 256)
CPU times: user 6.13 s, sys: 0 ns, total: 6.13 s
Wall time: 550 ms
[8]:
['husein suka makan ayam, dia sgt risikokan',
'dr mahathir sangat menekankan budaya budak zaman sekarang',
'cerita tun najib razak',
'Tun M sukakan',
'IPOH - Ahli Dewan Undangan Negeri (ADUN) Hulu Kinta, Muhamad Ararat Varisai Mahamad menafikan mesej tular mendakwa beliau akan melompat parti menyokong UMNO membentuk kerajaan negeri di Perak. Beliau yang juga Ketua Penerangan Parti Keadilan Rakyat (PKR) Perak dalam satu mesej ringkas kepada Sinar Harian menjelaskan perkara itu tidak benar sama sekali.',
'aq x sukalah ape yg tejadi dekat mamat tu']
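The calls above pass every string in a single list. For larger inputs, it may help to send the strings in smaller batches. The sketch below is a minimal illustration, not part of Malaya: `batched` and `generate_in_batches` are hypothetical helpers, and `generate_fn` is assumed to be any callable with the same shape as `model.generate` (a list of strings in, a list of strings out, plus keyword arguments). A stand-in function is used so the example runs without loading a model.

```python
from typing import Callable, Iterator, List


def batched(strings: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `batch_size` strings."""
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    for start in range(0, len(strings), batch_size):
        yield strings[start:start + batch_size]


def generate_in_batches(
    generate_fn: Callable[..., List[str]],
    strings: List[str],
    batch_size: int = 4,
    **kwargs,
) -> List[str]:
    """Call `generate_fn` once per batch and keep the results in input order."""
    results: List[str] = []
    for batch in batched(strings, batch_size):
        results.extend(generate_fn(batch, **kwargs))
    return results


def fake_generate(batch: List[str], **kwargs) -> List[str]:
    """Stand-in for a model call, used only to demonstrate the helper."""
    return [text.upper() for text in batch]


print(generate_in_batches(fake_generate, ['a', 'b', 'c'], batch_size=2))
```

With a loaded model, the same helper could presumably be called as `generate_in_batches(model.generate, strings, batch_size=2, max_length=256)`.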
Able to infer mixed Malay (MS) and English (EN)#
[10]:
string5 = 'ihate chicken, but ilike fish'
string6 = 'Higuys! I noticedsemalam & harini dahramai yangdapat cookiesni kan. So hariniinak sharesome post mortemof our first batch:'
[11]:
%%time
model.generate([string1, string2, string3, string4, string_hard, string_socialmedia,
                string5, string6], max_length = 256)
CPU times: user 6.61 s, sys: 0 ns, total: 6.61 s
Wall time: 617 ms
[11]:
['husein suka makan ayam, dia sgt risikokan',
'dr mahathir sangat menekankan budaya budak zaman sekarang',
'cerita tun najib razak',
'Tun M sukakan',
'IPOH - Ahli Dewan Undangan Negeri (ADUN) Hulu Kinta, Muhamad Ararat Varisai Mahamad menafikan mesej tular mendakwa beliau akan melompat parti menyokong UMNO membentuk kerajaan negeri di Perak. Beliau yang juga Ketua Penerangan Parti Keadilan Rakyat (PKR) Perak dalam satu mesej ringkas kepada Sinar Harian menjelaskan perkara itu tidak benar sama sekali.',
'aq x sukalah ape yg tejadi dekat mamat tu',
'i hate chicken, but i like fish',
'Hi guys! I noticed semalam & hari ni dah ramai yang dapat cookies ni kan. So hari ni inak share some post mortem of our first batch:']