EN-MS alignment using HuggingFace#

This tutorial is available as an IPython notebook at Malaya/example/alignment-en-ms-huggingface.

TensorFlow >= 2.0 is required for the HuggingFace interface.

[1]:
%%time
import malaya
CPU times: user 5.72 s, sys: 1.08 s, total: 6.8 s
Wall time: 8.64 s

Install Transformers#

Make sure you have already installed transformers:

pip3 install transformers
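
If you are unsure about your environment, both requirements can be checked from Python; a quick sketch:

import tensorflow as tf
import transformers

print(tf.__version__)            # the HuggingFace interface needs TensorFlow >= 2.0
print(transformers.__version__)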

List available HuggingFace models#

[2]:
malaya.alignment.en_ms.available_huggingface()
[2]:
                                                                Size (MB)
mesolitica/finetuned-bert-base-multilingual-cased-noisy-en-ms         599
bert-base-multilingual-cased                                          714

Load HuggingFace model#

def huggingface(model: str = 'mesolitica/finetuned-bert-base-multilingual-cased-noisy-en-ms', **kwargs):
    """
    Load a HuggingFace BERT word-alignment model for EN-MS. Requires TensorFlow >= 2.0.

    Parameters
    ----------
    model : str, optional (default='mesolitica/finetuned-bert-base-multilingual-cased-noisy-en-ms')
        Model architecture supported. Allowed values:

        * ``'mesolitica/finetuned-bert-base-multilingual-cased-noisy-en-ms'`` - multilingual BERT fine-tuned on noisy EN-MS.
        * ``'bert-base-multilingual-cased'`` - pretrained multilingual BERT.

    Returns
    -------
    result: malaya.model.alignment.HuggingFace
    """
[3]:
model = malaya.alignment.en_ms.huggingface()
Some layers from the model checkpoint at mesolitica/finetuned-bert-base-multilingual-cased-noisy-en-ms were not used when initializing TFBertModel: ['mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFBertModel were not initialized from the model checkpoint at mesolitica/finetuned-bert-base-multilingual-cased-noisy-en-ms and are newly initialized: ['bert/pooler/dense/kernel:0', 'bert/pooler/dense/bias:0']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
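
These warnings come from transformers and, as the log itself notes, are expected here. To compare against the non-finetuned baseline from the list above, pass its name to the same loader; a minimal sketch:

baseline = malaya.alignment.en_ms.huggingface(model = 'bert-base-multilingual-cased')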

Align#

def align(
    self,
    source: List[str],
    target: List[str],
    align_layer: int = 8,
    threshold: float = 1e-3,
):
    """
    Align source and target texts using a softmax over a chosen transformer layer's outputs.

    Parameters
    ----------
    source: List[str]
    target: List[str]
    align_layer: int, optional (default=8)
        transformer layer k to use for the embedding output.
    threshold: float, optional (default=1e-3)
        minimum probability to assume as alignment.

    Returns
    -------
    result: List[List[Tuple]]
    """
[4]:
# `left` is the English source, `right` is the Malay target
right = ['Terminal 1 KKIA dilengkapi kemudahan 64 kaunter daftar masuk, 12 aero bridge selain mampu menampung 3,200 penumpang dalam satu masa.']
left = ['Terminal 1 KKIA is equipped with 64 check-in counters, 12 aero bridges and can accommodate 3,200 passengers at a time.']
[5]:
results = model.align(left, right, align_layer = 7)
results
[5]:
[[(0, 0),
  (1, 1),
  (2, 2),
  (3, 4),
  (4, 5),
  (5, 6),
  (6, 7),
  (6, 8),
  (8, 8),
  (9, 9),
  (10, 10),
  (11, 11),
  (12, 12),
  (13, 13),
  (14, 14),
  (15, 15),
  (16, 16),
  (17, 17),
  (18, 18),
  (19, 19)]]
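Each tuple is an aligned word-index pair between the two sentences. Both align_layer and threshold affect which pairs are kept; as an illustrative sketch (the value 0.1 is not tuned), raising threshold retains only high-probability alignments:

strict_results = model.align(left, right, align_layer = 7, threshold = 0.1)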
[7]:
for i in range(len(left)):
    left_splitted = left[i].split()
    right_splitted = right[i].split()
    # each k is an aligned (Malay index, English index) pair
    for k in results[i]:
        print(i, right_splitted[k[0]], left_splitted[k[1]])
0 Terminal Terminal
0 1 1
0 KKIA KKIA
0 dilengkapi equipped
0 kemudahan with
0 64 64
0 kaunter check-in
0 kaunter counters,
0 masuk, counters,
0 12 12
0 aero aero
0 bridge bridges
0 selain and
0 mampu can
0 menampung accommodate
0 3,200 3,200
0 penumpang passengers
0 dalam at
0 satu a
0 masa. time.
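
The aligned pairs can be post-processed as needed. A minimal sketch (assuming, as in the loop above, that the first index of each pair refers to the Malay sentence and the second to the English one) that collects an MS-EN word-pair dictionary:

from collections import defaultdict

# illustrative post-processing, not part of Malaya:
# map each Malay token to the set of English tokens aligned to it
word_pairs = defaultdict(set)
for i in range(len(left)):
    left_splitted = left[i].split()
    right_splitted = right[i].split()
    for ms_idx, en_idx in results[i]:
        word_pairs[right_splitted[ms_idx]].add(left_splitted[en_idx])

print(dict(word_pairs))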