MS-EN alignment using HuggingFace#
This tutorial is available as an IPython notebook at Malaya/example/alignment-ms-en-huggingface.
The HuggingFace interface requires TensorFlow >= 2.0.
[1]:
%%time
import malaya
CPU times: user 6.21 s, sys: 1.19 s, total: 7.4 s
Wall time: 8.73 s
List available HuggingFace models#
[2]:
malaya.alignment.ms_en.available_huggingface()
[2]:
| | Size (MB) |
|---|---|
| mesolitica/finetuned-bert-base-multilingual-cased-noisy-en-ms | 599 |
| bert-base-multilingual-cased | 714 |
Load HuggingFace model#
def huggingface(model: str = 'mesolitica/finetuned-bert-base-multilingual-cased-noisy-en-ms', **kwargs):
"""
Load HuggingFace BERT word-alignment model for MS-EN. Requires TensorFlow >= 2.0.
Parameters
----------
model : str, optional (default='mesolitica/finetuned-bert-base-multilingual-cased-noisy-en-ms')
Model architecture supported. Allowed values:
* ``'mesolitica/finetuned-bert-base-multilingual-cased-noisy-en-ms'`` - multilingual BERT finetuned on noisy EN-MS.
* ``'bert-base-multilingual-cased'`` - pretrained multilingual BERT.
Returns
-------
result: malaya.model.alignment.HuggingFace
"""
[3]:
model = malaya.alignment.ms_en.huggingface()
Some layers from the model checkpoint at mesolitica/finetuned-bert-base-multilingual-cased-noisy-en-ms were not used when initializing TFBertModel: ['mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFBertModel were not initialized from the model checkpoint at mesolitica/finetuned-bert-base-multilingual-cased-noisy-en-ms and are newly initialized: ['bert/pooler/dense/kernel:0', 'bert/pooler/dense/bias:0']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Align#
def align(
self,
source: List[str],
target: List[str],
align_layer: int = 8,
threshold: float = 1e-3,
):
"""
Align texts using softmax output layers.
Parameters
----------
source: List[str]
    List of source strings.
target: List[str]
    List of target strings.
align_layer: int, optional (default=8)
    transformer layer-k to choose for embedding output.
threshold: float, optional (default=1e-3)
    minimum probability to assume as alignment.
Returns
-------
result: List[List[Tuple]]
"""
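The extraction step behind ``align`` can be illustrated with a minimal sketch. This assumes the common approach of intersecting forward and backward softmax over the similarity matrix of layer-k token embeddings; it is an illustration under that assumption, not Malaya's exact implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def extract_alignment(src_emb, tgt_emb, threshold=1e-3):
    # src_emb: (src_len, dim), tgt_emb: (tgt_len, dim) layer-k embeddings.
    sim = src_emb @ tgt_emb.T                  # (src_len, tgt_len) similarities
    fwd = softmax(sim, axis=1)                 # source -> target probabilities
    bwd = softmax(sim, axis=0)                 # target -> source probabilities
    keep = (fwd > threshold) & (bwd > threshold)  # intersect both directions
    return sorted(zip(*np.nonzero(keep)))      # (source index, target index) pairs
```

Raising ``threshold`` keeps only high-confidence pairs; lowering it admits more many-to-many links.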
[4]:
left = ['Terminal 1 KKIA dilengkapi kemudahan 64 kaunter daftar masuk, 12 aero bridge selain mampu menampung 3,200 penumpang dalam satu masa.']
right = ['Terminal 1 KKIA is equipped with 64 check-in counters, 12 aero bridges and can accommodate 3,200 passengers at a time.']
[5]:
results = model.align(left, right, align_layer = 7)
results
[5]:
[[(0, 0),
(1, 1),
(2, 2),
(3, 4),
(4, 5),
(5, 6),
(6, 7),
(6, 8),
(8, 8),
(9, 9),
(10, 10),
(11, 11),
(12, 12),
(13, 13),
(14, 14),
(15, 15),
(16, 16),
(17, 17),
(18, 18),
(19, 19)]]
[6]:
for i in range(len(left)):
    left_splitted = left[i].split()
    right_splitted = right[i].split()
    for k in results[i]:
        print(i, left_splitted[k[0]], right_splitted[k[1]])
0 Terminal Terminal
0 1 1
0 KKIA KKIA
0 dilengkapi equipped
0 kemudahan with
0 64 64
0 kaunter check-in
0 kaunter counters,
0 masuk, counters,
0 12 12
0 aero aero
0 bridge bridges
0 selain and
0 mampu can
0 menampung accommodate
0 3,200 3,200
0 penumpang passengers
0 dalam at
0 satu a
0 masa. time.
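The alignment is many-to-many: a single source token may map to several target tokens (pairs ``(6, 7)`` and ``(6, 8)`` in the results both point at ``kaunter``). A small helper (hypothetical, not part of Malaya's API) can group aligned target words under each source word:

```python
from collections import defaultdict

def group_alignment(source_sentence, target_sentence, pairs):
    # Collect every aligned target word under its source word.
    src = source_sentence.split()
    tgt = target_sentence.split()
    grouped = defaultdict(list)
    for s, t in pairs:
        grouped[src[s]].append(tgt[t])
    return dict(grouped)

pairs = [(6, 7), (6, 8), (11, 11)]
left = 'Terminal 1 KKIA dilengkapi kemudahan 64 kaunter daftar masuk, 12 aero bridge selain mampu menampung 3,200 penumpang dalam satu masa.'
right = 'Terminal 1 KKIA is equipped with 64 check-in counters, 12 aero bridges and can accommodate 3,200 passengers at a time.'
print(group_alignment(left, right, pairs))
# {'kaunter': ['check-in', 'counters,'], 'bridge': ['bridges']}
```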
[ ]: