MS-EN alignment using Eflomal
This tutorial is available as an IPython notebook at Malaya/example/alignment-ms-en-eflomal.
[1]:
%%time
import malaya
CPU times: user 5.44 s, sys: 1.07 s, total: 6.51 s
Wall time: 8.71 s
What is Eflomal?#
Eflomal is a word alignment tool based on probabilistic models, originally from https://github.com/robertostling/eflomal.
Installation#
If you are using Linux or Windows, you need to compile the binary from source, https://github.com/robertostling/eflomal:
git clone https://github.com/robertostling/eflomal && cd eflomal
make
sudo make install
python3 setup.py install
You should see the eflomal binary inside your installation directory; the default is /usr/bin.
Installation for Mac#
If you are using Mac, you need to compile the binary from the Mac-specific source, https://github.com/huseinzol05/maceflomal:
git clone https://github.com/huseinzol05/maceflomal && cd maceflomal
export CC=/usr/local/bin/gcc-11
make
sudo make install
python3 setup.py install
You should see the eflomal binary inside your installation directory; the default is /usr/bin.
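Before loading the model, it can help to confirm that the compiled binary is actually reachable. The following is a minimal sketch using only the Python standard library. The helper name find_eflomal_binary is hypothetical and not part of Malaya, and the sketch assumes the binary is named eflomal and sits either on PATH or in /usr/bin as described above.

```python
import os
import shutil
from typing import Optional


def find_eflomal_binary(name: str = 'eflomal', fallback_dir: str = '/usr/bin') -> Optional[str]:
    """Return the path of the eflomal binary, or None if it cannot be found.

    Hypothetical helper for illustration, not part of Malaya.
    """
    # First look on PATH.
    path = shutil.which(name)
    if path is not None:
        return path
    # Then try the default installation directory mentioned in this tutorial.
    candidate = os.path.join(fallback_dir, name)
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    return None


binary = find_eflomal_binary()
print(binary if binary else 'eflomal binary not found, check the installation steps')
```

If the helper reports that nothing was found, re-running the make and sudo make install steps is the first thing to check.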
Load Eflomal model#
def eflomal(preprocessing_func: Callable = None, **kwargs):
    """
    Load eflomal word alignment for MS-EN. Model size is around 300MB.
    Parameters
    ----------
    preprocessing_func: Callable, optional (default=None)
        preprocessing function to call while loading the prior file.
        Using `malaya.text.function.replace_punct` can reduce memory usage by ~30%.
    Returns
    -------
    result: malaya.model.alignment.Eflomal
    """
The Eflomal model interface inside Malaya has been optimized using defaultdict, reducing the average alignment time from ~4 seconds with https://github.com/robertostling/eflomal/blob/master/align.py to ~200 ms.
[2]:
model = malaya.alignment.ms_en.eflomal()
Align#
def align(
    self,
    source: List[str],
    target: List[str],
    model: int = 3,
    score_model: int = 0,
    n_samplers: int = 3,
    length: float = 1.0,
    null_prior: float = 0.2,
    lowercase: bool = True,
    debug: bool = False,
    **kwargs,
):
    """
    Align text using eflomal, https://github.com/robertostling/eflomal/blob/master/align.py
    Parameters
    ----------
    source: List[str]
    target: List[str]
    model: int, optional (default=3)
        Model (1 = IBM1, 2 = IBM1+HMM, 3 = IBM1+HMM+fertility).
    score_model: int, optional (default=0)
        (1 = IBM1, 2 = IBM1+HMM, 3 = IBM1+HMM+fertility).
    n_samplers: int, optional (default=3)
        Number of independent samplers to run.
    length: float, optional (default=1.0)
        Relative number of sampling iterations.
    null_prior: float, optional (default=0.2)
        Prior probability of NULL alignment.
    lowercase: bool, optional (default=True)
        lowercase the text when looking up priors.
    debug: bool, optional (default=False)
        debug the `eflomal` binary.
    Returns
    -------
    result: Dict[str, List[List[Tuple[int, int]]]]
        keys are 'forward' and 'reverse'.
    """
[3]:
left = ['Terminal 1 KKIA dilengkapi kemudahan 64 kaunter daftar masuk, 12 aero bridge selain mampu menampung 3,200 penumpang dalam satu masa.']
right = ['Terminal 1 KKIA is equipped with 64 check-in counters, 12 aero bridges and can accommodate 3,200 passengers at a time.']
[4]:
results = model.align(left, right)
results
[4]:
{'forward': [[(0, 0),
(1, 1),
(2, 2),
(3, 3),
(3, 4),
(4, 5),
(5, 6),
(7, 7),
(6, 8),
(9, 9),
(10, 10),
(11, 11),
(12, 12),
(13, 13),
(14, 14),
(15, 15),
(16, 16),
(17, 17),
(18, 18),
(19, 19)]],
'reverse': [[(0, 0),
(1, 1),
(2, 2),
(3, 3),
(4, 5),
(5, 6),
(6, 8),
(7, 7),
(9, 9),
(10, 10),
(11, 11),
(12, 12),
(13, 13),
(14, 14),
(15, 15),
(16, 16),
(17, 17),
(18, 18),
(19, 19)]]}
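The forward and reverse outputs can differ slightly; in the result above, the pair (3, 4) appears only in forward. One common way to combine two directional alignments is to take their intersection (fewer, more reliable links) or their union (more links). The sketch below assumes both directions use (source index, target index) tuples, as the output above suggests. The function name symmetrize is hypothetical and not a Malaya API, and the sample dictionary is a shortened copy of the output above.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def symmetrize(results: Dict[str, List[List[Pair]]], how: str = 'intersection') -> List[List[Pair]]:
    """Combine forward and reverse alignments sentence by sentence.

    Hypothetical helper for illustration, not part of Malaya.
    """
    merged = []
    for forward, reverse in zip(results['forward'], results['reverse']):
        f, r = set(forward), set(reverse)
        if how == 'intersection':
            combined = f & r  # links found in both directions
        elif how == 'union':
            combined = f | r  # links found in either direction
        else:
            raise ValueError("how must be 'intersection' or 'union'")
        merged.append(sorted(combined))
    return merged


# Shortened copy of the alignment output shown above.
sample = {
    'forward': [[(0, 0), (3, 3), (3, 4), (4, 5), (7, 7), (6, 8)]],
    'reverse': [[(0, 0), (3, 3), (4, 5), (6, 8), (7, 7)]],
}

print(symmetrize(sample, 'intersection'))
print(symmetrize(sample, 'union'))
```

With this sample, the intersection drops (3, 4) while the union keeps it.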
[5]:
for i in range(len(left)):
    left_splitted = left[i].split()
    right_splitted = right[i].split()
    for k in results['forward'][i]:
        # k is (source index, target index)
        print(i, left_splitted[k[0]], right_splitted[k[1]])
0 Terminal Terminal
0 1 1
0 KKIA KKIA
0 dilengkapi is
0 dilengkapi equipped
0 kemudahan with
0 64 64
0 daftar check-in
0 kaunter counters,
0 12 12
0 aero aero
0 bridge bridges
0 selain and
0 mampu can
0 menampung accommodate
0 3,200 3,200
0 penumpang passengers
0 dalam at
0 satu a
0 masa. time.
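Word alignments are often exchanged between tools as plain text in the so-called Pharaoh format, where each link is written as i-j (source index, target index) and links are separated by spaces. The sketch below converts between that format and the list-of-tuples structure used in this tutorial. The function names to_pharaoh and from_pharaoh are hypothetical and not part of Malaya.

```python
from typing import List, Tuple

Pair = Tuple[int, int]


def to_pharaoh(pairs: List[Pair]) -> str:
    """Serialize (source index, target index) pairs as 'i-j' tokens."""
    return ' '.join(f'{s}-{t}' for s, t in pairs)


def from_pharaoh(line: str) -> List[Pair]:
    """Parse a line of 'i-j' tokens back into integer pairs."""
    pairs = []
    for item in line.split():
        s, t = item.split('-')
        pairs.append((int(s), int(t)))
    return pairs


# First few forward links from the example in this tutorial.
links = [(0, 0), (1, 1), (2, 2), (3, 3), (3, 4)]
line = to_pharaoh(links)
print(line)
print(from_pharaoh(line) == links)
```

Storing one such line per sentence pair keeps alignment files easy to inspect and to feed into other tools.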