MS-EN alignment using Eflomal

This tutorial is available as an IPython notebook at Malaya/example/alignment-ms-en-eflomal.

[1]:
%%time
import malaya
CPU times: user 5.44 s, sys: 1.07 s, total: 6.51 s
Wall time: 8.71 s

What is Eflomal?

Eflomal, originally from https://github.com/robertostling/eflomal, is a probability-based tool for word alignment.

Installation

If you are using Linux or Windows, you need to compile the binary from source at https://github.com/robertostling/eflomal:

git clone https://github.com/robertostling/eflomal && cd eflomal
make
sudo make install
python3 setup.py install

You should then see the eflomal binary inside your installation directory (the default is /usr/bin).
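A quick way to confirm the install step worked is to check whether the binary can be found from Python. This is a minimal sketch using only the standard library. It assumes the binary is named eflomal and that the installation directory is on your PATH. The helper name find_eflomal is made up for illustration and is not part of Malaya or Eflomal.

```python
import shutil
from typing import Optional


def find_eflomal(binary_name: str = "eflomal") -> Optional[str]:
    """Return the full path of the binary if it is on PATH, else None.

    `find_eflomal` is a hypothetical helper for this tutorial only.
    """
    return shutil.which(binary_name)


path = find_eflomal()
if path is None:
    print("eflomal not found on PATH; re-check the `make install` step")
else:
    print(f"eflomal found at {path}")
```

If the binary is reported as missing even though the build succeeded, the installation directory is probably not on PATH for the Python process.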

Installation for Mac

If you are using Mac, you need to compile the binary from source at https://github.com/huseinzol05/maceflomal:

git clone https://github.com/huseinzol05/maceflomal && cd maceflomal
export CC=/usr/local/bin/gcc-11
make
sudo make install
python3 setup.py install

You should then see the eflomal binary inside your installation directory (the default is /usr/bin).

Load Eflomal model

def eflomal(preprocessing_func: Callable = None, **kwargs):
    """
    load eflomal word alignment for MS-EN. Model size around ~300MB.

    Parameters
    ----------
    preprocessing_func: Callable, optional (default=None)
        preprocessing function to call during loading prior file.
        Using `malaya.text.function.replace_punct` able to reduce ~30% of memory usage.

    Returns
    -------
    result: malaya.model.alignment.Eflomal
    """

The Eflomal model interface inside Malaya has been optimized using defaultdict, reducing the average alignment time from ~4 seconds with https://github.com/robertostling/eflomal/blob/master/align.py to ~200 ms.

[2]:
model = malaya.alignment.ms_en.eflomal()

Align

def align(
    self,
    source: List[str],
    target: List[str],
    model: int = 3,
    score_model: int = 0,
    n_samplers: int = 3,
    length: float = 1.0,
    null_prior: float = 0.2,
    lowercase: bool = True,
    debug: bool = False,
    **kwargs,
):
    """
    align text using eflomal, https://github.com/robertostling/eflomal/blob/master/align.py

    Parameters
    ----------
    source: List[str]
    target: List[str]
    model: int, optional (default=3)
        Model (1 = IBM1, 2 = IBM1+HMM, 3 = IBM1+HMM+fertility).
    score_model: int, optional (default=0)
        (1 = IBM1, 2 = IBM1+HMM, 3 = IBM1+HMM+fertility).
    n_samplers: int, optional (default=3)
        Number of independent samplers to run.
    length: float, optional (default=1.0)
        Relative number of sampling iterations.
    null_prior: float, optional (default=0.2)
        Prior probability of NULL alignment.
    lowercase: bool, optional (default=True)
        lowercase during searching priors.
    debug: bool, optional (default=False)
        debug `eflomal` binary.

    Returns
    -------
    result: Dict[List[List[Tuple]]]
    """
[3]:
left = ['Terminal 1 KKIA dilengkapi kemudahan 64 kaunter daftar masuk, 12 aero bridge selain mampu menampung 3,200 penumpang dalam satu masa.']
right = ['Terminal 1 KKIA is equipped with 64 check-in counters, 12 aero bridges and can accommodate 3,200 passengers at a time.']
[4]:
results = model.align(left, right)
results
[4]:
{'forward': [[(0, 0),
   (1, 1),
   (2, 2),
   (3, 3),
   (3, 4),
   (4, 5),
   (5, 6),
   (7, 7),
   (6, 8),
   (9, 9),
   (10, 10),
   (11, 11),
   (12, 12),
   (13, 13),
   (14, 14),
   (15, 15),
   (16, 16),
   (17, 17),
   (18, 18),
   (19, 19)]],
 'reverse': [[(0, 0),
   (1, 1),
   (2, 2),
   (3, 3),
   (4, 5),
   (5, 6),
   (6, 8),
   (7, 7),
   (9, 9),
   (10, 10),
   (11, 11),
   (12, 12),
   (13, 13),
   (14, 14),
   (15, 15),
   (16, 16),
   (17, 17),
   (18, 18),
   (19, 19)]]}
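The forward and reverse alignments above differ slightly, and a common way to combine two directional alignments is to take their intersection (higher precision) or their union (higher recall). The sketch below does this with plain Python sets. It assumes both lists use the same (source index, target index) ordering, which is how the output above appears to be laid out. The symmetrize helper is hypothetical and not part of Malaya, and the pairs are copied by hand from the output above.

```python
from typing import List, Tuple

Pairs = List[Tuple[int, int]]


def symmetrize(forward: Pairs, reverse: Pairs) -> Tuple[Pairs, Pairs]:
    """Return (intersection, union) of two alignments as sorted pair lists."""
    f, r = set(forward), set(reverse)
    return sorted(f & r), sorted(f | r)


# Pairs copied from the output above for the single example sentence.
diagonal = [(i, i) for i in range(9, 20)]
forward = [(0, 0), (1, 1), (2, 2), (3, 3), (3, 4),
           (4, 5), (5, 6), (7, 7), (6, 8)] + diagonal
reverse = [(0, 0), (1, 1), (2, 2), (3, 3),
           (4, 5), (5, 6), (6, 8), (7, 7)] + diagonal

intersection, union = symmetrize(forward, reverse)

# Links proposed by only one direction are the ones worth inspecting.
only_forward = sorted(set(forward) - set(reverse))
print(len(intersection), len(union), only_forward)  # 19 20 [(3, 4)]
```

Here the two directions disagree only on the link (3, 4), which attaches "equipped" to "dilengkapi" in the forward direction.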
[5]:
for i in range(len(left)):
    left_splitted = left[i].split()
    right_splitted = right[i].split()
    # each pair k is (source index, target index)
    for k in results['forward'][i]:
        print(i, left_splitted[k[0]], right_splitted[k[1]])
0 Terminal Terminal
0 1 1
0 KKIA KKIA
0 dilengkapi is
0 dilengkapi equipped
0 kemudahan with
0 64 64
0 daftar check-in
0 kaunter counters,
0 12 12
0 aero aero
0 bridge bridges
0 selain and
0 mampu can
0 menampung accommodate
0 3,200 3,200
0 penumpang passengers
0 dalam at
0 satu a
0 masa. time.
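Not every token has to receive a link, so it can be useful to list the tokens the aligner left unaligned. The following sketch works on the example sentence pair and the forward pairs shown earlier, treating each pair as (source index, target index) over whitespace-split tokens. The unaligned_tokens helper is hypothetical and not part of Malaya.

```python
from typing import List, Tuple


def unaligned_tokens(
    source: str, target: str, pairs: List[Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Return the source and target tokens that appear in no alignment pair."""
    src_tokens, tgt_tokens = source.split(), target.split()
    aligned_src = {s for s, _ in pairs}
    aligned_tgt = {t for _, t in pairs}
    missing_src = [w for i, w in enumerate(src_tokens) if i not in aligned_src]
    missing_tgt = [w for i, w in enumerate(tgt_tokens) if i not in aligned_tgt]
    return missing_src, missing_tgt


left_text = ('Terminal 1 KKIA dilengkapi kemudahan 64 kaunter daftar masuk, '
             '12 aero bridge selain mampu menampung 3,200 penumpang dalam satu masa.')
right_text = ('Terminal 1 KKIA is equipped with 64 check-in counters, 12 aero bridges '
              'and can accommodate 3,200 passengers at a time.')

# Forward pairs copied by hand from the alignment output shown earlier.
forward_pairs = [(0, 0), (1, 1), (2, 2), (3, 3), (3, 4), (4, 5), (5, 6),
                 (7, 7), (6, 8)] + [(i, i) for i in range(9, 20)]

missing_src, missing_tgt = unaligned_tokens(left_text, right_text, forward_pairs)
print(missing_src, missing_tgt)  # ['masuk,'] []
```

In this example the only unaligned source token is "masuk,", since "daftar masuk" maps to the single English token "check-in".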