NEWS 2018: The Seventh Named Entities Workshop

Baseline Transliteration Model using Sequitur

Sequitur is a data-driven, state-of-the-art, open-source tool built on the joint-sequence model for grapheme-to-phoneme conversion. For the baseline, the same model is used to map source-language characters to target-language characters.

Step 1

A stable Sequitur version can be downloaded from this link. The source code can be cloned from the GitHub repository. (Note that Sequitur’s behaviour on a Windows system has been found to be problematic and inconsistent. It is advised to use a Linux machine to run Sequitur.)

Step 2

Run the attached code with the names of the training and development XML files. This extracts the training and validation word pairs from the respective files and stores them in a Sequitur-compatible format. If the training file is called train.xml and the development file is called dev.xml, the following command creates the Sequitur-compatible representation:

python3 getDataFromXml.py train.xml dev.xml

Each line in the training and validation partitions consists of the orthographic form of the word in the source language, followed by its space-separated target transliteration. If the code is run on the English to Hebrew training and development partitions, the attached folder will be created.
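The line format described above can be illustrated with a minimal sketch. It is not the attached getDataFromXml.py script. The tag names Name, SourceName and TargetName are assumptions about the XML layout, and "space-separated" is read here as one target character per token; adjust both if your files differ.

```python
import xml.etree.ElementTree as ET


def xml_to_pairs(xml_text):
    """Return (source, target) pairs from an XML string.

    The tag names Name, SourceName and TargetName are assumed,
    not taken from the attached script.
    """
    root = ET.fromstring(xml_text)
    pairs = []
    for name in root.iter("Name"):
        source = name.findtext("SourceName", default="").strip()
        # A source name may have more than one target transliteration.
        for target in name.iter("TargetName"):
            text = (target.text or "").strip()
            if source and text:
                pairs.append((source, text))
    return pairs


def to_sequitur_line(source, target):
    """Source word first, then the target characters separated by spaces."""
    return source + " " + " ".join(ch for ch in target if not ch.isspace())


# Hypothetical one-entry corpus used only for illustration.
sample = """<TransliterationCorpus>
  <Name ID="1">
    <SourceName>david</SourceName>
    <TargetName ID="1">דוד</TargetName>
  </Name>
</TransliterationCorpus>"""

for src, tgt in xml_to_pairs(sample):
    print(to_sequitur_line(src, tgt))
```

Each printed line can be written unchanged to corpus/train.txt or corpus/dev.txt.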

Step 3

Once the training and development corpora are ready, we can run the Sequitur commands to train the transliteration model. Assuming that the g2p/ and corpus/ folders are in the same directory, the following sequence of commands can be run:

python lib64/python/g2p.py --encoding=UTF-8 --train=corpus/train.txt --devel=corpus/dev.txt \
--write-model=corpus/model1

python lib64/python/g2p.py --encoding=UTF-8 --train=corpus/train.txt --devel=corpus/dev.txt \
--model=corpus/model1 --ramp-up --write-model=corpus/model2

python lib64/python/g2p.py --encoding=UTF-8 --train=corpus/train.txt --devel=corpus/dev.txt \
--model=corpus/model2 --ramp-up --write-model=corpus/model3

python lib64/python/g2p.py --encoding=UTF-8 --train=corpus/train.txt --devel=corpus/dev.txt \
--model=corpus/model3 --ramp-up --write-model=corpus/model4

python lib64/python/g2p.py --encoding=UTF-8 --train=corpus/train.txt --devel=corpus/dev.txt \
--model=corpus/model4 --ramp-up --write-model=corpus/model5

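The first command trains an initial model, and each subsequent run with --ramp-up raises the model order by one. The last command therefore creates a 5-gram model (corpus/model5) for the given bilingual corpus.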
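The five training commands follow a regular pattern, so they can also be generated in a loop. The sketch below is one way to do that, assuming the same paths and options as the commands above; the function names training_commands and train are hypothetical helpers, not part of Sequitur.

```python
import subprocess

# Common prefix shared by every call, as used in the commands above.
G2P = ["python", "lib64/python/g2p.py", "--encoding=UTF-8"]


def training_commands(order=5, corpus="corpus"):
    """Build the g2p.py argument lists for model1 .. model<order>."""
    data = ["--train=%s/train.txt" % corpus, "--devel=%s/dev.txt" % corpus]
    commands = []
    for n in range(1, order + 1):
        cmd = G2P + data
        if n > 1:
            # Start from the previous model and raise its order by one.
            cmd += ["--model=%s/model%d" % (corpus, n - 1), "--ramp-up"]
        cmd.append("--write-model=%s/model%d" % (corpus, n))
        commands.append(cmd)
    return commands


def train(order=5, corpus="corpus"):
    """Run the commands in sequence, stopping at the first failure."""
    for cmd in training_commands(order, corpus):
        subprocess.run(cmd, check=True)


# Print the commands for inspection; call train() to actually run them.
for cmd in training_commands():
    print(" ".join(cmd))
```

Printing the commands first makes it easy to check them against the sequence above before starting a long training run.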

Step 4

The trained model can be evaluated on the same development set to check the accuracy of its transliterations, using the following command:

python lib64/python/g2p.py --encoding=UTF-8 --model=corpus/model5 --test=corpus/dev.txt

The picture below shows the output to expect if the model has been created correctly for the English to Hebrew dataset.

[Image: expected g2p.py test output for the English to Hebrew dataset]
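To make the notion of accuracy in this step concrete, the sketch below computes a word-level and a symbol-level error rate from reference and predicted transliterations. It is an independent illustration and does not reproduce Sequitur's own report; the function names and the toy data are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            # Deletion, insertion, or substitution/match.
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def error_rates(references, hypotheses):
    """Return (string error rate, symbol error rate) for parallel lists."""
    pairs = list(zip(references, hypotheses))
    wrong = sum(1 for r, h in pairs if r != h)
    edits = sum(edit_distance(r, h) for r, h in pairs)
    symbols = sum(len(r) for r in references)
    return wrong / len(references), edits / symbols


# Toy data: two words, the second with one wrong symbol out of five in total.
refs = [["a", "b", "c"], ["d", "e"]]
hyps = [["a", "b", "c"], ["d", "x"]]
print(error_rates(refs, hyps))  # (0.5, 0.2)
```

The string error rate counts a word as wrong if any symbol differs, while the symbol error rate gives partial credit for near misses.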

Step 5

The trained model can be applied to any file that contains only source words, one per line, using the following command:

python lib64/python/g2p.py --encoding=UTF-8 --model=corpus/model5 --apply=corpus/test.txt
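If no source-only file is at hand, one can be derived from a file in the training format by keeping the first token of each line. The sketch below assumes the line format described in Step 2 (source word first, then the target symbols); the helper names source_words and write_source_file are hypothetical.

```python
def source_words(lines):
    """Keep only the source word (first token) of each non-empty line."""
    words = []
    for line in lines:
        tokens = line.split()
        if tokens:
            words.append(tokens[0])
    return words


def write_source_file(pairs_path, out_path):
    """Write one source word per line, e.g. corpus/dev.txt to corpus/test.txt."""
    with open(pairs_path, encoding="utf-8") as src:
        words = source_words(src)
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(words) + "\n")


# Toy lines in the training format, used only for illustration.
example = ["david d a v", "", "sara s r"]
print(source_words(example))  # ['david', 'sara']
```

Calling write_source_file("corpus/dev.txt", "corpus/test.txt") would produce an input suitable for the command above.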